In the digital era, the quality of website content plays a pivotal role in creating a positive user experience and maintaining a professional online presence. Grammatical and spelling errors not only undermine credibility but can also lead to misunderstandings. In this blog, we will explore the process of creating a powerful Grammatical and Spelling Website Checker using Python.
By leveraging Python’s versatility and a set of essential libraries, we can build a tool that not only crawls through a website but also assesses the grammatical and spelling correctness of its content. This can be particularly beneficial for businesses, bloggers, content creators and marketers striving for flawless web communication.
Prerequisites:
Before diving into the code, make sure you have the following installed on your system:
- Python and an IDE or text editor of your choice: you can download Python from Python’s official website.
- Java (JDK): download it from Java Downloads | Oracle. The language_tool_python library runs a local LanguageTool server, which requires a Java runtime.
Step 1: Setting up the Environment
Install the necessary Python libraries for web crawling, text processing, and language checking:
pip install beautifulsoup4 requests openpyxl language_tool_python
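If the installation succeeded, you can run a quick sanity check before writing any crawler code. The snippet below is a minimal illustration (not part of the final script): it starts a local LanguageTool instance and checks a deliberately misspelled sentence, which is also where a missing Java runtime will show up as an error.
# sanity_check.py - optional: verify that Java and LanguageTool are available
import language_tool_python
tool = language_tool_python.LanguageTool('en-US')
matches = tool.check("This sentense has a speling mistake.")
for match in matches:
    print(match.ruleId, match.message)
tool.close()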
This section imports the necessary libraries and modules for web crawling (requests, BeautifulSoup), handling Excel files (openpyxl), working with URLs (urllib.parse), disabling SSL warnings (urllib3), and checking grammar and spelling (language_tool_python). The line sys.setrecursionlimit(3000) raises the recursion limit to accommodate the deep recursion that can occur while crawling.
# website_checker.py
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys
import language_tool_python
# Increase the recursion limit
sys.setrecursionlimit(3000)
Step 2: Defining Initial Variables
Create a Python script (let’s call it website_checker.py) and start writing the code. You can use any text editor or integrated development environment (IDE) of your choice.
# Define the website URL to crawl
url = "https://www.example.com" # Replace with your target website URL
main_url = url
main_domain = urlparse(main_url).netloc
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
skip_patterns = [
    "https://example.com/skip",  # Add more URL patterns to skip here
]
# File path for storing the data
file_path = "website_checker_output.xlsx"
# Set to store visited URLs
visited_urls = set()
# Initialize LanguageTool
language_tool = language_tool_python.LanguageTool('en-US')
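As a quick illustration (not part of the script itself), urlparse splits a URL into components; the crawler compares the netloc part to decide whether a link belongs to the same domain, and uses startswith to match the skip patterns:
# Illustration only: how the domain and skip-pattern checks behave
from urllib.parse import urlparse
print(urlparse("https://www.example.com/blog/post").netloc)  # prints "www.example.com"
print("https://example.com/skip/page".startswith("https://example.com/skip"))  # prints True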
Step 3: Creating Excel Workbook and Worksheet
We create a new Excel workbook using openpyxl, grab the active worksheet (ws_internal), and give it a descriptive title. This is where we’ll store the results.
# Create a new workbook
wb = Workbook()
ws_internal = wb.active
ws_internal.title = 'Internal URLs'
Step 4: Defining Headers for the Excel File
We define the column headers for our Excel file, which include details like URL, Rule ID, Message, and so on. These headers will help organize the information in the Excel sheet.
# Define the headers for the Excel file
headers_internal = [
    'URL',
    'Previous Linking URL',
    'Rule ID',
    'Message',
    'Replacements',
    'Offset',
    'Error Length',
    'Category',
    'Rule Issue Type',
    'Sentence',
    'Context',
]
# Write the headers to the internal worksheet
ws_internal.append(headers_internal)
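To make the layout concrete, here is a hypothetical example of the kind of row the crawler will append under these headers (all values below are made up for illustration; real rule IDs and categories come from LanguageTool):
# Hypothetical example row, matching the headers above (illustrative values only)
example_row = [
    'https://www.example.com/blog/post',  # URL
    'https://www.example.com/blog',       # Previous Linking URL
    'MORFOLOGIK_RULE_EN_US',              # Rule ID
    'Possible spelling mistake found.',   # Message
    'sentence',                           # Replacements
    128,                                  # Offset
    8,                                    # Error Length
    'TYPOS',                              # Category
    'misspelling',                        # Rule Issue Type
    'This sentense has a mistake.',       # Sentence
    '...This sentense has a mistake...',  # Context
]
# ws_internal.append(example_row) would write it as a single spreadsheet row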
Step 5: Signal Handler for Interrupt Signal
This section sets up a signal handler to capture the interrupt signal (Ctrl+C). It ensures that if the user interrupts the program, the Excel file will be saved before exiting.
This covers the initial setup and preparation for building the Grammatical and Spelling Website Checker. The subsequent components will delve into the core functionalities such as crawling the website, checking grammar and spelling, and saving the results.
# Signal handler for interrupt signal (Ctrl+C)
def interrupt_handler(signum, frame):
    # Save the Excel file before exiting
    wb.save(file_path)
    print('Excel file saved to:', file_path)
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGINT, interrupt_handler)
Step 6: Defining the Crawler Function
This function, crawl_website, is the core of the script. It performs the following actions:
- Checks if the URL has been visited or matches skip patterns.
- Verifies if the URL belongs to the same domain.
- Sends a GET request to the URL, retrieves the HTML content, and parses it using BeautifulSoup.
- Checks if the URL is internal and meets the specified pattern.
- Extracts the text content from the page and checks for grammar and spelling errors using LanguageTool.
- Appends the error information to the Excel worksheet and saves the file.
- Recursively crawls the internal links on the page.
# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
    if url in visited_urls:
        return
    for pattern in skip_patterns:
        if url.startswith(pattern):
            print("Skipping URL:", url)
            return
    current_domain = urlparse(url).netloc
    if main_domain != current_domain:
        is_internal = False
        return
    visited_urls.add(url)
    try:
        response = requests.get(url, allow_redirects=False, verify=False)
    except requests.exceptions.InvalidSchema:
        return
    soup = BeautifulSoup(response.content, 'html.parser')
    status_code = response.status_code
    crawl_timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    if is_internal and url.startswith(main_url + '/xxx/xxx/'):  # Adjust the pattern accordingly
        worksheet = ws_internal
        page_content = soup.get_text()
        # Check grammar and spelling errors
        grammar_errors = language_tool.check(page_content)
        # Append rows for each flattened error information
        for error in grammar_errors:
            row_data = [
                url,
                previous_url,
                error.ruleId,
                error.message,
                ', '.join(error.replacements),
                error.offset,
                error.errorLength,
                error.category,
                error.ruleIssueType,
                error.sentence,
                error.context,
            ]
            worksheet.append(row_data)
        # Save the Excel file after each URL
        wb.save(file_path)
    if is_internal:
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href != '/' and not href.startswith('#'):
                if href.startswith('/'):
                    href = urljoin(main_url, href)
                if href not in visited_urls:
                    print(f"Crawling URL: {href}")
                    crawl_website(href, url, is_internal)
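A note on the recursive design: every internal link triggers another call to crawl_website, which is why the recursion limit was raised at the top of the script. For very large sites you may prefer an iterative traversal that avoids the limit entirely. The sketch below shows the same idea with an explicit queue; it reuses the helpers already defined in the script and is an alternative shape, not the approach used above:
# Alternative sketch: breadth-first crawl with an explicit queue instead of recursion
from collections import deque
def crawl_website_iterative(start_url):
    queue = deque([(start_url, '')])  # (url, previous_url) pairs still to visit
    while queue:
        current_url, previous_url = queue.popleft()
        if current_url in visited_urls:
            continue
        visited_urls.add(current_url)
        try:
            response = requests.get(current_url, allow_redirects=False, verify=False)
        except requests.exceptions.RequestException:
            continue
        soup = BeautifulSoup(response.content, 'html.parser')
        # ... same grammar-checking and Excel-writing logic as in crawl_website ...
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('/'):
                queue.append((urljoin(main_url, href), current_url))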
Full Code
Here’s the full code for your reference:
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys
import language_tool_python
# Increase the recursion limit
sys.setrecursionlimit(3000)
# Define the website URL to crawl
url = "https://www.example.com" # Replace with your target website URL
main_url = url
main_domain = urlparse(main_url).netloc
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
skip_patterns = [
    "https://example.com/skip",  # Add more URL patterns to skip here
]
# File path for storing the data
file_path = "website_checker_output.xlsx"
# Remove the existing file if it exists
if os.path.isfile(file_path):
    os.remove(file_path)
# Create a new workbook
wb = Workbook()
ws_internal = wb.active
ws_internal.title = 'Internal URLs'
# Define the headers for the Excel file
headers_internal = [
    'URL',
    'Previous Linking URL',
    'Rule ID',
    'Message',
    'Replacements',
    'Offset',
    'Error Length',
    'Category',
    'Rule Issue Type',
    'Sentence',
    'Context',
]
# Write the headers to the internal worksheet
ws_internal.append(headers_internal)
# Set to store visited URLs
visited_urls = set()
# Initialize LanguageTool
language_tool = language_tool_python.LanguageTool('en-US')
# Signal handler for interrupt signal (Ctrl+C)
def interrupt_handler(signum, frame):
    # Save the Excel file before exiting
    wb.save(file_path)
    print('Excel file saved to:', file_path)
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGINT, interrupt_handler)
# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
    if url in visited_urls:
        return
    for pattern in skip_patterns:
        if url.startswith(pattern):
            print("Skipping URL:", url)
            return
    current_domain = urlparse(url).netloc
    if main_domain != current_domain:
        is_internal = False
        return
    visited_urls.add(url)
    try:
        response = requests.get(url, allow_redirects=False, verify=False)
    except requests.exceptions.InvalidSchema:
        return
    soup = BeautifulSoup(response.content, 'html.parser')
    status_code = response.status_code
    crawl_timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    if is_internal and url.startswith(main_url + '/xxx/xxx/'):  # Adjust the pattern accordingly
        worksheet = ws_internal
        page_content = soup.get_text()
        # Check grammar and spelling errors
        grammar_errors = language_tool.check(page_content)
        # Append rows for each flattened error information
        for error in grammar_errors:
            row_data = [
                url,
                previous_url,
                error.ruleId,
                error.message,
                ', '.join(error.replacements),
                error.offset,
                error.errorLength,
                error.category,
                error.ruleIssueType,
                error.sentence,
                error.context,
            ]
            worksheet.append(row_data)
        # Save the Excel file after each URL
        wb.save(file_path)
    if is_internal:
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href != '/' and not href.startswith('#'):
                if href.startswith('/'):
                    href = urljoin(main_url, href)
                if href not in visited_urls:
                    print(f"Crawling URL: {href}")
                    crawl_website(href, url, is_internal)
# Call the crawl_website function for the main URL
try:
    crawl_website(url)
except Exception as e:
    print(f"An error occurred during the crawl: {str(e)}")

# Save the Excel file one final time
wb.save(file_path)
print('Excel file saved to:', file_path)
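To run the checker, set url to your own site, adjust the /xxx/xxx/ path filter to match the section you want to check, and execute the script from a terminal. Because the spreadsheet is saved after every crawled URL (and again on Ctrl+C), you can stop the crawl at any point without losing results:
python website_checker.py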
Conclusion
By combining web crawling, natural language processing, and data storage capabilities, this Python script provides a valuable resource for maintaining high-quality, error-free textual content on websites. The script can be further extended with additional features, such as incorporating more advanced language processing tools or scanning only specific parts of each page, such as the <body> tag.
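As a minimal sketch of that last idea (assuming the pages have a <body> tag and you want to ignore script and style content), the page_content line inside crawl_website could be swapped for a body-only version:
# Sketch: check only the visible text inside <body>, ignoring scripts and styles
body = soup.find('body')
if body is not None:
    for tag in body(['script', 'style', 'noscript']):
        tag.decompose()  # drop non-visible elements before extracting text
    page_content = body.get_text(separator=' ', strip=True)
    grammar_errors = language_tool.check(page_content)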
AIHelperHub is an expert in AI-focused SEO/digital marketing automation and Python-based SEO automation, and enjoys teaching others how to harness both to future-proof their digital marketing. AIHelperHub provides comprehensive guides on using AI tools and Python to automate your SEO efforts, create more personalized ad campaigns, automate content journeys, and more.