In the digital era, the quality of website content plays a pivotal role in creating a positive user experience and maintaining a professional online presence. Grammatical and spelling errors not only undermine credibility but can also lead to misunderstandings. In this blog, we will explore the process of creating a powerful Grammatical and Spelling Website Checker using Python.
By leveraging Python’s versatility and a set of essential libraries, we can build a tool that not only crawls through a website but also assesses the grammatical and spelling correctness of its content. This can be particularly beneficial for businesses, bloggers, content creators and marketers striving for flawless web communication.
Prerequisites:
Before diving into the code, make sure you have the following installed on your system:
- Python and an IDE or text editor of your choice: you can download Python from Python’s official website.
- Java (JDK): download it from Java Downloads | Oracle. The language_tool_python library runs a local LanguageTool server, which requires a Java runtime.
Step 1: Setting up the Environment
Install the necessary Python libraries for web crawling, text processing, and language checking:
pip install beautifulsoup4 requests openpyxl language_tool_python
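If the installation succeeded, you can run a quick sanity check before writing any crawler code. The snippet below is a minimal illustration (not part of the final script): it starts a local LanguageTool instance and checks a deliberately misspelled sentence, which is also where a missing Java runtime will show up as an error.
# sanity_check.py - optional: verify that Java and LanguageTool are available
import language_tool_python
tool = language_tool_python.LanguageTool('en-US')
matches = tool.check("This sentense has a speling mistake.")
for match in matches:
    print(match.ruleId, match.message)
tool.close()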
This section imports the necessary libraries and modules for web crawling (requests, BeautifulSoup), handling Excel files (openpyxl), working with URLs (urllib.parse), disabling SSL warnings (urllib3), and checking grammar and spelling (language_tool_python). The line sys.setrecursionlimit(3000) raises the recursion limit to accommodate the deep recursion that can occur while crawling.
# website_checker.py
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys
import language_tool_python
# Increase the recursion limit
sys.setrecursionlimit(3000)
Step 2: Defining Initial Variables
Create a Python script (let’s call it website_checker.py) and start writing the code. You can use any text editor or integrated development environment (IDE) of your choice.
# Define the website URL to crawl
url = "https://www.example.com" # Replace with your target website URL
main_url = url
main_domain = urlparse(main_url).netloc
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
skip_patterns = [
    "https://example.com/skip",  # Add more URL patterns to skip here
]
# File path for storing the data
file_path = "website_checker_output.xlsx"
# Set to store visited URLs
visited_urls = set()
# Initialize LanguageTool
language_tool = language_tool_python.LanguageTool('en-US')
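As a quick illustration (not part of the script itself), urlparse splits a URL into components; the crawler compares the netloc part to decide whether a link belongs to the same domain, and uses startswith to match the skip patterns:
# Illustration only: how the domain and skip-pattern checks behave
from urllib.parse import urlparse
print(urlparse("https://www.example.com/blog/post").netloc)  # prints "www.example.com"
print("https://example.com/skip/page".startswith("https://example.com/skip"))  # prints True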
Step 3: Creating Excel Workbook and Worksheet
We create a new Excel workbook using openpyxl, grab the active worksheet (ws_internal), and give it a descriptive title. This is where we’ll store the results.
# Create a new workbook
wb = Workbook()
ws_internal = wb.active
ws_internal.title = 'Internal URLs'
Step 4: Defining Headers for the Excel File
We define the column headers for our Excel file, which include details like URL, Rule ID, Message, and so on. These headers will help organize the information in the Excel sheet.
# Define the headers for the Excel file
headers_internal = [
    'URL',
    'Previous Linking URL',
    'Rule ID',
    'Message',
    'Replacements',
    'Offset',
    'Error Length',
    'Category',
    'Rule Issue Type',
    'Sentence',
    'Context',
]
# Write the headers to the internal worksheet
ws_internal.append(headers_internal)
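To make the layout concrete, here is a hypothetical example of the kind of row the crawler will append under these headers (all values below are made up for illustration; real rule IDs and categories come from LanguageTool):
# Hypothetical example row, matching the headers above (illustrative values only)
example_row = [
    'https://www.example.com/blog/post',  # URL
    'https://www.example.com/blog',       # Previous Linking URL
    'MORFOLOGIK_RULE_EN_US',              # Rule ID
    'Possible spelling mistake found.',   # Message
    'sentence',                           # Replacements
    128,                                  # Offset
    8,                                    # Error Length
    'TYPOS',                              # Category
    'misspelling',                        # Rule Issue Type
    'This sentense has a mistake.',       # Sentence
    '...This sentense has a mistake...',  # Context
]
# ws_internal.append(example_row) would write it as a single spreadsheet row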
Step 5: Signal Handler for Interrupt Signal
This section sets up a signal handler to capture the interrupt signal (Ctrl+C). It ensures that if the user interrupts the program, the Excel file will be saved before exiting.
This covers the initial setup and preparation for building the Grammatical and Spelling Website Checker. The subsequent components will delve into the core functionalities such as crawling the website, checking grammar and spelling, and saving the results.
# Signal handler for interrupt signal (Ctrl+C)
def interrupt_handler(signum, frame):
    # Save the Excel file before exiting
    wb.save(file_path)
    print('Excel file saved to:', file_path)
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGINT, interrupt_handler)
Step 6: Defining the Crawler Function
This function, crawl_website, is the core of the script. It performs the following actions:
- Checks if the URL has been visited or matches skip patterns.
- Verifies if the URL belongs to the same domain.
- Sends a GET request to the URL, retrieves the HTML content, and parses it using BeautifulSoup.
- Checks if the URL is internal and meets the specified pattern.
- Extracts the text content from the page and checks for grammar and spelling errors using LanguageTool.
- Appends the error information to the Excel worksheet and saves the file.
- Recursively crawls the internal links on the page.
# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
    if url in visited_urls:
        return
    for pattern in skip_patterns:
        if url.startswith(pattern):
            print("Skipping URL:", url)
            return
    current_domain = urlparse(url).netloc
    if main_domain != current_domain:
        is_internal = False
        return
    visited_urls.add(url)
    try:
        response = requests.get(url, allow_redirects=False, verify=False)
    except requests.exceptions.InvalidSchema:
        return
    soup = BeautifulSoup(response.content, 'html.parser')
    status_code = response.status_code
    crawl_timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    if is_internal and url.startswith(main_url + '/xxx/xxx/'):  # Adjust the pattern accordingly
        worksheet = ws_internal
        page_content = soup.get_text()
        # Check grammar and spelling errors
        grammar_errors = language_tool.check(page_content)
        # Append rows for each flattened error information
        for error in grammar_errors:
            row_data = [
                url,
                previous_url,
                error.ruleId,
                error.message,
                ', '.join(error.replacements),
                error.offset,
                error.errorLength,
                error.category,
                error.ruleIssueType,
                error.sentence,
                error.context,
            ]
            worksheet.append(row_data)
        # Save the Excel file after each URL
        wb.save(file_path)
    if is_internal:
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href != '/' and not href.startswith('#'):
                if href.startswith('/'):
                    href = urljoin(main_url, href)
                if href not in visited_urls:
                    print(f"Crawling URL: {href}")
                    crawl_website(href, url, is_internal)
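A note on the recursive design: every internal link triggers another call to crawl_website, which is why the recursion limit was raised at the top of the script. For very large sites you may prefer an iterative traversal that avoids the limit entirely. The sketch below shows the same idea with an explicit queue; it reuses the helpers already defined in the script and is an alternative shape, not the approach used above:
# Alternative sketch: breadth-first crawl with an explicit queue instead of recursion
from collections import deque
def crawl_website_iterative(start_url):
    queue = deque([(start_url, '')])  # (url, previous_url) pairs still to visit
    while queue:
        current_url, previous_url = queue.popleft()
        if current_url in visited_urls:
            continue
        visited_urls.add(current_url)
        try:
            response = requests.get(current_url, allow_redirects=False, verify=False)
        except requests.exceptions.RequestException:
            continue
        soup = BeautifulSoup(response.content, 'html.parser')
        # ... same grammar-checking and Excel-writing logic as in crawl_website ...
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('/'):
                queue.append((urljoin(main_url, href), current_url))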
Full Code
Here’s the full code for your reference:
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys
import language_tool_python
# Increase the recursion limit
sys.setrecursionlimit(3000)
# Define the website URL to crawl
url = "https://www.example.com" # Replace with your target website URL
main_url = url
main_domain = urlparse(main_url).netloc
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
skip_patterns = [
    "https://example.com/skip",  # Add more URL patterns to skip here
]
# File path for storing the data
file_path = "website_checker_output.xlsx"
# Remove the existing file if it exists
if os.path.isfile(file_path):
    os.remove(file_path)
# Create a new workbook
wb = Workbook()
ws_internal = wb.active
ws_internal.title = 'Internal URLs'
# Define the headers for the Excel file
headers_internal = [
    'URL',
    'Previous Linking URL',
    'Rule ID',
    'Message',
    'Replacements',
    'Offset',
    'Error Length',
    'Category',
    'Rule Issue Type',
    'Sentence',
    'Context',
]
# Write the headers to the internal worksheet
ws_internal.append(headers_internal)
# Set to store visited URLs
visited_urls = set()
# Initialize LanguageTool
language_tool = language_tool_python.LanguageTool('en-US')
# Signal handler for interrupt signal (Ctrl+C)
def interrupt_handler(signum, frame):
    # Save the Excel file before exiting
    wb.save(file_path)
    print('Excel file saved to:', file_path)
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGINT, interrupt_handler)
# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
    if url in visited_urls:
        return
    for pattern in skip_patterns:
        if url.startswith(pattern):
            print("Skipping URL:", url)
            return
    current_domain = urlparse(url).netloc
    if main_domain != current_domain:
        is_internal = False
        return
    visited_urls.add(url)
    try:
        response = requests.get(url, allow_redirects=False, verify=False)
    except requests.exceptions.InvalidSchema:
        return
    soup = BeautifulSoup(response.content, 'html.parser')
    status_code = response.status_code
    crawl_timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    if is_internal and url.startswith(main_url + '/xxx/xxx/'):  # Adjust the pattern accordingly
        worksheet = ws_internal
        page_content = soup.get_text()
        # Check grammar and spelling errors
        grammar_errors = language_tool.check(page_content)
        # Append rows for each flattened error information
        for error in grammar_errors:
            row_data = [
                url,
                previous_url,
                error.ruleId,
                error.message,
                ', '.join(error.replacements),
                error.offset,
                error.errorLength,
                error.category,
                error.ruleIssueType,
                error.sentence,
                error.context,
            ]
            worksheet.append(row_data)
        # Save the Excel file after each URL
        wb.save(file_path)
    if is_internal:
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href != '/' and not href.startswith('#'):
                if href.startswith('/'):
                    href = urljoin(main_url, href)
                if href not in visited_urls:
                    print(f"Crawling URL: {href}")
                    crawl_website(href, url, is_internal)
# Call the crawl_website function for the main URL
try:
    crawl_website(url)
except Exception as e:
    print(f"An error occurred during the crawl: {str(e)}")

# Save the Excel file one final time
wb.save(file_path)
print('Excel file saved to:', file_path)
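To run the checker, set url to your own site, adjust the /xxx/xxx/ path filter to match the section you want to check, and execute the script from a terminal. Because the spreadsheet is saved after every crawled URL (and again on Ctrl+C), you can stop the crawl at any point without losing results:
python website_checker.py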
Conclusion
By combining web crawling, natural language processing, and data storage capabilities, this Python script provides a valuable resource for maintaining high-quality, error-free textual content on websites. The script can be further extended with additional features, such as incorporating more advanced language processing tools or scanning only specific parts of each page, such as the <body> tag.
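As a minimal sketch of that last idea (assuming the pages have a <body> tag and you want to ignore script and style content), the page_content line inside crawl_website could be swapped for a body-only version:
# Sketch: check only the visible text inside <body>, ignoring scripts and styles
body = soup.find('body')
if body is not None:
    for tag in body(['script', 'style', 'noscript']):
        tag.decompose()  # drop non-visible elements before extracting text
    page_content = body.get_text(separator=' ', strip=True)
    grammar_errors = language_tool.check(page_content)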
AIHelperHub is an expert in AI-focused SEO/digital marketing automation and Python-based SEO automation, and enjoys teaching others how to harness both to future-proof their digital marketing. AIHelperHub provides comprehensive guides on using AI tools and Python to automate your SEO efforts, create more personalized ad campaigns, automate content journeys, and more.