NLP Content Clustering for 1,000+ URLs in Python

NLP content clustering is a smart way to organize and understand large amounts of content using Python. This method is especially useful in data analysis and text mining, helping us find patterns and group similar pieces of content and URLs together.

In this guide, we’ll walk you through how to use NLP (Natural Language Processing) and a similarity-based clustering approach to organize content from over 1,000 URLs using Python. Don’t worry if some of these terms sound new – we’ll explain everything step by step!

Understanding NLP and Clustering

Let’s start with the basics. NLP, or Natural Language Processing, is like teaching computers to understand and work with human language. It’s the technology behind things like voice assistants and translation apps.

Clustering, on the other hand, is a way to group similar things together. In NLP, we use clustering to organize text into groups based on how alike they are. It’s like sorting a big pile of clothes into neat stacks of similar items.

There are different ways to do clustering. One of the most common is k-means clustering; another is hierarchical clustering, which builds a tree-like structure of groups. In this guide, however, we’ll use a simpler similarity-based approach that groups pages by how alike their content is, so we don’t have to decide on the number of clusters up front.
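
If you’re curious what k-means looks like in practice, here is a minimal, self-contained sketch on a handful of toy sentences using scikit-learn. The sentences, the choice of two clusters, and the TF-IDF vectorizer are illustrative assumptions only – this is not the approach we build later in the guide.

# Minimal k-means illustration on toy sentences (assumed example data, not part of the pipeline below)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "python tutorial for beginners",
    "learn python programming step by step",
    "best running shoes for marathons",
    "how to choose trail running shoes",
]

# Turn the sentences into TF-IDF vectors, then ask k-means for 2 clusters
vectors = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)
print(labels)  # e.g. the two "python" sentences land in one cluster, the two "shoes" sentences in the other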

Step-by-Step Process

Let’s break the process down into three steps: parsing the sitemap, extracting and processing the content, and clustering the results.

1. Parsing the Sitemap

Our first step in the content clustering process is to gather the URLs we want to analyze. This is a crucial step as the quality and relevance of our dataset will directly impact the effectiveness of our clustering results. In this guide, we’re using an XML sitemap as our source of URLs, which is a common and efficient method for large-scale web content analysis. We start by defining a function to parse an XML sitemap:

import requests
from bs4 import BeautifulSoup

# File extensions we treat as images so that only web pages are kept
image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg')

def parse_sitemap(sitemap_url):
    response = requests.get(sitemap_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'xml')
        urls = [url.text.strip() for url in soup.find_all('loc')
                if not url.text.strip().lower().endswith(image_extensions)]
        return urls
    else:
        print(f"Error retrieving sitemap: Status code {response.status_code}")
        return None

sitemap_url = "https://www.xxx.com/sitemap.xml"
urls = parse_sitemap(sitemap_url)

This function does several important things:

  • It sends a GET request to the sitemap URL using the requests library.
  • It checks if the request was successful (status code 200).
  • If successful, it uses BeautifulSoup to parse the XML content.
  • It extracts all the <loc> tags, which contain the URLs.
  • It filters out any image URLs (anything ending in one of the extensions listed in image_extensions) to focus only on web pages.
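
As a quick sanity check, you can run the parser and inspect what came back before moving on. This is only a small usage sketch: the sitemap URL is a placeholder, and the deduplication step is just a precaution in case the sitemap lists the same page twice.

urls = parse_sitemap("https://www.xxx.com/sitemap.xml")  # placeholder URL
if urls:
    urls = list(dict.fromkeys(urls))  # drop duplicates while keeping order
    print(f"Found {len(urls)} page URLs")
    print(urls[:5])  # peek at the first few entries
else:
    print("No URLs retrieved - check the sitemap URL or your connection.")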

2. Content Extraction using Text Processing

After gathering our URLs, the next crucial step is to extract and process the text content from each webpage. This step is vital as it transforms the raw HTML content into a format suitable for analysis and clustering. Let’s break down this process:

  1. Fetching Content: We use the trafilatura library to fetch and extract the main content from each URL:
import trafilatura
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_text_from_url(url):
    print(f"Processing URL: {url}")
    try:
        downloaded = trafilatura.fetch_url(url)
        if downloaded:
            text = trafilatura.extract(downloaded)
            if text:
                print(f"Successfully fetched text from {url}")
                return text
            else:
                print(f"Failed to extract text from {url}.")
                return None
        else:
            print(f"Failed to download content from {url}.")
            return None
    except Exception as e:
        print(f"Error fetching text from URL: {e}")
        return None

This function does several important things:

  • It uses trafilatura.fetch_url() to download the content of each webpage.
  • It then uses trafilatura.extract() to extract the main text content, removing HTML tags and boilerplate content.
  • It includes error handling to manage cases where content can’t be fetched or extracted.
  2. Text Preprocessing: Once we have the raw text, we preprocess it to make it suitable for analysis:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = ''.join([char for char in text if char.isalpha() or char.isspace()])
    # Tokenize text
    words = word_tokenize(text)
    # Remove stopwords
    stop_words_english = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words_english]
    # Rejoin words into a single string
    text = ' '.join(words)
    return text

This preprocessing function:

  • Converts all text to lowercase for consistency.
  • Removes punctuation and numbers to focus on word content.
  • Tokenizes the text into individual words.
  • Removes common English stopwords to focus on meaningful content.
  3. Parallel Processing: To handle large numbers of URLs efficiently, we use parallel processing:
def extract_texts(urls):
    documents = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(fetch_text_from_url, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                text = future.result()
                if text:
                    processed_text = preprocess_text(text)
                    documents.append((url, processed_text))
                else:
                    print(f"Failed to retrieve text from {url}. Skipping.")
            except Exception as e:
                print(f"Error processing {url}: {e}")
    return documents

This function uses Python’s ThreadPoolExecutor to process multiple URLs concurrently, significantly speeding up the extraction process for large datasets.

By thoroughly processing and cleaning the text content from each URL, we create a solid foundation for the subsequent clustering step.
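
Before running the extraction on your own machine, note that word_tokenize and stopwords rely on NLTK data packages that may not be installed yet. The short sketch below downloads them once and then runs the full extraction step; the download calls are an assumption about your local setup and can be skipped if the data is already present.

import nltk

# One-time downloads needed by word_tokenize and stopwords
# (newer NLTK versions may also prompt for 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

documents = extract_texts(urls)
print(f"Extracted and preprocessed {len(documents)} of {len(urls)} pages")
if documents:
    first_url, first_text = documents[0]
    print(first_url, first_text[:200])  # preview of the first cleaned document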

3. Implementing Clustering

With our processed text data in hand, we can now move on to the clustering step. In this implementation, we’re using a similarity-based approach rather than traditional k-means clustering. This method allows us to group URLs based on the similarity of their content without having to predefine the number of clusters. Let’s break down this process:

  1. Vectorizing Documents: First, we need to convert our processed text into a numerical format that can be analyzed:
from sklearn.feature_extraction.text import CountVectorizer

def vectorize_documents(documents):
    urls, texts = zip(*documents)
    print("Vectorizing documents...")
    vectorizer = CountVectorizer(stop_words='english')
    text_vectorized = vectorizer.fit_transform(texts)
    return urls, text_vectorized, vectorizer

This function:

  • Separates our URLs and processed texts.
  • Uses sklearn’s CountVectorizer to convert the text into a matrix of token counts.
  • The stop_words='english' parameter provides an additional layer of stopword removal.
  2. Calculating Similarity: Next, we calculate the similarity between documents:
from sklearn.metrics.pairwise import cosine_similarity

def cluster_by_similarity(urls, text_vectorized, vectorizer, similarity_threshold):
    print("Calculating cosine similarity matrix...")
    similarity_matrix = cosine_similarity(text_vectorized)

We use cosine similarity, which measures the cosine of the angle between two vectors. This is a common choice for text analysis as it’s independent of document length.

Similarity Threshold: The similarity_threshold is a crucial parameter. A higher threshold will result in more, smaller clusters, while a lower threshold will create fewer, larger clusters. You may need to experiment to find the right balance for your specific use case.
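
To build intuition for what the threshold is comparing, here is a tiny standalone example of cosine similarity between three short documents. The sentences and the 0.5 threshold mentioned in the comment are arbitrary illustration values, not recommendations.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "python seo automation",
    "seo automation with python scripts",
    "chocolate cake recipe",
]
matrix = cosine_similarity(CountVectorizer().fit_transform(docs))
print(matrix.round(2))
# The first two documents share most of their words, so their similarity is high (around 0.77),
# while either of them compared with the cake recipe is 0. With a threshold of, say, 0.5,
# only documents 1 and 2 would end up in the same cluster.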

  3. Forming Clusters: We then form clusters based on the calculated similarities:
    # (continuing inside cluster_by_similarity)
    clusters = {i: [i] for i in range(len(urls))}
    for i in range(len(urls)):
        for j in range(i + 1, len(urls)):
            if similarity_matrix[i, j] >= similarity_threshold:
                clusters[i].append(j)
                clusters[j].append(i)

This step:

  • Initializes each URL as its own cluster.
  • Compares each pair of URLs.
  • If their similarity is above the threshold, they’re added to each other’s clusters.
  4. Processing and Displaying Clusters: Finally, we process the clusters to remove duplicates and display them:
    # (still inside cluster_by_similarity) deduplicate the related indices
    clusters = {url_index: list(set(related_urls)) for url_index, related_urls in clusters.items() if related_urls}
    url_to_cluster = {}
    visited = set()
    for url_index, related_urls in clusters.items():
        if url_index not in visited:
            cluster = []
            for index in related_urls:
                if index not in visited:
                    cluster.append(urls[index])
                    visited.add(index)
            url_to_cluster[urls[url_index]] = cluster

    cluster_number = 1
    for url, cluster in url_to_cluster.items():
        print(f"\nCluster {cluster_number}: ")
        print(" - " + "\n - ".join(cluster))
        cluster_number += 1

This part of the function:

  • Groups URLs into distinct clusters.
  • Prints each cluster with a numbered label.
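
Putting it all together, a typical run of the whole pipeline looks something like the sketch below. The 0.5 threshold is only a starting point to experiment with, and the sitemap URL is still a placeholder.

sitemap_url = "https://www.xxx.com/sitemap.xml"  # placeholder - use your own sitemap
urls = parse_sitemap(sitemap_url)

if urls:
    documents = extract_texts(urls)  # fetch and preprocess every page
    if documents:
        urls, text_vectorized, vectorizer = vectorize_documents(documents)
        cluster_by_similarity(urls, text_vectorized, vectorizer,
                              similarity_threshold=0.5)  # tune this value for your site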

Google Colab Code Link

Considerations and Potential Improvements:

  • Language Detection: For multilingual websites, you might want to add language detection and use language-specific stopwords.
  • Content Cleaning: Depending on your needs, you might want to add more sophisticated content cleaning, such as removing HTML entities or specific patterns.
  • Handling Different Content Types: You might need to adapt the extraction process for different types of content (e.g., news articles vs. product pages).
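
For example, to act on the language-detection idea above, one lightweight option is the langdetect package. The sketch below is a hedged illustration only: the stopwords_for helper and the language mapping are hypothetical additions, not part of the pipeline above.

# Hedged sketch: per-language stopwords via langdetect (pip install langdetect)
from langdetect import detect, DetectorFactory
from nltk.corpus import stopwords

DetectorFactory.seed = 0  # make detection deterministic

def stopwords_for(text):
    # Hypothetical helper: pick an NLTK stopword list based on the detected language
    try:
        lang_code = detect(text)  # e.g. 'en', 'fr', 'de'
    except Exception:
        lang_code = 'en'
    lang_names = {'en': 'english', 'fr': 'french', 'de': 'german', 'es': 'spanish'}
    return set(stopwords.words(lang_names.get(lang_code, 'english')))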
