NLP content clustering is a smart way to organize and understand large amounts of content using python. This method is super useful in data analysis and text mining, helping us find patterns and group similar content/URL together.
In this guide, we’ll walk you through how to use NLP (Natural Language Processing) and a popular method called k-means clustering to organize content from over 1,000 URLs using Python. Don’t worry if some of these terms sound new – we’ll explain everything step by step!
Understanding NLP and Clustering
Let’s start with the basics. NLP, or Natural Language Processing, is like teaching computers to understand and work with human language. It’s the technology behind things like voice assistants and translation apps.
Clustering, on the other hand, is a way to group similar things together. In NLP, we use clustering to organize text into groups based on how alike they are. It’s like sorting a big pile of clothes into neat stacks of similar items.
There are different ways to do clustering, but one of the most common is called k-means clustering. We’ll focus on this method in our guide. Other methods include hierarchical clustering, which creates a tree-like structure of groups.
Step by step process
Let’s break down the process of preparing our dataset:
1. Parsing the Sitemap
Our first step in the content clustering process is to gather the URLs we want to analyze. This is a crucial step as the quality and relevance of our dataset will directly impact the effectiveness of our clustering results. In this guide, we’re using an XML sitemap as our source of URLs, which is a common and efficient method for large-scale web content analysis. We start by defining a function to parse an XML sitemap:
import requests
from bs4 import BeautifulSoup
def parse_sitemap(sitemap_url):
response = requests.get(sitemap_url)
if response.status_code == 200:
soup = BeautifulSoup(response.content, 'xml')
urls = [url.text.strip() for url in soup.find_all('loc')
if not url.text.strip().lower().endswith(image_extensions)]
return urls
else:
print(f"Error retrieving sitemap: Status code {response.status_code}")
return None
sitemap_url = "https://www.xxx.com/sitemap.xml"
urls = parse_sitemap(sitemap_url)
This function does several important things:
- It sends a GET request to the sitemap URL using the
requests
library. - It checks if the request was successful (status code 200).
- If successful, it uses BeautifulSoup to parse the XML content.
- It extracts all the
<loc>
tags, which contain the URLs. - It filters out any image URLs to focus only on web pages.
2. Content Extraction using Text Processing
After gathering our URLs, the next crucial step is to extract and process the text content from each webpage. This step is vital as it transforms the raw HTML content into a format suitable for analysis and clustering. Let’s break down this process:
- Fetching Content: We use the
trafilatura
library to fetch and extract the main content from each URL:
import trafilatura
from concurrent.futures import ThreadPoolExecutor, as_completed
def fetch_text_from_url(session, url):
print(f"Processing URL: {url}")
try:
downloaded = trafilatura.fetch_url(url)
if downloaded:
text = trafilatura.extract(downloaded)
if text:
print(f"Successfully fetched text from {url}")
return text
else:
print(f"Failed to extract text from {url}.")
return None
else:
print(f"Failed to download content from {url}.")
return None
except Exception as e:
print(f"Error fetching text from URL: {e}")
return None
This function does several important things:
- It uses
trafilatura.fetch_url()
to download the content of each webpage. - It then uses
trafilatura.extract()
to extract the main text content, removing HTML tags and boilerplate content. - It includes error handling to manage cases where content can’t be fetched or extracted.
- Text Preprocessing: Once we have the raw text, we preprocess it to make it suitable for analysis:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
def preprocess_text(text):
# Convert text to lowercase
text = text.lower()
# Remove punctuation and numbers
text = ''.join([char for char in text if char.isalpha() or char.isspace()])
# Tokenize text
words = word_tokenize(text)
# Remove stopwords
stop_words_english = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words_english]
# Rejoin words into a single string
text = ' '.join(words)
return text
This preprocessing function:
- Converts all text to lowercase for consistency.
- Removes punctuation and numbers to focus on word content.
- Tokenizes the text into individual words.
- Removes common English stopwords to focus on meaningful content.
- Parallel Processing: To handle large numbers of URLs efficiently, we use parallel processing:
def extract_texts(urls):
documents = []
with ThreadPoolExecutor(max_workers=10) as executor:
future_to_url = {executor.submit(fetch_text_from_url, session, url): url for url in urls}
for future in as_completed(future_to_url):
url = future_to_url[future]
try:
text = future.result()
if text:
processed_text = preprocess_text(text)
documents.append((url, processed_text))
else:
print(f"Failed to retrieve text from {url}. Skipping.")
except Exception as e:
print(f"Error processing {url}: {e}")
return documents
This function uses Python’s ThreadPoolExecutor
to process multiple URLs concurrently, significantly speeding up the extraction process for large datasets.
By thoroughly processing and cleaning the text content from each URL, we create a solid foundation for the subsequent clustering step.
3. Implementing Clustering
With our processed text data in hand, we can now move on to the clustering step. In this implementation, we’re using a similarity-based approach rather than traditional k-means clustering. This method allows us to group URLs based on the similarity of their content without having to predefine the number of clusters. Let’s break down this process:
- Vectorizing Documents: First, we need to convert our processed text into a numerical format that can be analyzed:
from sklearn.feature_extraction.text import CountVectorizer
def vectorize_documents(documents):
urls, texts = zip(*documents)
print("Vectorizing documents...")
vectorizer = CountVectorizer(stop_words='english')
text_vectorized = vectorizer.fit_transform(texts)
return urls, text_vectorized, vectorizer
This function:
- Separates our URLs and processed texts.
- Uses sklearn’s
CountVectorizer
to convert the text into a matrix of token counts. - The
stop_words='english'
parameter provides an additional layer of stopword removal.
- Calculating Similarity: Next, we calculate the similarity between documents:
from sklearn.metrics.pairwise import cosine_similarity
def cluster_by_similarity(urls, text_vectorized, vectorizer, similarity_threshold):
print("Calculating cosine similarity matrix...")
similarity_matrix = cosine_similarity(text_vectorized)
We use cosine similarity, which measures the cosine of the angle between two vectors. This is a common choice for text analysis as it’s independent of document length.
Similarity Threshold: The similarity_threshold
is a crucial parameter. A higher threshold will result in more, smaller clusters, while a lower threshold will create fewer, larger clusters. You may need to experiment to find the right balance for your specific use case.
Read more – How to Use Python for SEO
- Forming Clusters: We then form clusters based on the calculated similarities:
clusters = {i: [i] for i in range(len(urls))}
for i in range(len(urls)):
for j in range(i + 1, len(urls)):
if similarity_matrix[i, j] >= similarity_threshold:
clusters[i].append(j)
clusters[j].append(i)
This step:
- Initializes each URL as its own cluster.
- Compares each pair of URLs.
- If their similarity is above the threshold, they’re added to each other’s clusters.
- Processing and Displaying Clusters: Finally, we process the clusters to remove duplicates and display them:
clusters = {url_index: list(set(related_urls)) for url_index, related_urls in clusters.items() if related_urls}
url_to_cluster = {}
visited = set()
for url_index, related_urls in clusters.items():
if url_index not in visited:
cluster = []
for index in related_urls:
if index not in visited:
cluster.append(urls[index])
visited.add(index)
url_to_cluster[urls[url_index]] = cluster
cluster_number = 1
for url, cluster in url_to_cluster.items():
print(f"\nCluster {cluster_number}: ")
print(" - " + "\n - ".join(cluster))
cluster_number += 1
This part of the function:
- Groups URLs into distinct clusters.
- Prints each cluster with a numbered label.
Google Colab Code Link
Considerations and Potential Improvements:
- Language Detection: For multilingual websites, you might want to add language detection and use language-specific stopwords.
- Content Cleaning: Depending on your needs, you might want to add more sophisticated content cleaning, such as removing HTML entities or specific patterns.
- Handling Different Content Types: You might need to adapt the extraction process for different types of content (e.g., news articles vs. product pages).
Diwakar Loomba is the founder of AIHelperHub and a veteran digital strategist with over 10 years of experience in data driven performance and growth marketing.
Diwakar leveraged advanced SEO strategies along with AI and python to enhance user experience, boost conversion rates, and amplify brand awareness across diverse online businesses, including IT/ITeS, E-commerce, Telecommunications, Automobile and other B2B & B2C businesses.