In the age of AI and big data, there is a growing demand for churning out more and more data. Research based companies now routinely crawl the web to gather external data for purposes like monitoring competitors, digesting news, identifying market trends, or collecting stock prices to inform predictive models. Web crawlers automate the process of browsing and extracting information from the internet based on predefined rules. Their importance is rising as more organizations seek large volumes of web data for analysis.
Web crawling is a powerful technique for extracting data from websites, and Python is an excellent choice for building web crawlers. In this guide, we will walk you through the process of creating a web crawler using Python. We will use the code you’ve provided, utilizing libraries such as Requests, BeautifulSoup, and OpenPyXL to scrape and store data.
Python offers a robust environment for web crawling, thanks to its extensive libraries and packages. Whether you’re extracting data for research, analysis, or any other purpose, Python is the ideal language.
Classification Of Web Crawlers
There are several main types of web crawlers used for different purposes.
- General crawlers broadly crawl the web to index as many pages as possible, as seen with search engine crawlers like Googlebot.
- Focused crawlers target specific websites or topics, gathering data relevant to a particular domain.
- Incremental crawlers revisit web pages periodically to identify new content. Deep crawlers extensively spider a site by following every link to discover pages.
- Vertical crawlers dig deeply into a specific industry or niche.
- Mobile crawlers are designed to crawl and render pages as they appear on mobile devices.
Each type serves a different need, whether it’s building a comprehensive web index, gathering data within a niche, monitoring updates, mapping site architecture, optimizing for mobile, or other focused goals. The crawler system must be tailored to the specific use case.
Basic Workflow Of Web Crawlers
- Crawl Starts with Seed URLs:
- The crawler is initialized with a list of seed URLs to visit first. These act as the starting points.
- URL Frontier Queue:
- The crawler adds URLs from crawled pages to a frontier queue to be visited. This manages the crawl order.
- URL Scheduling:
- URLs in the frontier are scheduled based on priority algorithms to maximize efficiency.
- Fetch Web Page:
- The crawler sends HTTP requests to fetch pages based on the schedule.
- Parse Page Contents:
- Downloaded pages are parsed to extract text, links, images, metadata, etc.
- Store Data:
- Extracted information is stored in a structured database or indexes.
- Analyze Links:
- Any links in the page are analyzed to add to the URL frontier.
- Rinse and Repeat:
- The process repeats continuously, scheduling and fetching new pages from the frontier.
- Crawl Ends When Conditions Met:
- The crawl ends after certain conditions are satisfied, such as time limits or coverage goals.
Now, let’s get started with setting up your environment and understanding the fundamental concepts of web crawling with Python.
Setting Up Your Environment
Before you begin your web crawling journey, it’s crucial to set up your Python environment. Here’s how you can do it:
- Install Python: If you don’t have Python installed, download and install the latest version from the official Python website.
- Install Necessary Libraries: You’ll need specific libraries for web crawling. One of the most vital is the
Requests
library for making HTTP requests. You can install it using pip:- import os library provides functions for interacting with the operating system.
- import signal library enables sending interrupt signals to the current process.
- BeautifulSoup library parses HTML and XML documents into Python objects.
- requests library sends HTTP requests and handles responses.
- Workbook library creates and manages Excel workbook/worksheet objects.
- urlparse, urljoin libraries splits apart and combines URLs.
- urllib3 – a package for making HTTP requests.
- sys library provides system-specific parameters and functions.
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys
Understanding the Basics of HTTP Requests
Understanding HTTP requests is fundamental for web crawling. Your provided code uses the Requests
library to send HTTP GET requests. Here’s how it works:
# Code for making an HTTP GET request
url = "https://www.xyz.com"
response = requests.get(url)
In this code, we import the requests
library and send an HTTP GET request to the specified URL. The response
object contains the server’s response, including the content of the web page, status code, headers, and more. This content can be processed and parsed for data extraction in your web crawling endeavors.
Web servers may return 403 Forbidden errors when crawlers attempt to access page content, blocking them through anti-crawler settings to prevent malicious data collection. These access restrictions occur because the server cannot verify the crawler is harmless based on the request headers alone. To bypass a 403 error and access the web page, crawlers can mimic or spoof browser header information in their requests to appear as normal user traffic. This simulates the headers sent by web browsers like Chrome, Firefox or Safari. By disguising its identity, the crawler can trick the server into allowing it to scrape the page content, despite anti-crawler protections being active. However, this technique should be used judiciously, only when absolutely necessary for collecting data.
Parsing HTML with BeautifulSoup
To extract data from web pages, your code utilizes BeautifulSoup, a powerful HTML parsing library. Here’s how it works:
# Code for parsing HTML with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
In this code, we import the BeautifulSoup
class from the bs4
library and create a BeautifulSoup object (soup
) by passing it the HTML content from the HTTP response. This object enables you to navigate and extract specific elements from the HTML, simplifying your interaction with the web page’s structure.
Building a Web Crawler in Python
Now that you have a solid foundation in setting up your Python environment and understanding the basics of HTTP requests and HTML parsing, it’s time to dive into the practical aspect of building a web crawler using the code you’ve provided.
Your provided code defines a web crawler that systematically explores a website and extracts valuable data. Here’s a glimpse of how it operates:
# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
# Your crawling code here
pass
In this code, the crawl_website
function plays a central role in the crawling process. It begins by checking if the URL has already been visited or if the execution was interrupted, ensuring the efficiency of the crawl.
The code also incorporates logic to skip over URLs that match predefined patterns, helping you focus on the most relevant data. This is essential when crawling websites with diverse content and structure.
Your web crawler maintains a record of visited URLs and their associated information. It stores this data in an Excel file, allowing for systematic organization and future analysis.
Storing the Crawled Data in Excel
Once you’ve successfully extracted data with your web crawler, the next step is to store and analyze that data. Your provided code includes mechanisms for saving the crawled data in an Excel file, providing a structured format for future analysis. Here’s a snippet of the relevant code:
# Save the Excel file
wb.save(file_path)
In this code, the wb
object represents an Excel workbook, and the save
method is used to store the crawled data in an Excel file. This format offers flexibility for further data analysis, visualization, and manipulation using tools like Microsoft Excel or other data analysis software.
You can expand on this section by explaining how to read data from the Excel file, perform data analysis, and visualize the results. This is where the valuable insights from your web crawling efforts come to light.
With the knowledge of setting up your environment, making HTTP requests, parsing HTML, and building a web crawler, you have the essential tools and skills to explore and extract data from websites efficiently.
Full Python Code
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys
# Increase the recursion limit
sys.setrecursionlimit(3000) # Set a higher recursion limit (adjust the value as needed)
# Replace xyz.com with your website URL to crawl
url = "https://www.xyz.com"
main_url = url
# Parse the domain from the main URL
main_domain = urlparse(main_url).netloc
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
skip_patterns = [
"https://abc.xyz.com",
"https://www.xyz.com/sub"
# Add more URL patterns to skip here
]
file_path = r'C:\Users\crawler.xlsx'
# Remove the existing file if it exists
if os.path.isfile(file_path):
os.remove(file_path)
# Create a new workbook
wb = Workbook()
ws_internal = wb.active
ws_internal.title = 'Internal URLs'
# Define the headers for the Excel file
headers_internal = ['URL', 'Previous Linking URL']
# Write the headers to the internal worksheet
ws_internal.append(headers_internal)
# Set to store visited URLs
visited_urls = set()
counter = 0
# Signal handler for interrupt signal (Ctrl+C)
def interrupt_handler(signal, frame):
# Save the Excel file
wb.save(file_path)
print('Excel file saved to:', file_path)
exit(0)
# Register the signal handler
signal.signal(signal.SIGINT, interrupt_handler)
# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
# Check if the URL has already been visited or if execution was interrupted
global counter
global status_code
if url in visited_urls:
#print("Skipping URL - visited URL:", url)
return
for pattern in skip_patterns:
if url.startswith(pattern):
print("Skipping URL:", url)
return
# Parse the domain from the current URL
current_domain = urlparse(url).netloc
# Check if the current URL is from the same domain as main_url
if main_domain != current_domain:
is_internal = False # Mark the URL as external
return
# Add the URL to the visited set
print("Adding URL - visited URL:", url)
visited_urls.add(url)
try:
# Send a GET request to the website with verify=False for SSL certificate verification bypass
response = requests.get(url, allow_redirects=False, verify=False)
except requests.exceptions.InvalidSchema:
# Skip invalid URLs
return
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Extract the status code
status_code = response.status_code
# Get the current crawl timestamp
crawl_timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# Write the data to the appropriate worksheet based on internal/external URL
if is_internal:
worksheet = ws_internal
# Write the data to the Excel sheet
worksheet.append([url, previous_url])
print("adding in excel:"+url)
counter+=1
print(f"Crawl Number: {counter}")
# Crawl the URLs only for internal pages
if is_internal:
# Recursively crawl the links on the page
for link in soup.find_all('a'):
href = link.get('href')
if href and href != '/' and not href.startswith('#'):
if href.startswith('/'):
href = urljoin(main_url, href)
if href not in visited_urls:
print(f"Crawling URL: {href}")
crawl_website(href, url, is_internal)
try:
# Call the crawl_website function for the main URL
status_code= ''
crawl_website(url)
except Exception as e:
print(f"An error occurred during the crawl: {str(e)}")
# Save the Excel file
wb.save(file_path)
# Print the file path
print('Excel file saved to:', file_path)
A note about Ethical Crawling
When embarking on your web crawling journey, remember to crawl your own website for practice and experimentation. Additionally, it’s crucial to respect the robots.txt
file of websites you intend to crawl. The robots.txt
file contains directives that guide web crawlers on which parts of a site are off-limits for crawling. Respecting these guidelines not only ensures ethical web crawling but also helps maintain a positive online ecosystem. Always be mindful of the terms of service and legalities of the websites you intend to crawl to avoid any potential legal issues or disruptions.
Troubleshooting and Advanced Techniques
As your web crawling projects become more sophisticated, you may encounter challenges and opportunities for improvement. You can evolve your crawling project to a fully fledged SEO tech audit tool that can edit technical errors on your website such as absence of meta titles, description, non-canonical links etc. In our next project we will be crawling the complete website along with their status, meta titles and description. With the knowledge and code you’ve gained from this tutorial, you’re well-equipped to embark on your web crawling journey. Happy crawling!
Diwakar Loomba is the founder of AIHelperHub and a veteran digital strategist with over 10 years of experience in data driven performance and growth marketing.
Diwakar leveraged advanced SEO strategies along with AI and python to enhance user experience, boost conversion rates, and amplify brand awareness across diverse online businesses, including IT/ITeS, E-commerce, Telecommunications, Automobile and other B2B & B2C businesses.