How to crawl an entire website using python

In the age of AI and big data, there is a growing demand for churning out more and more data. Research based companies now routinely crawl the web to gather external data for purposes like monitoring competitors, digesting news, identifying market trends, or collecting stock prices to inform predictive models. Web crawlers automate the process of browsing and extracting information from the internet based on predefined rules. Their importance is rising as more organizations seek large volumes of web data for analysis.

Web crawling is a powerful technique for extracting data from websites, and Python is an excellent choice for building web crawlers. In this guide, we will walk you through the process of creating a web crawler using Python. We will use the code you’ve provided, utilizing libraries such as Requests, BeautifulSoup, and OpenPyXL to scrape and store data.

Python offers a robust environment for web crawling, thanks to its extensive libraries and packages. Whether you’re extracting data for research, analysis, or any other purpose, Python is the ideal language.

Classification Of Web Crawlers

There are several main types of web crawlers used for different purposes.

  • General crawlers broadly crawl the web to index as many pages as possible, as seen with search engine crawlers like Googlebot.
  • Focused crawlers target specific websites or topics, gathering data relevant to a particular domain.
  • Incremental crawlers revisit web pages periodically to identify new content. Deep crawlers extensively spider a site by following every link to discover pages.
  • Vertical crawlers dig deeply into a specific industry or niche.
  • Mobile crawlers are designed to crawl and render pages as they appear on mobile devices.

Each type serves a different need, whether it’s building a comprehensive web index, gathering data within a niche, monitoring updates, mapping site architecture, optimizing for mobile, or other focused goals. The crawler system must be tailored to the specific use case.

Basic Workflow Of Web Crawlers

  • Crawl Starts with Seed URLs:
    • The crawler is initialized with a list of seed URLs to visit first. These act as the starting points.
  • URL Frontier Queue:
    • The crawler adds URLs from crawled pages to a frontier queue to be visited. This manages the crawl order.
  • URL Scheduling:
    • URLs in the frontier are scheduled based on priority algorithms to maximize efficiency.
  • Fetch Web Page:
    • The crawler sends HTTP requests to fetch pages based on the schedule.
  • Parse Page Contents:
    • Downloaded pages are parsed to extract text, links, images, metadata, etc.
  • Store Data:
    • Extracted information is stored in a structured database or indexes.
  • Analyze Links:
    • Any links in the page are analyzed to add to the URL frontier.
  • Rinse and Repeat:
    • The process repeats continuously, scheduling and fetching new pages from the frontier.
  • Crawl Ends When Conditions Met:
    • The crawl ends after certain conditions are satisfied, such as time limits or coverage goals.

Now, let’s get started with setting up your environment and understanding the fundamental concepts of web crawling with Python.

Setting Up Your Environment

Before you begin your web crawling journey, it’s crucial to set up your Python environment. Here’s how you can do it:

  1. Install Python: If you don’t have Python installed, download and install the latest version from the official Python website.
  2. Install Necessary Libraries: You’ll need specific libraries for web crawling. One of the most vital is the Requests library for making HTTP requests. You can install it using pip:
    • import os library provides functions for interacting with the operating system.
    • import signal library enables sending interrupt signals to the current process.
    • BeautifulSoup library parses HTML and XML documents into Python objects.
    • requests library sends HTTP requests and handles responses.
    • Workbook library creates and manages Excel workbook/worksheet objects.
    • urlparse, urljoin libraries splits apart and combines URLs.
    • urllib3 – a package for making HTTP requests.
    • sys library provides system-specific parameters and functions.
import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys

Understanding the Basics of HTTP Requests

Understanding HTTP requests is fundamental for web crawling. Your provided code uses the Requests library to send HTTP GET requests. Here’s how it works:

# Code for making an HTTP GET request
url = "https://www.xyz.com"
response = requests.get(url)

In this code, we import the requests library and send an HTTP GET request to the specified URL. The response object contains the server’s response, including the content of the web page, status code, headers, and more. This content can be processed and parsed for data extraction in your web crawling endeavors.

Web servers may return 403 Forbidden errors when crawlers attempt to access page content, blocking them through anti-crawler settings to prevent malicious data collection. These access restrictions occur because the server cannot verify the crawler is harmless based on the request headers alone. To bypass a 403 error and access the web page, crawlers can mimic or spoof browser header information in their requests to appear as normal user traffic. This simulates the headers sent by web browsers like Chrome, Firefox or Safari. By disguising its identity, the crawler can trick the server into allowing it to scrape the page content, despite anti-crawler protections being active. However, this technique should be used judiciously, only when absolutely necessary for collecting data.

Parsing HTML with BeautifulSoup

To extract data from web pages, your code utilizes BeautifulSoup, a powerful HTML parsing library. Here’s how it works:

# Code for parsing HTML with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

In this code, we import the BeautifulSoup class from the bs4 library and create a BeautifulSoup object (soup) by passing it the HTML content from the HTTP response. This object enables you to navigate and extract specific elements from the HTML, simplifying your interaction with the web page’s structure.

Building a Web Crawler in Python

Now that you have a solid foundation in setting up your Python environment and understanding the basics of HTTP requests and HTML parsing, it’s time to dive into the practical aspect of building a web crawler using the code you’ve provided.

Your provided code defines a web crawler that systematically explores a website and extracts valuable data. Here’s a glimpse of how it operates:

# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
    # Your crawling code here
    pass

In this code, the crawl_website function plays a central role in the crawling process. It begins by checking if the URL has already been visited or if the execution was interrupted, ensuring the efficiency of the crawl.

The code also incorporates logic to skip over URLs that match predefined patterns, helping you focus on the most relevant data. This is essential when crawling websites with diverse content and structure.

Your web crawler maintains a record of visited URLs and their associated information. It stores this data in an Excel file, allowing for systematic organization and future analysis.

Storing the Crawled Data in Excel

Once you’ve successfully extracted data with your web crawler, the next step is to store and analyze that data. Your provided code includes mechanisms for saving the crawled data in an Excel file, providing a structured format for future analysis. Here’s a snippet of the relevant code:

# Save the Excel file
wb.save(file_path)

In this code, the wb object represents an Excel workbook, and the save method is used to store the crawled data in an Excel file. This format offers flexibility for further data analysis, visualization, and manipulation using tools like Microsoft Excel or other data analysis software.

You can expand on this section by explaining how to read data from the Excel file, perform data analysis, and visualize the results. This is where the valuable insights from your web crawling efforts come to light.

With the knowledge of setting up your environment, making HTTP requests, parsing HTML, and building a web crawler, you have the essential tools and skills to explore and extract data from websites efficiently.

Full Python Code

import os
import signal
from bs4 import BeautifulSoup
import requests
from openpyxl import Workbook
from datetime import datetime
from urllib.parse import urlparse, urljoin
import urllib3
import sys

# Increase the recursion limit
sys.setrecursionlimit(3000)  # Set a higher recursion limit (adjust the value as needed)

# Replace xyz.com with your website URL to crawl
url = "https://www.xyz.com"
main_url = url
# Parse the domain from the main URL
main_domain = urlparse(main_url).netloc

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

skip_patterns = [
    "https://abc.xyz.com",
    "https://www.xyz.com/sub"   
    # Add more URL patterns to skip here
]

file_path = r'C:\Users\crawler.xlsx'

# Remove the existing file if it exists
if os.path.isfile(file_path):
    os.remove(file_path)

# Create a new workbook
wb = Workbook()
ws_internal = wb.active
ws_internal.title = 'Internal URLs'


# Define the headers for the Excel file
headers_internal = ['URL', 'Previous Linking URL']

# Write the headers to the internal worksheet
ws_internal.append(headers_internal)


# Set to store visited URLs
visited_urls = set()

counter = 0

# Signal handler for interrupt signal (Ctrl+C)
def interrupt_handler(signal, frame):
    # Save the Excel file
    wb.save(file_path)
    print('Excel file saved to:', file_path)
    exit(0)

# Register the signal handler
signal.signal(signal.SIGINT, interrupt_handler)

# Function to extract data from the website and write to Excel
def crawl_website(url, previous_url='', is_internal=True):
    # Check if the URL has already been visited or if execution was interrupted
    global counter
    global status_code 
    
    if url in visited_urls:
        #print("Skipping URL - visited URL:", url)
        return

    for pattern in skip_patterns:
        if url.startswith(pattern):
            print("Skipping URL:", url)
            return
    
    # Parse the domain from the current URL
    current_domain = urlparse(url).netloc

    # Check if the current URL is from the same domain as main_url
    if main_domain != current_domain:
        is_internal = False  # Mark the URL as external
        return

    # Add the URL to the visited set
    print("Adding URL - visited URL:", url)
    visited_urls.add(url)

    try:
        # Send a GET request to the website with verify=False for SSL certificate verification bypass
        response = requests.get(url, allow_redirects=False, verify=False)
        
    except requests.exceptions.InvalidSchema:
        # Skip invalid URLs
        return

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    
    # Extract the status code
    status_code = response.status_code

    # Get the current crawl timestamp
    crawl_timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    # Write the data to the appropriate worksheet based on internal/external URL
    if is_internal:
        worksheet = ws_internal
                
        # Write the data to the Excel sheet
        worksheet.append([url, previous_url])

        print("adding in excel:"+url)
        
        counter+=1
        print(f"Crawl Number: {counter}")
        
    # Crawl the URLs only for internal pages
    if is_internal:
        # Recursively crawl the links on the page
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href != '/' and not href.startswith('#'):
                if href.startswith('/'):
                    href = urljoin(main_url, href)
                if href not in visited_urls:
                    print(f"Crawling URL: {href}")
                    crawl_website(href, url, is_internal)
                    
        
                        

try:
    # Call the crawl_website function for the main URL
    status_code= ''
    crawl_website(url)
except Exception as e:
    print(f"An error occurred during the crawl: {str(e)}")

# Save the Excel file
wb.save(file_path)

# Print the file path
print('Excel file saved to:', file_path)

A note about Ethical Crawling

When embarking on your web crawling journey, remember to crawl your own website for practice and experimentation. Additionally, it’s crucial to respect the robots.txt file of websites you intend to crawl. The robots.txt file contains directives that guide web crawlers on which parts of a site are off-limits for crawling. Respecting these guidelines not only ensures ethical web crawling but also helps maintain a positive online ecosystem. Always be mindful of the terms of service and legalities of the websites you intend to crawl to avoid any potential legal issues or disruptions.

Troubleshooting and Advanced Techniques

As your web crawling projects become more sophisticated, you may encounter challenges and opportunities for improvement. You can evolve your crawling project to a fully fledged SEO tech audit tool that can edit technical errors on your website such as absence of meta titles, description, non-canonical links etc. In our next project we will be crawling the complete website along with their status, meta titles and description. With the knowledge and code you’ve gained from this tutorial, you’re well-equipped to embark on your web crawling journey. Happy crawling!

Subscribe
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x