How to Handle Pagination in Web Scraping with Python
To handle pagination in Python, you need to identify the pattern of page URLs or navigation buttons and loop through them using requests or selenium. Extract data from each page inside the loop until no more pages are available.

Why This Happens
When scraping websites with multiple pages, a common mistake is to scrape only the first page and ignore the rest. This happens because the code does not loop through the pagination links or does not update the URL to the next page.
```python
import requests
from bs4 import BeautifulSoup

# Only fetches page 1 -- every other page is silently ignored.
url = 'https://example.com/products?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

products = soup.find_all('div', class_='product')
for product in products:
    print(product.text.strip())
```
The Fix
Change the code to loop through all pages by updating the page number in the URL. Stop when no products are found or when the last page is reached.
```python
import requests
from bs4 import BeautifulSoup

page = 1
while True:
    url = f'https://example.com/products?page={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    products = soup.find_all('div', class_='product')
    if not products:
        break  # No more pages
    for product in products:
        print(product.text.strip())
    page += 1
```
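On real sites the page URL often carries other query parameters (sort order, filters) that the f-string approach above would drop. A more robust way to update the page number is to rewrite only the `page` query parameter with the standard library; the URL and parameter names below are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number,
    preserving any other query parameters."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['page'] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

# Hypothetical URL with an extra 'sort' parameter that must survive.
print(with_page('https://example.com/products?sort=price&page=1', 3))
# https://example.com/products?sort=price&page=3
```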
Prevention
Always inspect the website's pagination structure before scraping. Use loops to navigate pages and check for the presence of next page links or data. Avoid hardcoding URLs for only one page.
Use tools like browser DevTools to find the pattern of page URLs or next page buttons. Consider using selenium if pagination requires clicking buttons or JavaScript rendering.
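When DevTools shows a `rel="next"` link instead of a numeric URL pattern, you can follow that link until it disappears. Here is a minimal standard-library sketch (no third-party parser needed); the sample HTML is invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> tag whose rel attribute is 'next'."""
    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('rel') == 'next' and self.next_href is None:
            self.next_href = attrs.get('href')

def find_next_url(current_url, html):
    """Return the absolute URL of the 'next' link, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # Resolve relative hrefs against the page we fetched them from.
    return urljoin(current_url, finder.next_href)

html = '<a rel="next" href="/products?page=2">Next</a>'
print(find_next_url('https://example.com/products?page=1', html))
# https://example.com/products?page=2
```

In a scraping loop, you would feed each fetched page's HTML to `find_next_url` and stop when it returns `None`.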
Related Errors
Common related errors include:
- Scraping the same page repeatedly due to not updating the URL.
- Missing data because the scraper stops too early.
- Getting blocked by the website for too many requests without delays.
Fix these by carefully managing page navigation, adding delays between requests, and respecting the website's robots.txt and terms of service.
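To avoid getting blocked, pace your requests rather than firing them as fast as the loop allows. One simple pattern is a throttle that enforces a minimum interval between calls; this is a sketch, and the 0.1-second interval is just an example:

```python
import time

class Throttle:
    """Ensure at least min_interval seconds elapse between successive calls."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_call = None

    def wait(self):
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()

throttle = Throttle(0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # each iteration would wrap a requests.get(...) call
elapsed = time.monotonic() - start
print(f"3 throttled calls took at least {elapsed:.2f}s")
```

Call `throttle.wait()` just before each `requests.get` in the pagination loop; the first call goes through immediately and later calls sleep only as long as needed.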