How to Handle Pagination in Web Scraping with Python
To handle pagination in Python, you need to identify the pattern of page URLs or navigation buttons and loop through them using requests or selenium. Extract data from each page inside the loop until no more pages are available.

Why This Happens
When scraping websites with multiple pages, a common mistake is to scrape only the first page and ignore the rest. This happens because the code does not loop through the pagination links or does not update the URL to the next page.
```python
import requests
from bs4 import BeautifulSoup

# Only fetches page 1 -- every other page is silently ignored.
url = 'https://example.com/products?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

products = soup.find_all('div', class_='product')
for product in products:
    print(product.text.strip())
```
The Fix
Change the code to loop through all pages by updating the page number in the URL. Stop when no products are found or when the last page is reached.
```python
import requests
from bs4 import BeautifulSoup

page = 1
while True:
    url = f'https://example.com/products?page={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    products = soup.find_all('div', class_='product')
    if not products:
        break  # No more pages
    for product in products:
        print(product.text.strip())
    page += 1
```
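On real sites the page URL often carries other query parameters (sort order, filters) that the f-string approach above would drop. A more robust way to update the page number is to rewrite only the `page` query parameter with the standard library; the URL and parameter names below are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number,
    preserving any other query parameters."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['page'] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

# Hypothetical URL with an extra 'sort' parameter that must survive.
print(with_page('https://example.com/products?sort=price&page=1', 3))
# https://example.com/products?sort=price&page=3
```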
Prevention
Always inspect the website's pagination structure before scraping. Use loops to navigate pages and check for the presence of next page links or data. Avoid hardcoding URLs for only one page.
Use tools like browser DevTools to find the pattern of page URLs or next page buttons. Consider using selenium if pagination requires clicking buttons or JavaScript rendering.
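When DevTools shows a `rel="next"` link instead of a numeric URL pattern, you can follow that link until it disappears. Here is a minimal standard-library sketch (no third-party parser needed); the sample HTML is invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> tag whose rel attribute is 'next'."""
    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('rel') == 'next' and self.next_href is None:
            self.next_href = attrs.get('href')

def find_next_url(current_url, html):
    """Return the absolute URL of the 'next' link, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # Resolve relative hrefs against the page we fetched them from.
    return urljoin(current_url, finder.next_href)

html = '<a rel="next" href="/products?page=2">Next</a>'
print(find_next_url('https://example.com/products?page=1', html))
# https://example.com/products?page=2
```

In a scraping loop, you would feed each fetched page's HTML to `find_next_url` and stop when it returns `None`.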
Related Errors
Common related errors include:
- Scraping the same page repeatedly due to not updating the URL.
- Missing data because the scraper stops too early.
- Getting blocked by the website for too many requests without delays.
Fix these by carefully managing page navigation, adding delays between requests, and respecting the website's robots.txt and terms of service.
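To avoid getting blocked, pace your requests rather than firing them as fast as the loop allows. One simple pattern is a throttle that enforces a minimum interval between calls; this is a sketch, and the 0.1-second interval is just an example:

```python
import time

class Throttle:
    """Ensure at least min_interval seconds elapse between successive calls."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_call = None

    def wait(self):
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()

throttle = Throttle(0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # each iteration would wrap a requests.get(...) call
elapsed = time.monotonic() - start
print(f"3 throttled calls took at least {elapsed:.2f}s")
```

Call `throttle.wait()` just before each `requests.get` in the pagination loop; the first call goes through immediately and later calls sleep only as long as needed.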