
How to Scrape a Website Using Python: A Simple Guide with Example

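To scrape a website using Python, use the requests library to fetch the webpage content and BeautifulSoup from bs4 to parse the HTML. Both are third-party packages that can be installed with pip install requests beautifulsoup4. Once the HTML is parsed, you can extract data such as text or links by selecting HTML elements.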

Syntax

Here is the basic syntax to scrape a webpage:

  • import requests: to fetch the webpage.
  • from bs4 import BeautifulSoup: to parse HTML content.
  • response = requests.get(url): sends a request to the URL.
  • soup = BeautifulSoup(response.text, 'html.parser'): parses the HTML text.
  • soup.find() or soup.select(): to find elements in the HTML.
python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first <h1> tag
heading = soup.find('h1')
print(heading.text)
Output
Example Domain
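
The list above mentions soup.select(), but the snippet only uses soup.find(). The following is a minimal sketch of the difference, run against a small hypothetical HTML string (the markup, class name, and link paths are invented for illustration) so it works without a network request:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration; on a real page this would be response.text
html = """
<html><body>
  <h1 id="title">Sample Page</h1>
  <ul class="links">
    <li><a href="/about">About</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
  <a href="/external">Elsewhere</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
title = soup.find('h1').get_text()

# select() takes a CSS selector and returns a list of every matching tag
nav_links = soup.select('ul.links a')
nav_hrefs = [a['href'] for a in nav_links]

# select_one() returns only the first match for a CSS selector, or None
first_link_text = soup.select_one('ul.links a').get_text()

print(title)            # Sample Page
print(nav_hrefs)        # ['/about', '/contact']
print(first_link_text)  # About
```

Note that the selector 'ul.links a' skips the third link because it sits outside the list. In general, find() is convenient for a single tag, while select() is often easier when the target is described by nesting, classes, or ids.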

Example

This example fetches the homepage of example.com and prints the main heading text inside the <h1> tag.

python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

heading = soup.find('h1')
print(heading.text)
Output
Example Domain

Common Pitfalls

  • Not checking response status: Always check response.status_code to ensure the page loaded successfully (200 means OK).
  • Using the wrong parser: 'html.parser' ships with Python and needs no extra install; 'lxml' is faster but must be installed separately (pip install lxml).
  • Ignoring website rules: Always check robots.txt and terms of service before scraping.
  • Not handling errors: Use try-except blocks to handle network or parsing errors gracefully.
python
import requests
from bs4 import BeautifulSoup

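url = 'http://example.com'
# Set a timeout so the request cannot hang indefinitely
response = requests.get(url, timeout=10)

# Only parse the page if the server returned 200 (OK)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    heading = soup.find('h1')
    # find() returns None when no matching tag exists
    if heading:
        print(heading.text)
    else:
        print('Heading not found')
else:
    print(f'Failed to retrieve page: {response.status_code}')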
Output
Example Domain
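
The status check above covers pages that load with an error code, but not requests that fail outright (no connection, timeout, malformed URL). One possible way to apply the try-except advice from the list is sketched below. The helper names extract_heading and fetch_heading and the 10-second timeout are assumptions chosen for this example, not part of either library:

```python
import requests
from bs4 import BeautifulSoup


def extract_heading(html):
    """Return the text of the first <h1> in html, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('h1')
    return heading.get_text(strip=True) if heading else None


def fetch_heading(url, timeout=10):
    """Fetch url and return its first <h1> text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Base class for connection errors, timeouts, invalid URLs, HTTP errors
        print(f'Request failed: {exc}')
        return None
    return extract_heading(response.text)
```

Calling fetch_heading('http://example.com') should return the page heading when the site is reachable, and None (after printing the reason) when it is not. Keeping the parsing in its own function also makes it easy to test against saved HTML without touching the network.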

Quick Reference

Tips for effective web scraping with Python:

  • Use requests.get() to fetch pages.
  • Parse HTML with BeautifulSoup using 'html.parser'.
  • Use soup.find() or soup.select() to locate elements.
  • Check response.status_code before parsing.
  • Respect website rules and avoid heavy scraping.
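
As a rough illustration of the last tip, Python's standard library includes urllib.robotparser for checking robots.txt rules. The sketch below parses hypothetical rules from a list of strings so it runs offline; the rules and the 'my-scraper' user agent name are assumptions for this example. For a live site, you would instead call parser.set_url() with the site's robots.txt URL and then parser.read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so the sketch runs offline
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = 'my-scraper'  # hypothetical user agent name

# can_fetch() reports whether the rules allow this agent to request the URL
allowed = parser.can_fetch(user_agent, 'http://example.com/index.html')
blocked = parser.can_fetch(user_agent, 'http://example.com/private/data')

print(allowed)  # True
print(blocked)  # False
```

To avoid heavy scraping, a simple and common approach is to pause between requests, for example with time.sleep(1) inside the loop that fetches pages. The right delay depends on the site, and robots.txt does not replace reading the terms of service.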

Key Takeaways

  • Use requests to fetch webpage content and BeautifulSoup to parse HTML.
  • Always check the response status code before parsing the page.
  • Use soup.find() or soup.select() to extract specific HTML elements.
  • Respect website rules and handle errors to avoid scraping issues.