
How to Parse HTML in Python: Simple Guide with Examples

To parse HTML in Python, use the BeautifulSoup class from the bs4 package (installed as beautifulsoup4). Pass your HTML content to BeautifulSoup, then use its methods to navigate the document tree and extract data.

📐

Syntax

Use BeautifulSoup by importing it from bs4. Create a BeautifulSoup object by passing your HTML string and a parser like html.parser. Then use methods like find() or find_all() to locate elements.

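BeautifulSoup(html, 'html.parser'): Parses the HTML string with Python's built-in parser and returns a soup object.
find(tag): Returns the first matching tag, or None if nothing matches.
find_all(tag): Returns a list of all matching tags (an empty list if nothing matches).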

python

from bs4 import BeautifulSoup

html = '<html><body><h1>Title</h1><p>Paragraph</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

first_h1 = soup.find('h1')
all_p = soup.find_all('p')
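
To make the return values concrete, here is a minimal sketch that reuses the same HTML string and prints what find() and find_all() give back, including the no-match case. The 'table' lookup is only there for illustration, since the sample HTML contains no table.

```python
from bs4 import BeautifulSoup

html = '<html><body><h1>Title</h1><p>Paragraph</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

first_h1 = soup.find('h1')      # a single Tag object
all_p = soup.find_all('p')      # a list-like collection of Tag objects

print(first_h1)                 # <h1>Title</h1>
print(first_h1.name)            # h1
print(len(all_p))               # 1

# No <table> exists in this HTML, so the two methods behave differently:
missing_one = soup.find('table')       # None
missing_all = soup.find_all('table')   # empty list
print(missing_one)
print(missing_all)
```

Because find() can return None while find_all() returns an empty list, looping over find_all() is safe even when nothing matches.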

💻

Example

This example shows how to parse a simple HTML string and extract the text of the first <h1> tag and of every <p> tag.

python

from bs4 import BeautifulSoup

html = '''
<html>
  <body>
    <h1>Welcome to Python</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Extract first h1 text
header = soup.find('h1').text

# Extract all paragraph texts
paragraphs = [p.text for p in soup.find_all('p')]

print('Header:', header)
print('Paragraphs:', paragraphs)

Output

Header: Welcome to Python
Paragraphs: ['This is a paragraph.', 'This is another paragraph.']

⚠️

Common Pitfalls

Common mistakes include:

Not installing beautifulsoup4 before use (pip install beautifulsoup4).
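Using the wrong parser or forgetting to specify one. Without an explicit parser, BeautifulSoup picks the best one installed and issues a warning, so results can differ between machines.
Assuming incomplete or malformed HTML will raise an error. BeautifulSoup usually builds a tree anyway, and different parsers repair broken markup differently, so inspect the result instead of relying on exceptions.
Accessing tags without checking if they exist. find() returns None when nothing matches, so calling .text on the result raises AttributeError.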
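
On the malformed-HTML point, the following is a small sketch rather than a definitive test. It feeds deliberately broken markup (the broken_html string is made up for illustration) to html.parser and shows that parsing still completes, so the thing to verify is the resulting tree.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <p> is closed, and the
# closing </body> and </html> tags are missing.
broken_html = '<html><body><p>First<p>Second'

# No exception is raised; the parser builds a tree as best it can.
soup = BeautifulSoup(broken_html, 'html.parser')

paragraphs = soup.find_all('p')
print('Paragraphs found:', len(paragraphs))

# Print the repaired tree to see how this parser closed the open tags.
# Other parsers (for example lxml or html5lib) may repair it differently.
print(soup.prettify())
```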

Always check if find() returns None before accessing .text.

python

from bs4 import BeautifulSoup

html = '<html><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Wrong: assumes h1 exists
# header = soup.find('h1').text  # This raises AttributeError

# Right: check before accessing
header_tag = soup.find('h1')
if header_tag is not None:
    header = header_tag.text
else:
    header = 'No h1 tag found'

print(header)

Output

No h1 tag found
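
If the same check is needed in several places, it can be wrapped in a small helper. The function below is a hypothetical convenience (first_text is not part of BeautifulSoup), sketched on the assumption that a default string is an acceptable fallback.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag_name, default=''):
    """Return the stripped text of the first matching tag, or default if absent.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    tag = soup.find(tag_name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


soup = BeautifulSoup('<html><body><p> Hello </p></body></html>', 'html.parser')

print(first_text(soup, 'p'))                      # Hello
print(first_text(soup, 'h1', 'No h1 tag found'))  # No h1 tag found
```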

📊

Quick Reference

Here is a quick summary of useful BeautifulSoup methods:

Method	Description
BeautifulSoup(html, 'html.parser')	Create a soup object to parse HTML
find(tag)	Find first occurrence of a tag
find_all(tag)	Find all occurrences of a tag
tag.text	Get text inside a tag
tag.attrs	Get attributes of a tag as a dictionary
soup.select(css_selector)	Find elements using CSS selectors
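
The last two rows of the table are not used in the earlier examples, so here is a short sketch of tag.attrs and soup.select(). The HTML snippet, the 'links' id, and the example.com URL are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="links">
  <li><a class="external" href="https://example.com/a">A</a></li>
  <li><a href="/local">Local</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# tag.attrs is a dictionary of the tag's attributes.
# Note that 'class' is stored as a list, because a tag can have several classes.
first_link = soup.find('a')
print(first_link.attrs['href'])    # https://example.com/a
print(first_link.attrs['class'])   # ['external']

# select() takes a CSS selector and returns a list of matching tags.
hrefs = [a['href'] for a in soup.select('#links a')]
print(hrefs)                       # ['https://example.com/a', '/local']

external = soup.select('a.external')
print(len(external))               # 1
```

Indexing with a['href'] raises KeyError when the attribute is missing; a.get('href') returns None instead, which may be the safer choice on real pages.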

✅

Key Takeaways

Use BeautifulSoup from bs4 to parse HTML easily in Python.

Always specify a parser like 'html.parser' when creating the BeautifulSoup object.

Check if elements exist before accessing their properties to avoid errors.

Install the library first using 'pip install beautifulsoup4'.

Use methods like find(), find_all(), and select() to navigate HTML.