
How to Parse HTML in Python: Simple Guide with Examples

To parse HTML in Python, use the BeautifulSoup class from the bs4 module (installed as the beautifulsoup4 package). Load your HTML content into a BeautifulSoup object, then use its methods to navigate the document and extract data.

Syntax

Import BeautifulSoup from bs4, then create a BeautifulSoup object by passing your HTML string and a parser name such as 'html.parser' (Python's built-in parser). Call find() or find_all() on that object to locate elements.

  • BeautifulSoup(html, 'html.parser'): Parses the HTML string and returns a navigable tree.
  • find(tag): Returns the first matching tag, or None if there is no match.
  • find_all(tag): Returns a list of all matching tags (empty if there is no match).
python
from bs4 import BeautifulSoup

html = '<html><body><h1>Title</h1><p>Paragraph</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

first_h1 = soup.find('h1')
all_p = soup.find_all('p')
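
Both methods also accept filters beyond the tag name. The following is a minimal sketch, assuming a made-up HTML snippet whose class and id values (intro, main) are purely illustrative. It narrows the search with the class_ and id keyword arguments and shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class and id names are illustrative only
html = (
    '<div id="main">'
    '<p class="intro">First</p>'
    '<p>Second</p>'
    '<p class="intro">Third</p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) filters on the CSS class
intro_texts = [p.text for p in soup.find_all('p', class_='intro')]

# id can be passed as a plain keyword argument
main_div = soup.find(id='main')

# No <h2> exists: find() gives None, find_all() gives an empty list
missing = soup.find('h2')
no_matches = soup.find_all('h2')

print(intro_texts)               # ['First', 'Third']
print(main_div.name)             # div
print(missing, len(no_matches))  # None 0
```

The trailing underscore in class_ is needed because class is a reserved word in Python.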

Example

This example parses a simple HTML string, then extracts the text inside the first <h1> tag and inside every <p> tag.

python
from bs4 import BeautifulSoup

html = '''
<html>
  <body>
    <h1>Welcome to Python</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Extract first h1 text
header = soup.find('h1').text

# Extract all paragraph texts
paragraphs = [p.text for p in soup.find_all('p')]

print('Header:', header)
print('Paragraphs:', paragraphs)
Output
Header: Welcome to Python
Paragraphs: ['This is a paragraph.', 'This is another paragraph.']

Common Pitfalls

Common mistakes include:

  • Not installing beautifulsoup4 before use (pip install beautifulsoup4).
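  • Omitting the parser argument, which makes BeautifulSoup pick whichever parser is installed and emit a warning, so results can vary between machines.
  • Assuming malformed HTML yields the same tree everywhere. BeautifulSoup usually does not raise on broken markup, but html.parser, lxml, and html5lib can repair it differently.
  • Calling .text on the result of find() without checking it. find() returns None when nothing matches, so the access raises AttributeError.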

Always check if find() returns None before accessing .text.

python
from bs4 import BeautifulSoup

html = '<html><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Wrong: assumes h1 exists
# header = soup.find('h1').text  # This raises AttributeError

# Right: check before accessing
header_tag = soup.find('h1')
if header_tag is not None:
    header = header_tag.text
else:
    header = 'No h1 tag found'

print(header)
Output
No h1 tag found
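
If the same check is needed in many places, it can be wrapped in a small function. The sketch below uses a hypothetical helper named text_or_default (not part of BeautifulSoup) and also feeds it deliberately malformed HTML to show that html.parser builds a best-effort tree instead of raising an error. Other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, tag_name, default=''):
    """Hypothetical helper: text of the first matching tag, or a default."""
    tag = soup.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else default


# Malformed input: neither <p> nor <b> is ever closed
broken_html = '<p>Unclosed <b>bold'
soup = BeautifulSoup(broken_html, 'html.parser')

# The open tags are closed implicitly, so <b> can still be found
bold_text = text_or_default(soup, 'b')

# No <h1> exists, so the default is returned instead of raising
title_text = text_or_default(soup, 'h1', default='No h1 tag found')

print(bold_text)   # bold
print(title_text)  # No h1 tag found
```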

Quick Reference

Here is a quick summary of useful BeautifulSoup methods:

Method | Description
BeautifulSoup(html, 'html.parser') | Create a soup object to parse HTML
find(tag) | Find first occurrence of a tag
find_all(tag) | Find all occurrences of a tag
tag.text | Get text inside a tag
tag.attrs | Get attributes of a tag as a dictionary
soup.select(css_selector) | Find elements using CSS selectors
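
The last two entries, tag.attrs and soup.select(), are not used in the earlier examples. Here is a brief sketch of both, assuming an invented list of links (the example.com URLs and the external class are placeholders):

```python
from bs4 import BeautifulSoup

# Placeholder markup; URLs and class names are illustrative only
html = '''
<ul id="links">
  <li><a href="https://example.com/a" class="external">A</a></li>
  <li><a href="https://example.com/b">B</a></li>
  <li><a>No href</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags
anchors = soup.select('ul#links a')

# attrs is a dictionary; multi-valued attributes such as class are lists
first_attrs = anchors[0].attrs

# tag.get() returns None for a missing attribute instead of raising KeyError
hrefs = [a.get('href') for a in anchors]

# A class selector picks out only the anchors carrying that class
external = [a.text for a in soup.select('a.external')]

print(first_attrs)  # {'href': 'https://example.com/a', 'class': ['external']}
print(hrefs)        # ['https://example.com/a', 'https://example.com/b', None]
print(external)     # ['A']
```

Using tag.get('href') rather than tag['href'] follows the same defensive idea as checking find() for None.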

Key Takeaways

  • Use BeautifulSoup from bs4 to parse HTML easily in Python.
  • Always specify a parser like 'html.parser' when creating the BeautifulSoup object.
  • Check if elements exist before accessing their properties to avoid errors.
  • Install the library first using 'pip install beautifulsoup4'.
  • Use methods like find(), find_all(), and select() to navigate HTML.