How to Parse HTML in Python: Simple Guide with Examples
To parse HTML in Python, use the BeautifulSoup library from bs4. Load your HTML content into BeautifulSoup and then use its methods to navigate and extract data from the HTML structure.
Syntax
Use BeautifulSoup by importing it from bs4. Create a BeautifulSoup object by passing your HTML string and a parser like html.parser. Then use methods like find() or find_all() to locate elements.
- BeautifulSoup(html, 'html.parser'): Parses the HTML string.
- find(tag): Finds the first occurrence of a tag.
- find_all(tag): Finds all occurrences of a tag.
python
from bs4 import BeautifulSoup

html = '<html><body><h1>Title</h1><p>Paragraph</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

first_h1 = soup.find('h1')
all_p = soup.find_all('p')
Example
This example shows how to parse a simple HTML string and extract the text inside the first <h1> tag and inside every <p> tag.
python
from bs4 import BeautifulSoup

html = '''
<html>
  <body>
    <h1>Welcome to Python</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Extract first h1 text
header = soup.find('h1').text

# Extract all paragraph texts
paragraphs = [p.text for p in soup.find_all('p')]

print('Header:', header)
print('Paragraphs:', paragraphs)
Output
Header: Welcome to Python
Paragraphs: ['This is a paragraph.', 'This is another paragraph.']
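
Real pages usually carry attributes and stray whitespace around text. The following is a minimal sketch, assuming a hypothetical snippet whose class name intro is made up for illustration, of narrowing find_all() by class and trimming the result with get_text(strip=True):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the class name "intro" is an assumption.
html = '''
<body>
  <p class="intro">  First paragraph.  </p>
  <p>Second paragraph.</p>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')

# class_ (with a trailing underscore) filters on the HTML class attribute,
# because "class" is a reserved word in Python.
intro_paragraphs = soup.find_all('p', class_='intro')

# get_text(strip=True) removes leading and trailing whitespace from the text.
intro_texts = [p.get_text(strip=True) for p in intro_paragraphs]

print(intro_texts)
```

Only the paragraph carrying the intro class is returned, and its surrounding spaces are removed.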
Common Pitfalls
Common mistakes include:
- Not installing beautifulsoup4 before use (pip install beautifulsoup4).
- Using the wrong parser or forgetting to specify one.
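- Parsing incomplete or malformed HTML without validating the result. BeautifulSoup rarely raises on bad markup; it builds a best-effort tree, and different parsers may repair the same input differently.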
- Accessing tags without checking that they exist. find() returns None when nothing matches, so calling .text on the result raises AttributeError.
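
On the malformed-HTML point, it may help to see what actually happens with broken input. The sketch below uses a deliberately unclosed snippet (made up for illustration) and the built-in 'html.parser'. In this case no exception is raised and the open tags are closed automatically. Other parsers may repair the same input differently, so treat this as one observation rather than a guarantee.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <div> nor <p> is closed.
broken_html = '<div><p>Hello'

# No exception is raised here; the open tags are closed when the input ends.
soup = BeautifulSoup(broken_html, 'html.parser')

paragraph = soup.find('p')
print(paragraph.text)                # Hello
print(soup.find('div') is not None)  # True
```

Because parsing rarely fails loudly, checking the extracted values is usually more useful than wrapping the parse call in try/except.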
Always check if find() returns None before accessing .text.
python
from bs4 import BeautifulSoup

html = '<html><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Wrong: assumes h1 exists
# header = soup.find('h1').text  # This raises AttributeError

# Right: check before accessing
header_tag = soup.find('h1')
if header_tag:
    header = header_tag.text
else:
    header = 'No h1 tag found'

print(header)
Output
No h1 tag found
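
When the same check is needed in many places, it can be wrapped in a small helper. The function name text_or_default below is hypothetical and not part of BeautifulSoup. It is one possible sketch of the pattern and relies only on find() and get_text().

```python
from bs4 import BeautifulSoup


def text_or_default(soup, tag_name, default=''):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = soup.find(tag_name)
    # find() returns None when nothing matches, so compare explicitly.
    if tag is None:
        return default
    return tag.get_text(strip=True)


soup = BeautifulSoup('<html><body><p>Only a paragraph</p></body></html>', 'html.parser')

print(text_or_default(soup, 'p'))
print(text_or_default(soup, 'h1', default='No h1 tag found'))
```

The first call prints the paragraph text; the second falls back to the default because the document has no h1 tag.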
Quick Reference
Here is a quick summary of useful BeautifulSoup methods:
| Method | Description |
|---|---|
| BeautifulSoup(html, 'html.parser') | Create a soup object to parse HTML |
| find(tag) | Find first occurrence of a tag |
| find_all(tag) | Find all occurrences of a tag |
| tag.text | Get text inside a tag |
| tag.attrs | Get attributes of a tag as a dictionary |
| soup.select(css_selector) | Find elements using CSS selectors |
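
The last two rows of the table are not shown in the earlier examples. Here is a short sketch of tag.attrs and soup.select(), assuming a hypothetical list of links (the URLs and class names are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the URLs and class names are assumptions.
html = '''
<ul class="links">
  <li><a href="https://example.com/a" class="external">A</a></li>
  <li><a href="https://example.com/b">B</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')

# tag.attrs is a dictionary; multi-valued attributes such as class come back as lists.
print(first_link.attrs)

# tag.get() returns None instead of raising KeyError when the attribute is missing.
print(first_link.get('href'))
print(first_link.get('title'))

# select() takes a CSS selector and returns a list-like result (empty if nothing matches).
hrefs = [a['href'] for a in soup.select('ul.links a')]
print(hrefs)
```

Using tag.get('name') rather than tag['name'] follows the same defensive idea as checking find() for None: a missing attribute yields None instead of an exception.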
Key Takeaways
Use BeautifulSoup from bs4 to parse HTML easily in Python.
Always specify a parser like 'html.parser' when creating the BeautifulSoup object.
Check if elements exist before accessing their properties to avoid errors.
Install the library first using 'pip install beautifulsoup4'.
Use methods like find(), find_all(), and select() to navigate HTML.