How to Read HTML Tables in Pandas: A Simple Guide
Use the pandas.read_html() function to read tables from an HTML page or file. It returns a list of DataFrames, one for each table found in the HTML content.
Syntax
The basic syntax to read HTML tables with pandas is:
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, attrs=None)
Here:
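- io: URL, file path, or file-like object containing HTML. Since pandas 2.1, passing a literal HTML string directly is deprecated; wrap it in io.StringIO instead.
- match: Regex used to filter tables; only tables containing matching text are returned.
- flavor: Parser to use: 'lxml', or 'bs4' / 'html5lib' (BeautifulSoup with html5lib). The default, None, tries lxml first and falls back to bs4 with html5lib.
- header: Row number (or list of row numbers) to use as column names. With the default, None, pandas infers the header from <thead> or leading <th> rows.
- index_col: Column to use as the index.
- attrs: Dictionary of HTML attributes used to filter tables, for example {'id': 'table'}.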
```python
import pandas as pd

# Every argument except io is shown with its default value
tables = pd.read_html(
    io='file_or_url.html',
    match='.+',
    flavor=None,
    header=None,
    index_col=None,
    attrs=None,
)
```
Example
This example reads tables from a simple HTML string and shows the first table as a DataFrame.
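```python
from io import StringIO

import pandas as pd

html = '''
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
'''

# Wrap the literal HTML in StringIO; passing a raw string is deprecated since pandas 2.1
tables = pd.read_html(StringIO(html))
df = tables[0]  # read_html returns a list, so take the first table
print(df)
```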
Output
```
    Name  Age
0  Alice   30
1    Bob   25
```
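The header and index_col parameters matter when a table does not mark its header cells with <th>. The following is a minimal sketch, assuming a made-up table built only from <td> cells and an installed parser (lxml, or beautifulsoup4 with html5lib):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: every cell is a <td>, so pandas has no
# <th> row from which to infer the column names.
html = """
<table>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

# header=0 promotes the first row to column names;
# index_col=0 uses the first column as the index.
df = pd.read_html(StringIO(html), header=0, index_col=0)[0]
print(df)
```

Without header=0, the "Name" and "Age" row would be treated as ordinary data and the columns would be numbered 0 and 1.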
Common Pitfalls
Common mistakes when reading HTML tables include:
- Not installing a parser: pandas needs lxml, or beautifulsoup4 together with html5lib, to parse HTML.
- Assuming read_html returns a single DataFrame; it returns a list of DataFrames.
- Not specifying match or attrs to filter the correct table when multiple tables exist.
```python
from io import StringIO

import pandas as pd

html = '<table><tr><td>1</td></tr></table><table><tr><td>2</td></tr></table>'

# Wrong: expecting a single DataFrame
df = pd.read_html(StringIO(html))  # This returns a list

# Right: access the first table explicitly
first_table = pd.read_html(StringIO(html))[0]
print(first_table)
```
Output
```
   0
0  1
```
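The third pitfall, picking the right table when a page contains several, can be handled with match or attrs. The sketch below assumes a hypothetical page with two tables; the id values and cell contents are invented for illustration:

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; ids and contents are made up.
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
</table>
<table id="people">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
</table>
"""

# match keeps only tables containing text that matches the regex.
by_text = pd.read_html(StringIO(html), match="Alice")

# attrs keeps only tables whose HTML attributes match the dictionary.
by_attr = pd.read_html(StringIO(html), attrs={"id": "prices"})

print(len(by_text), list(by_text[0].columns))
print(len(by_attr), list(by_attr[0].columns))
```

Both calls still return a list, so the selected table is accessed with [0] even when only one table passes the filter.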
Quick Reference
| Parameter | Description | Default |
|---|---|---|
| io | URL, file path, or HTML string to read from | Required |
| match | Regex to filter tables by text content | '.+' (any non-empty text) |
| flavor | Parser to use ('lxml', 'bs4', or 'html5lib') | None (tries lxml, then bs4 with html5lib) |
| header | Row number to use as column names | None (inferred from header rows) |
| index_col | Column to use as index | None |
| attrs | Dictionary of HTML attributes to filter tables | None |
Key Takeaways
Use pandas.read_html() to extract tables from HTML into DataFrames.
read_html returns a list of DataFrames, one per table found.
Install 'lxml', or 'beautifulsoup4' together with 'html5lib', to enable HTML parsing.
Use 'match' or 'attrs' parameters to select specific tables when multiple exist.
Always check the output list length before accessing tables.
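One way to make the last point concrete is a small wrapper that checks the list length before indexing. This is only a sketch: read_table is a hypothetical helper, not part of pandas, and it assumes the parser libraries are installed so that a page without tables surfaces as a ValueError.

```python
from io import StringIO

import pandas as pd


def read_table(html: str, position: int = 0) -> pd.DataFrame:
    """Return the table at `position` (hypothetical helper, not a pandas API)."""
    try:
        tables = pd.read_html(StringIO(html))
    except ValueError as exc:
        # pandas raises ValueError when no table is found
        raise LookupError("no tables found in the HTML") from exc
    if position >= len(tables):
        raise LookupError(
            f"requested table {position}, but only {len(tables)} found"
        )
    return tables[position]
```

Calling read_table(html, 1) on a page with a single table then fails with a descriptive LookupError instead of a bare IndexError.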