
How to Read HTML Tables in Pandas: A Simple Guide

Use the pandas.read_html() function to read tables from an HTML page or file. It returns a list of DataFrames, each representing a table found in the HTML content.

Syntax

The basic syntax to read HTML tables with pandas is:

  • pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, attrs=None)

Here:

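  • io: URL, file path, or file-like object containing HTML. Passing a literal HTML string is deprecated since pandas 2.1; wrap it in io.StringIO instead.
  • match: Regex to filter tables by text. The default '.+' keeps every table that contains some text.
  • flavor: Parser to use: 'lxml', or 'bs4' / 'html5lib' (BeautifulSoup with html5lib). The default None tries lxml first and falls back to bs4 with html5lib.
  • header: Row number to use as column names. Default is None.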
  • index_col: Column to use as index.
  • attrs: Dictionary to filter tables by HTML attributes.
python
import pandas as pd

# All arguments after io are shown with their default values
tables = pd.read_html(io='file_or_url.html', match='.+', flavor=None, header=None, index_col=None, attrs=None)
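
To make the header and index_col parameters more concrete, here is a minimal sketch. The HTML snippet is made up for illustration and deliberately uses only <td> cells, so the header row has to be chosen explicitly; it assumes a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML: no <th> cells, so pandas cannot tell which row is the header.
html = """
<table>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

# header=0 promotes the first row to column names;
# index_col=0 then uses the first column ("Name") as the index.
tables = pd.read_html(StringIO(html), header=0, index_col=0)
df = tables[0]
print(df)
```

With these settings the DataFrame should have a single Age column indexed by Name, so a lookup such as df.loc["Alice", "Age"] works directly.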

Example

This example reads tables from a simple HTML string and shows the first table as a DataFrame.

python
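from io import StringIO

import pandas as pd

html = '''
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
'''

# Wrap the literal HTML in StringIO; passing a raw string is deprecated since pandas 2.1
tables = pd.read_html(StringIO(html))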
df = tables[0]
print(df)
Output
    Name  Age
0  Alice   30
1    Bob   25

Common Pitfalls

Common mistakes when reading HTML tables include:

  • Not installing a parser: pandas needs lxml, or beautifulsoup4 together with html5lib, to parse HTML.
  • Assuming read_html returns a single DataFrame; it returns a list of DataFrames.
  • Not specifying match or attrs to filter the correct table when multiple tables exist.
python
from io import StringIO

import pandas as pd

html = '<table><tr><td>1</td></tr></table><table><tr><td>2</td></tr></table>'

# Wrong: treating the result as a single DataFrame
df = pd.read_html(StringIO(html))  # This is a list of two DataFrames

# Right: access the first table explicitly
first_table = pd.read_html(StringIO(html))[0]
print(first_table)
Output
   0
0  1
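
The third pitfall, picking the right table when several exist, can be sketched as follows. The HTML and its id values are invented for illustration, and the example assumes a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; the id values are made up for this example.
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
</table>
<table id="staff">
  <tr><th>Name</th><th>Role</th></tr>
  <tr><td>Alice</td><td>Editor</td></tr>
</table>
"""

# match keeps only tables whose text matches the regex.
by_text = pd.read_html(StringIO(html), match="Role")

# attrs keeps only tables whose HTML attributes match the dictionary.
by_attr = pd.read_html(StringIO(html), attrs={"id": "prices"})

print(by_text[0])
print(by_attr[0])
```

Both calls still return a list, here with one DataFrame each, so the [0] index is still needed.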

Quick Reference

Parameter | Description | Default
io | URL, file path, or file-like object to read from | Required
match | Regex to filter tables by text content | '.+' (any table containing text)
flavor | Parser to use ('lxml', 'bs4', or 'html5lib') | None (lxml first, then bs4 with html5lib)
header | Row number to use as column names | None
index_col | Column to use as index | None
attrs | Dictionary of HTML attributes to filter tables | None

Key Takeaways

  • Use pandas.read_html() to extract tables from HTML into DataFrames.
  • read_html returns a list of DataFrames, one per table found.
  • Install lxml, or beautifulsoup4 together with html5lib, to enable HTML parsing.
  • Use the match or attrs parameters to select specific tables when multiple exist.
  • Always check the length of the returned list before indexing into it.
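
The last takeaway can be sketched as a small helper. The function read_table_at is hypothetical and not part of pandas; it assumes a parser such as lxml is installed and that pandas signals "no tables found" with a ValueError, which is its usual behavior.

```python
from io import StringIO
from typing import Optional

import pandas as pd


def read_table_at(html: str, position: int = 0) -> Optional[pd.DataFrame]:
    """Return the table at `position`, or None if it does not exist.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        tables = pd.read_html(StringIO(html))
    except ValueError:
        # pandas typically raises ValueError when no tables are found.
        return None
    # Check the list length before indexing to avoid an IndexError.
    if position >= len(tables):
        return None
    return tables[position]


html = "<table><tr><td>1</td></tr></table>"
print(read_table_at(html, 0))  # the only table on the page
print(read_table_at(html, 3))  # None: there is no fourth table
```

Returning None keeps the caller's logic simple; raising a custom error would be an equally reasonable design if a missing table should stop the program.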