
How to Read HTML Table in Pandas: Simple Guide

Use the pandas.read_html() function to read tables from an HTML page or file. It returns a list of DataFrames, each representing a table found in the HTML content.

Syntax

The basic syntax to read HTML tables with pandas, showing only the most commonly used parameters, is:

pandas.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, attrs=None)

Here:

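io: URL, file path, or file-like object containing HTML. Passing a raw HTML string is deprecated since pandas 2.1; wrap it in io.StringIO instead.
match: Regex or string; only tables containing matching text are returned. The default '.+' matches any non-empty table.
flavor: Parser to use: 'lxml', or 'bs4' / 'html5lib' (synonyms for BeautifulSoup with html5lib). The default None tries lxml first and falls back to bs4 with html5lib if that fails.
header: Row number(s) to use as column names. With the default None, rows made of <th> cells (or inside <thead>) become the header.
index_col: Column (or list of columns) to use as the index.
attrs: Dictionary of HTML attributes the <table> tag must have, for example {'id': 'table'}.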

python

import pandas as pd

# Every argument after io is shown with its default value
tables = pd.read_html(io='file_or_url.html', match='.+', flavor=None, header=None, index_col=None, attrs=None)
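To make header and index_col more concrete, here is a minimal sketch. The HTML snippet and its values are made up for illustration, and it assumes a parser such as lxml is installed. The table uses plain <td> cells in its first row, so header=0 is passed explicitly to promote that row to column names, and index_col=0 turns the first column into the index.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: the first row holds the column names,
# but it is written with <td> cells rather than <th> cells.
html = """
<table>
  <tr><td>Name</td><td>Age</td></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

# header=0 uses the first row as column names;
# index_col=0 uses the first column ("Name") as the index.
df = pd.read_html(StringIO(html), header=0, index_col=0)[0]
print(df)
```

After this call, "Name" is the index and "Age" is the only remaining column.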

Example

This example reads tables from a simple HTML string and shows the first table as a DataFrame.

python

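from io import StringIO

import pandas as pd

html = '''
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
'''

# Wrap the HTML string in StringIO; passing a raw string is deprecated since pandas 2.1
tables = pd.read_html(StringIO(html))
df = tables[0]  # read_html returns a list, so take the first table
print(df)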

Output

    Name  Age
0  Alice   30
1    Bob   25

Common Pitfalls

Common mistakes when reading HTML tables include:

Not installing a parser: pandas needs lxml, or beautifulsoup4 together with html5lib, to parse HTML.
Assuming read_html returns a single DataFrame; it returns a list of DataFrames.
Not specifying match or attrs to filter the correct table when multiple tables exist.

python

from io import StringIO

import pandas as pd

html = '<table><tr><td>1</td></tr></table><table><tr><td>2</td></tr></table>'

# Wrong: expecting a single DataFrame
df = pd.read_html(StringIO(html))  # This is a list of two DataFrames

# Right: access the first table explicitly
first_table = pd.read_html(StringIO(html))[0]
print(first_table)

Output

   0
0  1
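The third pitfall, picking the right table when a page has several, can be sketched as follows. The table ids and contents are hypothetical, and a parser such as lxml is assumed to be installed. match keeps only tables whose text matches the regex, while attrs keeps only tables whose tag carries the given attributes.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; the ids and values are made up.
html = """
<table id="people">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
</table>
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
</table>
"""

# Filter by text: only tables containing "Price" are returned.
by_text = pd.read_html(StringIO(html), match="Price")

# Filter by attribute: only the table with id="people" is returned.
by_attr = pd.read_html(StringIO(html), attrs={"id": "people"})

print(by_text[0])
print(by_attr[0])
```

Both calls still return a list, here with one DataFrame each, so the [0] index is still needed.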

Quick Reference

Parameter	Description	Default
io	URL, file path, or file-like object (wrap HTML strings in io.StringIO)	Required
match	Regex to filter tables by text content	'.+' (any non-empty table)
flavor	Parser to use ('lxml', 'bs4', or 'html5lib')	None (lxml, then bs4 with html5lib)
header	Row number(s) to use as column names	None (<th> rows, if any)
index_col	Column to use as index	None
attrs	Dictionary of HTML attributes to filter tables	None

Key Takeaways

Use pandas.read_html() to extract tables from HTML into DataFrames.

read_html returns a list of DataFrames, one per table found.

Install 'lxml', or 'beautifulsoup4' together with 'html5lib', to enable HTML parsing.

Use 'match' or 'attrs' parameters to select specific tables when multiple exist.

Check the length of the returned list before indexing into it; read_html raises a ValueError when no table matches.
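One way to inspect what came back before indexing into the list is sketched below. The HTML and the summary variable are illustrative only, and a parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables of different sizes.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
<table>
  <tr><th>City</th></tr>
  <tr><td>Paris</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
print(f"Found {len(tables)} tables")

# Summarize each table's position, shape, and column names
# so the right index can be chosen deliberately.
summary = [(i, t.shape, list(t.columns)) for i, t in enumerate(tables)]
for row in summary:
    print(row)
```

A quick summary like this makes it easier to choose an index, or to decide that match or attrs is the better tool.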