
How to Easily Extract Text from a Webpage in Python

To extract text from a webpage in Python, use the requests library to fetch the page's HTML and BeautifulSoup (from the bs4 package) to parse it. Calling .get_text() on the parsed document then returns the page's text as a single string.

Syntax

Use requests.get(url) to download the webpage content. Then create a BeautifulSoup object with the HTML content and a parser like html.parser. Finally, use .get_text() on the soup object to extract all text.

  • requests.get(url): Fetches the webpage.
  • BeautifulSoup(html, 'html.parser'): Parses HTML content.
  • .get_text(): Extracts all text from parsed HTML.
python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()
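
The parsing and extraction steps above can also be tried without a network connection by handing BeautifulSoup an HTML string directly. The sketch below uses a made-up snippet (the `html` string is an assumption for illustration only) and shows two optional arguments of `get_text()`, `separator` and `strip`, which control how the individual pieces of text are joined.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Default call: the text pieces are concatenated exactly as they appear.
raw = soup.get_text()

# separator is placed between text pieces; strip=True trims each piece
# and drops pieces that are empty after trimming.
spaced = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(spaced)
```

With the default call, the heading and the paragraph run together because nothing separates them in the markup, which is why a separator is often worth passing.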

Example

This example fetches the text from the example.com homepage and prints it. It shows how to use requests and BeautifulSoup together to get readable text from a webpage.

python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()
print(text.strip())
Output (line breaks condensed; the exact text depends on the page's current content)
Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information...
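
On many pages the raw result of `get_text()` contains runs of newlines and indentation carried over from the HTML source. A minimal sketch of collapsing that whitespace into single spaces follows; it parses a made-up HTML string rather than a live page, and the helper name `collapse_whitespace` is hypothetical.

```python
from bs4 import BeautifulSoup


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return " ".join(text.split())


# Hypothetical HTML with source indentation, used only for illustration.
html = """
<html>
  <body>
    <h1>Example   Heading</h1>
    <p>
      Some text
      spread over lines.
    </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
clean = collapse_whitespace(soup.get_text())
print(clean)
```

Calling `str.split()` with no arguments splits on any whitespace and discards empty pieces, so joining the result with a single space normalizes the whole string in one step.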

Common Pitfalls

Common mistakes include:

  • Not checking if the requests.get() call was successful before parsing.
  • Expecting JavaScript-generated content in the result; requests does not run scripts, so it never sees that content.
  • Extracting text without cleaning whitespace or unwanted tags.

Check response.status_code before parsing, and for pages that build their content with JavaScript consider a browser-automation tool such as Selenium.

python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)

# Wrong: parsing without checking response
# soup = BeautifulSoup(response.text, 'html.parser')

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # separator=' ' keeps text from adjacent elements from running together
    text = soup.get_text(separator=' ', strip=True)
    print(text)
else:
    print(f'Failed to retrieve page: {response.status_code}')
Output
Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information...
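
Besides comparing `status_code` to 200, requests provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses, and a `timeout` argument so that an unresponsive server cannot hang the script. The sketch below wraps both in a hypothetical helper named `fetch_text`; the 10-second timeout is an arbitrary assumption, not a recommended value.

```python
import requests
from bs4 import BeautifulSoup


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and HTTP errors raised above.
        print(f"Request failed: {exc}")
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    text = fetch_text("https://example.com")
    if text is not None:
        print(text)
```

Returning `None` on failure keeps the caller's logic simple; a larger program might prefer to let the exception propagate instead.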

Quick Reference

  • requests.get(url): Download webpage content.
  • BeautifulSoup(html, 'html.parser'): Parse HTML content.
  • .get_text(): Extract all text from HTML.
  • Check response.status_code: Ensure page loaded successfully.
  • Use Selenium: For pages with JavaScript content.
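
Why requests plus BeautifulSoup misses JavaScript content can be seen without a browser: the parser only sees the HTML that was downloaded and never runs scripts. In this sketch the HTML string is made up for illustration; a real browser would fill the `div`, but the parsed tree leaves it empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page whose visible content is inserted by a script.
html = """
<div id="app"></div>
<script>
  document.getElementById("app").textContent = "Loaded by JavaScript";
</script>
"""

soup = BeautifulSoup(html, "html.parser")
app = soup.find(id="app")

# The script was never executed, so the div has no text.
print(repr(app.get_text()))
```

If the element you need comes back empty like this, the content is probably being rendered in the browser, and a tool that drives a real browser is the usual next step.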

Key Takeaways

Use requests to fetch webpage HTML and BeautifulSoup to parse and extract text.
Always check the HTTP response status before parsing the content.
requests does not execute JavaScript, so content generated in the browser is missing from the response; use a tool such as Selenium for those pages.
Use the .get_text() method to get the text from the parsed HTML.
Clean extracted text as needed to remove extra whitespace or unwanted characters.
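
One way to clean the result is to drop unwanted elements before extracting the text. The sketch below removes `nav` and `footer` elements with `decompose()`; the HTML string and the choice of tags to remove are assumptions for illustration, and a real page may need a different list.

```python
from bs4 import BeautifulSoup

# Hypothetical page with navigation and footer text around the main content.
html = """
<body>
  <nav>Home | About</nav>
  <p>Main article text.</p>
  <footer>Copyright notice</footer>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# decompose() removes the element and everything inside it from the tree.
for tag in soup.find_all(["nav", "footer"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)
```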