How to Extract Text from a Webpage in Python Easily
To extract text from a webpage in Python, use the requests library to fetch the page content and BeautifulSoup from bs4 to parse the HTML and get the text. This method lets you easily access all readable text from the webpage.
Syntax
Use requests.get(url) to download the webpage content. Then create a BeautifulSoup object with the HTML content and a parser like html.parser. Finally, use .get_text() on the soup object to extract all text.
- requests.get(url): Fetches the webpage.
- BeautifulSoup(html, 'html.parser'): Parses HTML content.
- .get_text(): Extracts all text from parsed HTML.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)                         # download the page
soup = BeautifulSoup(response.text, 'html.parser')   # parse the HTML
text = soup.get_text()                               # extract all text
```
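To see what .get_text() returns without making a network request, the same calls can be run on an inline HTML string. The sketch below is only an illustration: the `html` sample is made up, and `separator` and `strip` are optional arguments of `get_text()` that the syntax above does not use.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a downloaded page
html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Default call: text nodes are joined with nothing between them
raw_text = soup.get_text()

# separator is placed between text nodes; strip=True trims each node first
spaced_text = soup.get_text(separator=" ", strip=True)

print(raw_text)     # TitleFirst bold paragraph.
print(spaced_text)  # Title First bold paragraph.
```

The default call runs the heading straight into the paragraph, which is why a separator is often worth passing when the text will be read by a person.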
Example
This example fetches the text from the example.com homepage and prints it. It shows how to use requests and BeautifulSoup together to get readable text from a webpage.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()

# strip() removes leading and trailing whitespace from the whole result
print(text.strip())
```
Output
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
Common Pitfalls
Common mistakes include:
- Not checking if the requests.get() call was successful before parsing.
- Parsing JavaScript-generated content, which requests cannot see.
- Extracting text without cleaning whitespace or unwanted tags.
Always check response.status_code and consider tools like Selenium for dynamic pages.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)

# Wrong: parsing without checking response
# soup = BeautifulSoup(response.text, 'html.parser')

# Right: parse only when the request succeeded
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # strip=True alone joins text nodes with no separator, so pass one
    text = soup.get_text(separator='\n', strip=True)
    print(text)
else:
    print(f'Failed to retrieve page: {response.status_code}')
```
Output
Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
More information...
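The third pitfall, cleaning whitespace and unwanted tags, can be handled before calling .get_text(). The following is a minimal sketch, not the only way to do it: the helper name `clean_text`, the `sample` markup, and the choice of which tags to drop are all assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

def clean_text(html):
    """Return page text with non-content tags removed and whitespace tidied."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove tags whose contents are not meant to be read as page text
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # One text node per line, each stripped of surrounding whitespace
    text = soup.get_text(separator="\n", strip=True)

    # Collapse runs of spaces and tabs left inside a line
    return re.sub(r"[ \t]+", " ", text)

# Hypothetical sample markup with a script, a style block, and extra spaces
sample = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Heading</h1><script>console.log('x');</script>
<p>Some    text   here.</p></body></html>
"""

print(clean_text(sample))
```

Note that the contents of the title tag are still part of the result, since the title is ordinary text as far as the parser is concerned. Add "title" to the list of removed tags if only the body text is wanted.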
Quick Reference
- requests.get(url): Download webpage content.
- BeautifulSoup(html, 'html.parser'): Parse HTML content.
- .get_text(): Extract all text from HTML.
- Check response.status_code: Ensure page loaded successfully.
- Use Selenium: For pages with JavaScript content.
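The fetch, check, parse, and extract steps listed above can be combined into one helper. This is a sketch under stated assumptions: the name `fetch_text`, the 10-second default timeout, and the `get` parameter are hypothetical choices, and `raise_for_status()` is used here as an alternative to comparing `status_code` by hand.

```python
import requests
from bs4 import BeautifulSoup

def fetch_text(url, timeout=10, get=requests.get):
    """Download url and return its text, raising on HTTP or network errors.

    `get` is a hypothetical injection point so the function can be tried
    without network access; it defaults to requests.get.
    """
    # A timeout keeps the call from waiting indefinitely on a slow server
    response = get(url, timeout=timeout)

    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)
```

A caller would typically wrap `fetch_text('https://example.com')` in a try block and catch `requests.RequestException`, which covers both failed status codes raised by `raise_for_status()` and connection or timeout errors.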
Key Takeaways
Use requests to fetch webpage HTML and BeautifulSoup to parse and extract text.
Always check the HTTP response status before parsing the content.
The requests library does not execute JavaScript, so it cannot see JavaScript-generated content; use a browser automation tool such as Selenium for that.
Use the .get_text() method to get all readable text from parsed HTML.
Clean extracted text as needed to remove extra whitespace or unwanted characters.