How to Remove HTML Tags from Text in NLP Easily
To remove HTML tags from text in NLP, you can use Python's re module with a regex pattern like <.*?> to find and delete tags. Alternatively, use a library like BeautifulSoup, which parses the HTML and extracts the clean text.
Syntax
Here are two common ways to remove HTML tags from text:
- Using regex: Use re.sub(r'<.*?>', '', text) to replace tags with empty strings.
- Using BeautifulSoup: Parse the HTML with BeautifulSoup(text, 'html.parser') and get clean text with .get_text().
python
import re
from bs4 import BeautifulSoup

# Regex syntax to remove HTML tags
def remove_tags_regex(text):
    clean = re.sub(r'<.*?>', '', text)
    return clean

# BeautifulSoup syntax to remove HTML tags
def remove_tags_bs(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()
Example
This example shows how to remove HTML tags from a sample string using both regex and BeautifulSoup methods.
python
import re
from bs4 import BeautifulSoup

sample_text = '<p>Hello, <b>world</b>! Visit <a href="https://example.com">Example</a>.</p>'

# Remove tags using regex
def remove_tags_regex(text):
    return re.sub(r'<.*?>', '', text)

# Remove tags using BeautifulSoup
def remove_tags_bs(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

print('Original:', sample_text)
print('Regex:', remove_tags_regex(sample_text))
print('BeautifulSoup:', remove_tags_bs(sample_text))
Output
Original: <p>Hello, <b>world</b>! Visit <a href="https://example.com">Example</a>.</p>
Regex: Hello, world! Visit Example.
BeautifulSoup: Hello, world! Visit Example.
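One thing the sample above does not exercise is adjacent block-level tags. Both methods simply delete the markup, so words on either side of a tag boundary can end up glued together, which matters for tokenization. The sketch below uses a made-up `html` string to show the effect, together with two possible workarounds: the `separator` and `strip` arguments of `get_text()`, and a regex variant that substitutes a space and then collapses whitespace.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input with adjacent block-level tags
html = '<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>'

# Deleting tags outright glues neighbouring words together
glued_regex = re.sub(r'<.*?>', '', html)
glued_bs = BeautifulSoup(html, 'html.parser').get_text()

# Workaround 1: ask BeautifulSoup to join text pieces with a space
spaced_bs = BeautifulSoup(html, 'html.parser').get_text(separator=' ', strip=True)

# Workaround 2: replace each tag with a space, then collapse whitespace
spaced_regex = ' '.join(re.sub(r'<.*?>', ' ', html).split())

print('Glued (regex):', glued_regex)
print('Glued (BeautifulSoup):', glued_bs)
print('Spaced (BeautifulSoup):', spaced_bs)
print('Spaced (regex):', spaced_regex)
```

Both glued variants print TitleFirst paragraph.Second paragraph., while both spaced variants print Title First paragraph. Second paragraph. The trade-off is that a separator is also inserted around inline tags such as <b>, which can leave a stray space before punctuation.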
Common Pitfalls
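The pattern <.*?> handles nested tags, but it can still fail on attribute values that contain >, on tags that span several lines (by default . does not match a newline), and on plain text that contains both < and >, such as 1 < 2 and 3 > 2. It also leaves HTML entities such as &amp; undecoded. Regex is not a full HTML parser, so on malformed or unusual HTML it may delete real text or leave fragments of markup behind.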
BeautifulSoup handles complex HTML better but requires installing an external library.
python
import re
from bs4 import BeautifulSoup

text = '<div>Example <span>text <b>with</b> tags</span></div>'

# Incorrect regex that removes too much
wrong_regex = re.sub(r'<.*>', '', text)

# Correct regex
correct_regex = re.sub(r'<.*?>', '', text)

# BeautifulSoup method
soup = BeautifulSoup(text, 'html.parser')
bs_text = soup.get_text()

print('Wrong regex:', wrong_regex)
print('Correct regex:', correct_regex)
print('BeautifulSoup:', bs_text)
Output
Wrong regex:
Correct regex: Example text with tags
BeautifulSoup: Example text with tags
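Beyond the greedy pattern shown above, the non-greedy pattern has failure modes of its own. The following sketch runs it on three made-up inputs (the variable names are illustrative only): an attribute value containing >, plain text containing < and >, and a tag split across two lines.

```python
import re

def remove_tags_regex(text):
    return re.sub(r'<.*?>', '', text)

# 1. The '>' inside the attribute value ends the match too early
attr_case = '<a title="a > b" href="#">link</a>'

# 2. Ordinary text between '<' and '>' is mistaken for a tag
math_case = 'Check that 1 < 2 and 3 > 2.'

# 3. '.' does not match a newline, so the opening tag is never matched
multiline_case = '<p\nclass="intro">Hello</p>'

for case in (attr_case, math_case, multiline_case):
    print(repr(case), '->', repr(remove_tags_regex(case)))
```

The first case leaves the fragment ' b" href="#">link' behind, the second silently deletes "< 2 and 3 >" from the sentence, and the third keeps the whole opening tag. Passing flags=re.DOTALL to re.sub would address the third case only.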
Quick Reference
| Method | Usage | Pros | Cons |
|---|---|---|---|
| Regex | re.sub(r'<.*?>', '', text) | Simple, no extra install | Fails on attributes containing >, multi-line tags, malformed HTML |
| BeautifulSoup | BeautifulSoup(text, 'html.parser').get_text() | Handles complex HTML well | Requires external library |
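If installing an external library is not an option, Python's standard library html.parser module may serve as a middle ground. This is a third technique, not one of the two methods in the table, and the sketch below is only a minimal illustration: `TagStripper` and `remove_tags_stdlib` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

def remove_tags_stdlib(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()

sample = '<p>Tom &amp; Jerry <a title="a > b" href="#">link</a></p>'
print(remove_tags_stdlib(sample))
```

For this sample the function prints Tom & Jerry link: the parser copes with the > inside the quoted attribute and decodes the entity, with no extra install. It offers none of BeautifulSoup's navigation features, so it suits plain text extraction only.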
Key Takeaways
Use regex with pattern <.*?> for simple HTML tag removal.
BeautifulSoup is more reliable for complex or malformed HTML.
Avoid greedy regex patterns that remove too much content.
Always test your method on your specific HTML data.
Removing HTML tags is a key preprocessing step in NLP text cleaning.
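One possible way to act on the advice about testing is a small audit that runs a cleaning function over a sample of your own documents and flags suspicious results. The sketch below is an assumption-laden heuristic, not a complete check: `LEFTOVER_RE`, `audit_cleaner`, and the `samples` list are hypothetical, and the pattern simply treats any remaining angle bracket or entity as a warning sign.

```python
import re

def remove_tags_regex(text):
    return re.sub(r'<.*?>', '', text)

# Heuristic: leftover angle brackets or undecoded entities in cleaned text
LEFTOVER_RE = re.compile(r'[<>]|&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);')

def audit_cleaner(cleaner, samples):
    """Return (original, cleaned) pairs whose cleaned text looks suspicious."""
    suspicious = []
    for sample in samples:
        cleaned = cleaner(sample)
        if LEFTOVER_RE.search(cleaned):
            suspicious.append((sample, cleaned))
    return suspicious

# Replace these made-up strings with a sample of your own HTML data
samples = [
    '<p>Hello, <b>world</b>!</p>',
    '<a title="a > b" href="#">link</a>',
    '<p>Tom &amp; Jerry</p>',
]

flagged = audit_cleaner(remove_tags_regex, samples)
for original, cleaned in flagged:
    print(repr(original), '->', repr(cleaned))
```

Here the second and third samples are flagged. Text that legitimately contains < or > will also be flagged, so the output is a list to review by hand and not a verdict.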
