How to Remove HTML Tags from Text in NLP Easily

To remove HTML tags from text before NLP processing, you can use Python's re module with a non-greedy regex such as <.*?> to match and delete tags. Alternatively, use a library such as BeautifulSoup, which parses the HTML and extracts the clean text for you.

Syntax

Here are two common ways to remove HTML tags from text:

  • Using regex: Use re.sub(r'<.*?>', '', text) to replace tags with empty strings.
  • Using BeautifulSoup: Parse the HTML with BeautifulSoup(text, 'html.parser') and get clean text with .get_text().
python
import re
from bs4 import BeautifulSoup

# Regex syntax to remove HTML tags
def remove_tags_regex(text):
    clean = re.sub(r'<.*?>', '', text)
    return clean

# BeautifulSoup syntax to remove HTML tags
def remove_tags_bs(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

Example

This example shows how to remove HTML tags from a sample string using both regex and BeautifulSoup methods.

python
import re
from bs4 import BeautifulSoup

sample_text = '<p>Hello, <b>world</b>! Visit <a href="https://example.com">Example</a>.</p>'

# Remove tags using regex
def remove_tags_regex(text):
    return re.sub(r'<.*?>', '', text)

# Remove tags using BeautifulSoup
def remove_tags_bs(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

print('Original:', sample_text)
print('Regex:', remove_tags_regex(sample_text))
print('BeautifulSoup:', remove_tags_bs(sample_text))
Output
Original: <p>Hello, <b>world</b>! Visit <a href="https://example.com">Example</a>.</p>
Regex: Hello, world! Visit Example.
BeautifulSoup: Hello, world! Visit Example.
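
Both methods above replace each tag with nothing, which matters for NLP when block-level tags sit directly next to each other: the words on either side get glued together. The sketch below illustrates one possible workaround using only the standard library. The helper names strip_tags_glued and strip_tags_spaced are made up for this illustration and are not part of any library.

```python
import re

# Same non-greedy pattern as in the example above, compiled once.
TAG_RE = re.compile(r'<.*?>')

def strip_tags_glued(text):
    # Replace each tag with nothing: adjacent blocks run together.
    return TAG_RE.sub('', text)

def strip_tags_spaced(text):
    # Replace each tag with a space, then collapse runs of whitespace.
    spaced = TAG_RE.sub(' ', text)
    return ' '.join(spaced.split())

html_doc = '<p>First paragraph.</p><p>Second paragraph.</p>'

print(strip_tags_glued(html_doc))   # First paragraph.Second paragraph.
print(strip_tags_spaced(html_doc))  # First paragraph. Second paragraph.
```

The spaced version has its own cost: an inline tag followed by punctuation leaves a stray space, so '<p>Hello, <b>world</b>!</p>' becomes 'Hello, world !'. Which trade-off is acceptable depends on your tokenizer. BeautifulSoup's get_text() also accepts separator and strip arguments that serve a similar purpose.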

Common Pitfalls

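A regex is not an HTML parser. The pattern <.*?> handles simple, well-formed markup, including nested tags, but it can fail when an attribute value contains >, when a tag spans several lines (. does not match a newline unless re.DOTALL is set), or when the text itself contains a literal < followed later by a >. It also leaves HTML entities such as &amp; undecoded and keeps the contents of <script> and <style> elements. A greedy pattern such as <.*> is worse: it matches from the first < to the last > and deletes all the text in between.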

BeautifulSoup handles complex HTML better, but it is a third-party package that must be installed first (pip install beautifulsoup4).

python
import re
from bs4 import BeautifulSoup

text = '<div>Example <span>text <b>with</b> tags</span></div>'

# Incorrect regex that removes too much
wrong_regex = re.sub(r'<.*>', '', text)

# Correct regex
correct_regex = re.sub(r'<.*?>', '', text)

# BeautifulSoup method
soup = BeautifulSoup(text, 'html.parser')
bs_text = soup.get_text()

print('Wrong regex:', wrong_regex)
print('Correct regex:', correct_regex)
print('BeautifulSoup:', bs_text)
Output
Wrong regex:
Correct regex: Example text with tags
BeautifulSoup: Example text with tags
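
The non-greedy pattern is not safe on every input either. The following sketch reproduces a few inputs where <.*?> goes wrong, using only the standard library. The sample strings and variable names are invented for illustration.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')
TAG_RE_DOTALL = re.compile(r'<.*?>', re.DOTALL)

# 1. A '>' inside an attribute value ends the match too early,
#    so part of the tag leaks into the output.
attr_case = TAG_RE.sub('', '<a title="a > b" href="#">link</a>')

# 2. '.' does not match a newline, so a tag split across lines survives...
multiline = '<p\nclass="x">Hi</p>'
multiline_default = TAG_RE.sub('', multiline)
# ...unless the pattern is compiled with re.DOTALL.
multiline_dotall = TAG_RE_DOTALL.sub('', multiline)

# 3. A literal '<' and '>' in ordinary text is treated as a tag.
compare_case = TAG_RE.sub('', 'if a < b and c > d')

# 4. Entities are left encoded; html.unescape() decodes them afterwards.
entity_raw = TAG_RE.sub('', '<p>Tom &amp; Jerry</p>')
entity_decoded = html.unescape(entity_raw)

for label, value in [
    ('attribute with >', attr_case),
    ('multi-line tag', multiline_default),
    ('multi-line tag, DOTALL', multiline_dotall),
    ('literal < and >', compare_case),
    ('entity, raw', entity_raw),
    ('entity, decoded', entity_decoded),
]:
    print(f'{label}: {value!r}')
```

If your data contains cases like 1 or 3, no simple tweak to the pattern fixes them, and a real parser is the safer choice.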

Quick Reference

| Method | Usage | Pros | Cons |
|---|---|---|---|
| Regex | re.sub(r'<.*?>', '', text) | Simple, no extra install | Breaks on > inside attributes, multi-line tags, and malformed HTML |
| BeautifulSoup | BeautifulSoup(text, 'html.parser').get_text() | Handles complex HTML well | Requires external library |
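
If installing an external library is not an option, Python's standard-library html.parser module is a possible middle ground. This is a third approach not covered in the sections above. The sketch below is a minimal illustration, not a replacement for BeautifulSoup: the class name TextExtractor and the function remove_tags_stdlib are hypothetical, and skipping <script> and <style> content is a design choice made for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return ''.join(self._chunks)

def remove_tags_stdlib(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()

print(remove_tags_stdlib('<a title="a > b" href="#">link</a>'))  # link
print(remove_tags_stdlib('<p>Tom &amp; Jerry</p>'))               # Tom & Jerry
```

Because this is a real tokenizer rather than a pattern match, it copes with a > inside a quoted attribute value and decodes entities in one pass. It is less forgiving of badly broken markup than BeautifulSoup, so test it on your own data.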

Key Takeaways

  • Use regex with pattern <.*?> for simple HTML tag removal.
  • BeautifulSoup is more reliable for complex or malformed HTML.
  • Avoid greedy regex patterns that remove too much content.
  • Always test your method on your specific HTML data.
  • Removing HTML tags is a key preprocessing step in NLP text cleaning.