
Extract URLs from Text Using Python: Simple Guide

You can extract URLs from text in Python using the re module with a regular expression pattern that matches URLs. Use re.findall() to find all URLs in a string easily.
📐

Syntax

Use the re.findall(pattern, text) function to find all matches of a URL pattern in the given text.

  • pattern: A regular expression string that describes the URL format.
  • text: The string containing the text to search.
  • The function returns a list of all non-overlapping matches, in the order they appear in the text.
python
import re

pattern = r'https?://[\w\.-]+(?:/\S*)?'
text = "Visit https://example.com and http://test.org for info."
urls = re.findall(pattern, text)
print(urls)
Output
['https://example.com', 'http://test.org']
💻

Example

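This example shows how to extract all URLs from a text string using a simple regular expression. Because \S* in the path also picks up sentence punctuation, the function strips trailing punctuation from each match before returning the list.

python
import re

def extract_urls(text):
    pattern = r'https?://[\w\.-]+(?:/\S*)?'
    # \S* would keep the final "." of ".../3/.", so strip trailing punctuation
    return [url.rstrip('.,;:!?') for url in re.findall(pattern, text)]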

sample_text = "Here are some links: https://openai.com, http://github.com, and https://docs.python.org/3/."
urls = extract_urls(sample_text)
print(urls)
Output
['https://openai.com', 'http://github.com', 'https://docs.python.org/3/']
⚠️

Common Pitfalls

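Common mistakes include patterns that are too simple or incorrect, which miss URLs or capture extra characters such as trailing punctuation. Forgetting to escape special characters is another: an unescaped . matches any character, so the pattern usually raises no error but silently matches text that is not a URL.

For example, a pattern without https?:// that looks only for www. misses https://example.com entirely, and returns just the www.example.com part of a URL such as https://www.example.com/page.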

python
import re

# Too narrow: only matches hosts that start with www. and ignores the scheme
wrong_pattern = r'www\.[\w\.-]+'

text = "Check www.example.com and https://example.com"
wrong_urls = re.findall(wrong_pattern, text)
print("Wrong matches:", wrong_urls)

# Scheme-based pattern: matches http:// and https:// URLs (bare www. hosts are skipped)
correct_pattern = r'https?://[\w\.-]+(?:/\S*)?'
correct_urls = re.findall(correct_pattern, text)
print("Correct matches:", correct_urls)
Output
Wrong matches: ['www.example.com']
Correct matches: ['https://example.com']
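
Two further pitfalls can be sketched in the same way: an unescaped dot, and a sentence-ending period that the domain character class absorbs. The sample text and variable names below are illustrative assumptions, not part of the original example.

```python
import re

# Hypothetical sample text for illustration only.
text = "See wwwXexample.com and www.example.com, or visit https://example.com."

# Unescaped dots match any character, so "wwwXexample.com" is accepted too.
loose = re.findall(r'www.example.com', text)

# Escaped dots match only a literal ".".
strict = re.findall(r'www\.example\.com', text)

# [\w\.-]+ includes ".", so the period that ends the sentence is captured.
raw = re.findall(r'https?://[\w\.-]+(?:/\S*)?', text)

# One simple remedy: strip trailing punctuation from each match.
cleaned = [url.rstrip('.,;:!?') for url in raw]

print(loose)    # ['wwwXexample.com', 'www.example.com']
print(strict)   # ['www.example.com']
print(raw)      # ['https://example.com.']
print(cleaned)  # ['https://example.com']
```

Stripping punctuation after matching is only one option; it keeps the pattern short at the cost of a second pass over the results.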
📊

Quick Reference

Tips for extracting URLs:

  • Use re.findall() with a URL pattern.
  • Start pattern with https?:// to match http and https URLs.
  • Use character classes like [\w\.-] to match domain parts.
  • Use non-capturing groups (?:...)? for optional URL paths, so re.findall() returns the whole URL rather than only the group.
  • Test your pattern with different URL formats.
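
The non-capturing group matters because re.findall() returns only the captured text when the pattern contains a capturing group. A minimal sketch of the difference, using an assumed sample string:

```python
import re

# Hypothetical sample text for illustration only.
text = "Docs: https://docs.python.org/3/library/re.html and http://test.org"

# Capturing group: findall returns only the group's text, not the full URL.
# A URL with no path yields an empty string for the unmatched group.
with_capture = re.findall(r'https?://[\w\.-]+(/\S*)?', text)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'https?://[\w\.-]+(?:/\S*)?', text)

print(with_capture)     # ['/3/library/re.html', '']
print(without_capture)  # ['https://docs.python.org/3/library/re.html', 'http://test.org']
```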

Key Takeaways

Use Python's re.findall() with a proper URL regex pattern to extract URLs from text.
Include http and https in your pattern to match common web URLs.
Test your regex to avoid missing URLs or capturing wrong text.
Escape special characters such as . in regex patterns to prevent wrong matches.
Simple patterns work for basic URLs, but complex URLs (for example with ports or query strings) may need more advanced regex.
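
One way to test a pattern against different URL formats is to run it over a small list of samples and compare each match with the input. The sketch below does this for the pattern used throughout this guide; the sample URLs and the `results` structure are assumptions chosen for illustration.

```python
import re

PATTERN = re.compile(r'https?://[\w\.-]+(?:/\S*)?')

# Hypothetical sample URLs covering a few different formats.
samples = [
    "https://example.com",
    "https://example.com/search?q=python",
    "http://localhost:8080/api",
    "https://example.com?q=python",
]

results = []
for url in samples:
    match = PATTERN.search(url)
    found = match.group(0) if match else None
    # "full" means the pattern captured the entire sample URL.
    status = "full" if found == url else "partial"
    results.append((status, url, found))
    print(f"{status:8} {url} -> {found}")
```

With these samples, the first two URLs are captured in full, while the port and the path-less query string are cut off at the host name. That is the kind of gap a more advanced pattern would need to cover.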