Extract URLs from Text Using Python: Simple Guide
You can extract URLs from text in Python using the re module with a regular expression pattern that matches URLs. Use re.findall() to find all URLs in a string.
Syntax
Use the re.findall(pattern, text) function to find all matches of a URL pattern in the given text.
- pattern: A regular expression string that describes the URL format.
- text: The string containing the text to search.
- The function returns a list of all matching URLs.
python
import re

pattern = r'https?://[\w\.-]+(?:/\S*)?'
text = "Visit https://example.com and http://test.org for info."
urls = re.findall(pattern, text)
print(urls)
Output
['https://example.com', 'http://test.org']
Example
This example shows how to extract all URLs from a text string using a simple regular expression. It prints the list of URLs found.
python
import re

def extract_urls(text):
    pattern = r'https?://[\w\.-]+(?:/\S*)?'
    return re.findall(pattern, text)

sample_text = "Here are some links: https://openai.com, http://github.com, and https://docs.python.org/3/."
urls = extract_urls(sample_text)
print(urls)
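Output
['https://openai.com', 'http://github.com', 'https://docs.python.org/3/.']
Note that the last match ends with the sentence's final period: \S* in the path part matches any non-whitespace character, including punctuation.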
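One way to handle punctuation that trails a URL in running text is to trim it after matching. The following is a minimal sketch rather than a definitive fix; the helper name extract_urls_trimmed and the characters listed in TRAILING_PUNCTUATION are assumptions chosen for illustration.

```python
import re

# Same pattern as in the example above.
URL_PATTERN = r'https?://[\w\.-]+(?:/\S*)?'

# Assumption: these characters are sentence punctuation, not part of the URL.
TRAILING_PUNCTUATION = '.,;:!?)"\''


def extract_urls_trimmed(text):
    """Find URLs, then strip punctuation that trails each match."""
    return [url.rstrip(TRAILING_PUNCTUATION) for url in re.findall(URL_PATTERN, text)]


sample_text = (
    "Docs: https://docs.python.org/3/. "
    "Source: https://github.com/python/cpython, "
    "see also http://example.com."
)
print(extract_urls_trimmed(sample_text))
# ['https://docs.python.org/3/', 'https://github.com/python/cpython', 'http://example.com']
```

The trade-off is that a URL which legitimately ends in one of these characters, such as a closing parenthesis, is also trimmed, so the character set should be adjusted to the text being processed.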
Common Pitfalls
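Common mistakes include patterns that are too simple or incorrect, so they miss URLs or capture extra characters. Forgetting to escape special characters also changes what a pattern matches: outside a character class, an unescaped . matches any character, not only a literal dot.
For example, a pattern without https?:// misses URLs that are written with a scheme, or matches only part of them, and can match text that is not a URL.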
python
import re

# Wrong pattern (misses http/https and matches wrong text)
wrong_pattern = r'www\.[\w\.-]+'
text = "Check www.example.com and https://example.com"
wrong_urls = re.findall(wrong_pattern, text)
print("Wrong matches:", wrong_urls)

# Correct pattern
correct_pattern = r'https?://[\w\.-]+(?:/\S*)?'
correct_urls = re.findall(correct_pattern, text)
print("Correct matches:", correct_urls)
Output
Wrong matches: ['www.example.com']
Correct matches: ['https://example.com']
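Each pattern above finds only one of the two links in the text. If both forms should count as URLs, the two prefixes can be combined with alternation. This is a sketch under the assumption that a URL starts with http://, https://, or a bare www. prefix; the name combined_pattern is hypothetical.

```python
import re

# Assumption: a URL begins with "http://", "https://", or "www." at a word boundary.
combined_pattern = r'\b(?:https?://|www\.)[\w\.-]+(?:/\S*)?'

text = "Check www.example.com and https://example.com"
print(re.findall(combined_pattern, text))
# ['www.example.com', 'https://example.com']
```

Because the prefix alternatives sit in a non-capturing group, re.findall() still returns the full matched text for each URL.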
Quick Reference
Tips for extracting URLs:
- Use re.findall() with a URL pattern.
- Start pattern with https?:// to match http and https URLs.
- Use character classes like [\w\.-] to match domain parts.
- Use non-capturing groups (?:...)? for optional URL paths.
- Test your pattern with different URL formats.
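The non-capturing group matters because re.findall() returns the contents of the group, not the whole match, when a pattern contains a single capturing group. The short comparison below illustrates this; the variable names are only for demonstration.

```python
import re

text = "See https://docs.python.org/3/library/re.html for details"

# Capturing group: findall returns only the text captured by the group.
capturing = r'https?://[\w\.-]+(/\S*)?'
print(re.findall(capturing, text))
# ['/3/library/re.html']

# Non-capturing group: findall returns the whole match.
non_capturing = r'https?://[\w\.-]+(?:/\S*)?'
print(re.findall(non_capturing, text))
# ['https://docs.python.org/3/library/re.html']
```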
Key Takeaways
Use Python's re.findall() with a proper URL regex pattern to extract URLs from text.
Include http and https in your pattern to match common web URLs.
Test your regex to avoid missing URLs or capturing wrong text.
Escape special characters such as . in regex patterns so that they match literally.
Simple patterns work for basic URLs, but complex URLs may need a more advanced regex.
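Instead of making the regex more elaborate, one option is to keep the simple pattern and check each candidate with the standard library's urllib.parse.urlsplit(). The sketch below is one possible approach, not the only one; the helper names and the rule for what counts as a plausible host are assumptions.

```python
import re
from urllib.parse import urlsplit

URL_PATTERN = r'https?://[\w\.-]+(?:/\S*)?'


def looks_like_domain(host):
    # Assumption: a plausible host contains a dot and has no
    # leading or trailing dot or hyphen.
    return '.' in host and host[0] not in '.-' and host[-1] not in '.-'


def extract_checked_urls(text):
    """Return regex matches whose host part passes a basic sanity check."""
    checked = []
    for candidate in re.findall(URL_PATTERN, text):
        try:
            host = urlsplit(candidate).hostname or ''
        except ValueError:
            # urlsplit rejects some malformed inputs; skip those candidates.
            continue
        if host and looks_like_domain(host):
            checked.append(candidate)
    return checked


text = "Good: https://example.com/path?q=1 Odd: http://localhost and https://..."
print(extract_checked_urls(text))
# ['https://example.com/path?q=1']
```

This check also drops hosts without a dot, such as localhost, so the rule should be relaxed if such URLs are expected in the input.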