How to Extract Links from a Webpage in Python Easily
To extract links from a webpage in Python, use the requests library to fetch the page content and BeautifulSoup from bs4 to parse the HTML. Then find all <a> tags and read their href attributes to collect the links.
Syntax
Here is the basic syntax to extract links from a webpage:
- requests.get(url): Fetches the webpage content.
- BeautifulSoup(html, 'html.parser'): Parses the HTML content.
- soup.find_all('a'): Finds all anchor tags.
- tag.get('href'): Extracts the link URL from each anchor tag.
```python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = [tag.get('href') for tag in soup.find_all('a')]
print(links)
```
Example
This example fetches the webpage at http://example.com, parses it, and prints all the links found in the page.
```python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = [tag.get('href') for tag in soup.find_all('a') if tag.get('href')]
for link in links:
    print(link)
```
Output
https://www.iana.org/domains/example
Common Pitfalls
Common mistakes when extracting links include:
- Not checking if href exists before accessing it, which can cause errors.
- Extracting relative URLs without converting them to absolute URLs.
- Not handling network errors when fetching the page.
Always check for None before using href, and consider using urllib.parse.urljoin to build full absolute URLs.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Wrong way: might include None or relative URLs
links_wrong = [tag.get('href') for tag in soup.find_all('a')]

# Right way: filter out None and convert to absolute URLs
links_right = [urljoin(url, tag.get('href'))
               for tag in soup.find_all('a') if tag.get('href')]

print('Wrong:', links_wrong)
print('Right:', links_right)
```
Output
Wrong: ['https://www.iana.org/domains/example']
Right: ['https://www.iana.org/domains/example']
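The third pitfall, unhandled network errors, can be covered with a try-except around the request. Here is a minimal sketch; extract_links is an illustrative helper name, not part of requests or bs4, and the timeout value is an arbitrary choice:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(url):
    """Fetch a page and return its absolute links, or [] on network errors."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        # Covers connection failures, timeouts, and HTTP error statuses
        print(f'Failed to fetch {url}: {exc}')
        return []
    soup = BeautifulSoup(response.text, 'html.parser')
    return [urljoin(url, tag.get('href'))
            for tag in soup.find_all('a') if tag.get('href')]

links = extract_links('http://example.com')
print(links)
```

Because requests.RequestException is the base class for the library's errors, a single except clause handles DNS failures, timeouts, and bad status codes alike, and the caller always gets a list back.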
Quick Reference
Tips for extracting links:
- Use requests to get page content.
- Parse HTML with BeautifulSoup.
- Find all <a> tags and get href attributes.
- Filter out None values.
- Convert relative URLs to absolute URLs with urllib.parse.urljoin.
- Handle network errors with try-except.
Key Takeaways
- Use requests and BeautifulSoup to fetch and parse webpage HTML.
- Extract links by finding all <a> tags and getting their href attributes.
- Always check if href exists before using it to avoid errors.
- Convert relative URLs to absolute URLs for reliable links.
- Handle network errors when fetching webpages.