Python · How-To · Beginner · 3 min read

How to Extract Links from Webpage in Python Easily

To extract links from a webpage in Python, use the requests library to fetch the page content and BeautifulSoup from bs4 to parse the HTML. Then find all <a> tags and get their href attributes to collect the links.
📐

Syntax

Here is the basic syntax to extract links from a webpage:

  • requests.get(url): Fetches the webpage content.
  • BeautifulSoup(html, 'html.parser'): Parses the HTML content.
  • soup.find_all('a'): Finds all anchor tags.
  • tag.get('href'): Extracts the link URL from each anchor tag.
python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = [tag.get('href') for tag in soup.find_all('a')]
print(links)
💻

Example

This example fetches the webpage at http://example.com, parses it, and prints all the links found in the page.

python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = [tag.get('href') for tag in soup.find_all('a') if tag.get('href')]
for link in links:
    print(link)
Output
https://www.iana.org/domains/example
⚠️

Common Pitfalls

Common mistakes when extracting links include:

  • Not checking if href exists before accessing it, which can cause errors.
  • Extracting relative URLs without converting them to absolute URLs.
  • Not handling network errors when fetching the page.

Always check for None before using href and consider using urllib.parse.urljoin to get full URLs.

python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://example.com'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Wrong way: might include None or relative URLs
links_wrong = [tag.get('href') for tag in soup.find_all('a')]

# Right way: filter None and convert to absolute URLs
links_right = [urljoin(url, tag.get('href')) for tag in soup.find_all('a') if tag.get('href')]

print('Wrong:', links_wrong)
print('Right:', links_right)
Output
Wrong: ['https://www.iana.org/domains/example']
Right: ['https://www.iana.org/domains/example']
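Instead of filtering out None values by hand, you can also use BeautifulSoup's attribute filter so that only anchor tags that actually carry an href are matched in the first place. A minimal sketch using an inline HTML snippet (the sample markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="/a">A</a><a name="anchor">no href</a><a href="/b">B</a>'
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only <a> tags that have an href attribute,
# so the list comprehension never sees a tag without one.
links = [tag['href'] for tag in soup.find_all('a', href=True)]
print(links)  # ['/a', '/b']
```

Note that the middle `<a name="anchor">` tag is skipped automatically, so no None check is needed.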
📊

Quick Reference

Tips for extracting links:

  • Use requests to get page content.
  • Parse HTML with BeautifulSoup.
  • Find all <a> tags and get href attributes.
  • Filter out None values.
  • Convert relative URLs to absolute URLs with urllib.parse.urljoin.
  • Handle network errors with try-except.
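The last tip, handling network errors, can be sketched as follows. This is a minimal example: the function name, the 10-second timeout, and the example.com URL are illustrative choices, not fixed conventions.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(url):
    """Fetch a page and return its links as absolute URLs, or [] on failure."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, and bad HTTP status codes.
        print(f'Failed to fetch {url}: {err}')
        return []
    soup = BeautifulSoup(response.text, 'html.parser')
    return [urljoin(url, tag['href']) for tag in soup.find_all('a', href=True)]

print(extract_links('http://example.com'))
```

Because every requests exception inherits from RequestException, one except clause is enough to keep a crawler running when a single page is unreachable.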