Bash Scripting · How-To · Beginner · 2 min read

Bash Script to Extract URLs from File

Use grep -oE "https?://[^ \"'>]+" filename in Bash to extract all http and https URLs from a file. The pattern is double-quoted because Bash does not allow an escaped single quote (\') inside a single-quoted string.
📋

Examples

Input: Visit https://example.com and http://test.org for info.
Output:
https://example.com
http://test.org

Input: No URLs here, just text.
Output: (nothing)

Input: Check https://site.com/page?query=1 and https://another-site.org.
Output:
https://site.com/page?query=1
https://another-site.org.

(Note that in the last example the trailing period is captured too: '.' is a valid URL character and is not in the excluded character class.)
🧠

How to Think About It

To extract URLs from a file, look for patterns starting with http or https followed by characters that are valid in URLs. Use a regular expression with grep to find and print only those matches.
📐

Algorithm

1. Read the input file line by line.
2. Search each line for substrings that start with 'http://' or 'https://'.
3. Extract the matching substrings that form valid URLs.
4. Print each found URL on its own line.
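The four steps above can also be sketched with Bash's own regex engine, without grep. This is illustrative only (the script in the Code section uses grep), and the function name extract_urls is an assumption, not from the original:

```shell
# Pure-Bash sketch of the algorithm above (illustrative only).
extract_urls() {
  local line rest url
  local re='https?://[^ "'\''>]+'   # same character class as the grep version
  while IFS= read -r line; do       # Step 1: read the file line by line
    rest="$line"
    while [[ $rest =~ $re ]]; do    # Step 2: find the next http(s):// substring
      url="${BASH_REMATCH[0]}"      # Step 3: the matched URL
      printf '%s\n' "$url"          # Step 4: print each URL on its own line
      rest="${rest#*"$url"}"        # continue scanning after the match
    done
  done < "$1"
}
```

This avoids spawning an external process per file, but grep is shorter and faster for large inputs.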
💻

Code

bash
#!/bin/bash

# Extract URLs from a file passed as argument
if [ $# -eq 0 ]; then
  echo "Usage: $0 filename"
  exit 1
fi

filename="$1"
grep -oE "https?://[^ \"'>]+" "$filename"
Output
https://example.com
http://test.org
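A quick way to try this out is to run the same grep command against a small sample file (the filename sample.txt is illustrative):

```shell
# Build a two-line sample file, then extract its URLs with the same grep call.
cat > sample.txt <<'EOF'
Visit https://example.com and http://test.org for info.
No URLs here, just text.
EOF

grep -oE "https?://[^ \"'>]+" sample.txt
# → https://example.com
# → http://test.org
```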
🔍

Dry Run

Let's trace the input 'Visit https://example.com and http://test.org for info.' through the code:

1. Read line
Line: Visit https://example.com and http://test.org for info.

2. Apply grep regex
Matches found: https://example.com, http://test.org

3. Print matches
Output lines: https://example.com and http://test.org, each on its own line

Step | Action | Value
1 | Read line | Visit https://example.com and http://test.org for info.
2 | Extract URLs | https://example.com, http://test.org
3 | Print URLs | https://example.com and http://test.org, one per line
💡

Why This Works

Step 1: Use grep with regex

The grep -oE command searches for patterns and prints only the matched parts, not the whole line.

Step 2: Regex pattern explained

The pattern https?://[^ "'>]+ matches 'http' or 'https', then '://', then one or more characters that are not a space, a quote, or '>', since those characters commonly mark the end of a URL in surrounding text.
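A quick check (the sample text is made up) shows the character class at work: quotes and angle brackets cut the match off cleanly:

```shell
# Quotes and '>' are excluded from the class, so URLs wrapped in
# <...> or "..." are extracted without the surrounding punctuation.
printf 'See <https://example.com/docs> and "http://test.org/a" today.\n' \
  | grep -oE "https?://[^ \"'>]+"
# → https://example.com/docs
# → http://test.org/a
```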

Step 3: Extract URLs line by line

This approach reads each line and extracts all URLs found, printing each on its own line for easy use.

🔄

Alternative Approaches

Using awk
bash
awk '{while (match($0, /https?:\/\/[^ "'\''>]+/)) {print substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + RLENGTH)}}' filename
More flexible but more verbose; good for lines containing several URLs. The '\'' sequence is how a literal single quote is embedded in a single-quoted shell string.
Using sed
bash
sed -n "s/.*\(https\?:\/\/[^ \"'>]*\).*/\1/p" filename
Simpler, but prints at most one URL per line (the last one, because the leading .* is greedy).
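Running the three approaches on the same sample line makes the trade-off concrete. Note that sed's greedy leading .* captures the last URL on the line, not the first:

```shell
printf 'Visit https://example.com and http://test.org for info.\n' > cmp.txt

# grep: both URLs, one per line
grep -oE "https?://[^ \"'>]+" cmp.txt

# awk: both URLs, one per line
awk '{while (match($0, /https?:\/\/[^ "'\''>]+/)) {
        print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + RLENGTH)
      }}' cmp.txt

# sed: only http://test.org (the last URL on the line)
sed -n "s/.*\(https\?:\/\/[^ \"'>]*\).*/\1/p" cmp.txt
```

(The \? quantifier in the sed pattern is a GNU sed extension to basic regular expressions.)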

Complexity: O(n) time, O(k) space

Time Complexity

The script reads each line once and applies a regex match, so time grows linearly with file size (O(n)).

Space Complexity

Only stores matched URLs temporarily, so space depends on number of URLs found (O(k)), generally small compared to input.

Which Approach is Fastest?

Using grep is the fastest and simplest option for extracting URLs; awk is more flexible but slower; sed prints at most one URL per line.

Approach | Time | Space | Best For
grep with regex | O(n) | O(k) | Simple, fast URL extraction
awk with match loop | O(n) | O(k) | Multiple URLs per line, flexible processing
sed substitution | O(n) | O(1) | Extracting a single URL per line, simpler cases
💡
Use grep -oE with a simple regex to quickly extract URLs from text files.
⚠️
Beginners often forget the -o option in grep, which causes the whole matching line to print instead of just the URLs.
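The difference is easy to demonstrate (the sample line is made up):

```shell
# Without -o, grep prints every line that contains a match...
printf 'Visit https://example.com now.\n' | grep -E "https?://[^ \"'>]+"
# → Visit https://example.com now.

# ...with -o, it prints only the matched URL itself.
printf 'Visit https://example.com now.\n' | grep -oE "https?://[^ \"'>]+"
# → https://example.com
```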