Bash Scripting · How-To · Beginner · 2 min read

Bash Script to Extract URLs from File

Use grep -oE "https?://[^ \"'>]+" filename in Bash to extract all http and https URLs from a file. The pattern is double-quoted because Bash does not allow an escaped single quote (\') inside a single-quoted string.
📋

Examples

Input: Visit https://example.com and http://test.org for info.
Output:
https://example.com
http://test.org

Input: No URLs here, just text.
Output: (nothing)

Input: Check https://site.com/page?query=1 and https://another-site.org.
Output:
https://site.com/page?query=1
https://another-site.org.

(Note that in the last example the trailing period is captured too: '.' is a valid URL character and is not in the excluded character class.)
🧠

How to Think About It

To extract URLs from a file, look for patterns starting with http or https followed by characters that are valid in URLs. Use a regular expression with grep to find and print only those matches.
📐

Algorithm

1. Read the input file line by line.
2. Search each line for substrings that start with 'http://' or 'https://'.
3. Extract the matching substrings that form valid URLs.
4. Print each found URL on its own line.
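The four steps above can also be sketched with Bash's own regex engine, without grep. This is illustrative only (the script in the Code section uses grep), and the function name extract_urls is an assumption, not from the original:

```shell
# Pure-Bash sketch of the algorithm above (illustrative only).
extract_urls() {
  local line rest url
  local re='https?://[^ "'\''>]+'   # same character class as the grep version
  while IFS= read -r line; do       # Step 1: read the file line by line
    rest="$line"
    while [[ $rest =~ $re ]]; do    # Step 2: find the next http(s):// substring
      url="${BASH_REMATCH[0]}"      # Step 3: the matched URL
      printf '%s\n' "$url"          # Step 4: print each URL on its own line
      rest="${rest#*"$url"}"        # continue scanning after the match
    done
  done < "$1"
}
```

This avoids spawning an external process per file, but grep is shorter and faster for large inputs.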
💻

Code

bash
#!/bin/bash

# Extract URLs from a file passed as argument
if [ $# -eq 0 ]; then
  echo "Usage: $0 filename"
  exit 1
fi

filename="$1"
grep -oE "https?://[^ \"'>]+" "$filename"
Output
https://example.com
http://test.org
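A quick way to try this out is to run the same grep command against a small sample file (the filename sample.txt is illustrative):

```shell
# Build a two-line sample file, then extract its URLs with the same grep call.
cat > sample.txt <<'EOF'
Visit https://example.com and http://test.org for info.
No URLs here, just text.
EOF

grep -oE "https?://[^ \"'>]+" sample.txt
# → https://example.com
# → http://test.org
```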
🔍

Dry Run

Let's trace the input 'Visit https://example.com and http://test.org for info.' through the code:

1. Read line
Line: Visit https://example.com and http://test.org for info.

2. Apply grep regex
Matches found: https://example.com, http://test.org

3. Print matches
Output lines: https://example.com and http://test.org, each on its own line

Step | Action | Value
1 | Read line | Visit https://example.com and http://test.org for info.
2 | Extract URLs | https://example.com, http://test.org
3 | Print URLs | https://example.com and http://test.org, one per line
💡

Why This Works

Step 1: Use grep with regex

The grep -oE command searches for patterns and prints only the matched parts, not the whole line.

Step 2: Regex pattern explained

The pattern https?://[^ "'>]+ matches 'http' or 'https', then '://', then one or more characters that are not a space, a quote, or '>', since those characters commonly mark the end of a URL in surrounding text.
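A quick check (the sample text is made up) shows the character class at work: quotes and angle brackets cut the match off cleanly:

```shell
# Quotes and '>' are excluded from the class, so URLs wrapped in
# <...> or "..." are extracted without the surrounding punctuation.
printf 'See <https://example.com/docs> and "http://test.org/a" today.\n' \
  | grep -oE "https?://[^ \"'>]+"
# → https://example.com/docs
# → http://test.org/a
```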

Step 3: Extract URLs line by line

This approach reads each line and extracts all URLs found, printing each on its own line for easy use.

🔄

Alternative Approaches

Using awk
bash
awk '{while (match($0, /https?:\/\/[^ "'\''>]+/)) {print substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + RLENGTH)}}' filename
More flexible but more verbose; good for lines containing several URLs. The '\'' sequence is how a literal single quote is embedded in a single-quoted shell string.
Using sed
bash
sed -n "s/.*\(https\?:\/\/[^ \"'>]*\).*/\1/p" filename
Simpler, but prints at most one URL per line (the last one, because the leading .* is greedy).
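Running the three approaches on the same sample line makes the trade-off concrete. Note that sed's greedy leading .* captures the last URL on the line, not the first:

```shell
printf 'Visit https://example.com and http://test.org for info.\n' > cmp.txt

# grep: both URLs, one per line
grep -oE "https?://[^ \"'>]+" cmp.txt

# awk: both URLs, one per line
awk '{while (match($0, /https?:\/\/[^ "'\''>]+/)) {
        print substr($0, RSTART, RLENGTH)
        $0 = substr($0, RSTART + RLENGTH)
      }}' cmp.txt

# sed: only http://test.org (the last URL on the line)
sed -n "s/.*\(https\?:\/\/[^ \"'>]*\).*/\1/p" cmp.txt
```

(The \? quantifier in the sed pattern is a GNU sed extension to basic regular expressions.)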

Complexity: O(n) time, O(k) space

Time Complexity

The script reads each line once and applies a regex match, so time grows linearly with file size (O(n)).

Space Complexity

Only stores matched URLs temporarily, so space depends on number of URLs found (O(k)), generally small compared to input.

Which Approach is Fastest?

Using grep is the fastest and simplest option for extracting URLs; awk is more flexible but slower; sed prints at most one URL per line.

Approach | Time | Space | Best For
grep with regex | O(n) | O(k) | Simple, fast URL extraction
awk with match loop | O(n) | O(k) | Multiple URLs per line, flexible processing
sed substitution | O(n) | O(1) | Extracting a single URL per line, simpler cases
💡
Use grep -oE with a simple regex to quickly extract URLs from text files.
⚠️
Beginners often forget the -o option in grep, which causes the whole matching line to print instead of just the URLs.
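The difference is easy to demonstrate (the sample line is made up):

```shell
# Without -o, grep prints every line that contains a match...
printf 'Visit https://example.com now.\n' | grep -E "https?://[^ \"'>]+"
# → Visit https://example.com now.

# ...with -o, it prints only the matched URL itself.
printf 'Visit https://example.com now.\n' | grep -oE "https?://[^ \"'>]+"
# → https://example.com
```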