Bash Scripting · How-To · Beginner · 2 min read

Bash Script to Find Word Frequency in File

Use tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr in Bash to find and display word frequency in a file.
📋

Examples

Input: hello world hello
Output: 2 hello 1 world

Input: apple banana apple orange banana apple
Output: 3 apple 2 banana 1 orange
🧠

How to Think About It

To find word frequency, first split the text into words by replacing every non-alphanumeric character with a newline. Then sort the words so duplicates land next to each other, count each unique word's occurrences, and finally sort the counts in descending order so the most frequent words appear first.
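To see how the pieces fit, the pipeline's stages can be run one at a time on a sample string (the input here is illustrative):

```bash
input='hello world hello'

# Stage 1: replace non-alphanumerics with newlines -> one word per line.
printf '%s\n' "$input" | tr -cs '[:alnum:]' '\n'

# Stage 2: sort so duplicate words become adjacent.
printf '%s\n' "$input" | tr -cs '[:alnum:]' '\n' | sort

# Stage 3: count adjacent duplicates, then order by count, descending.
# Prints "2 hello" above "1 world" (uniq -c left-pads the counts).
printf '%s\n' "$input" | tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr
```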
📐

Algorithm

1. Read the file content.
2. Replace all non-alphanumeric characters with newlines to isolate words.
3. Sort the list of words alphabetically.
4. Count the occurrences of each unique word.
5. Sort the counted words by frequency in descending order.
6. Display the word counts.
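The steps above correspond one-to-one to the stages of the pipeline (file.txt is a placeholder name):

```bash
# Steps 1-2: read the file and isolate words, one per line.
# Step  3:   sort alphabetically so duplicates are adjacent.
# Step  4:   count each run of identical lines.
# Step  5:   sort numerically by count, descending.
# Step  6:   results print to stdout.
tr -cs '[:alnum:]' '\n' < file.txt \
  | sort \
  | uniq -c \
  | sort -nr
```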
💻

Code

```bash
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 filename"
  exit 1
fi

tr -cs '[:alnum:]' '\n' < "$1" | sort | uniq -c | sort -nr | awk '{print $1, $2}'
```
Output
```
2 hello
1 world
```
🔍

Dry Run

Let's trace the input 'hello world hello' through the code:

1. Replace non-alphanumeric characters with newlines

   Input: 'hello world hello' -> Output: 'hello\nworld\nhello'

2. Sort words

   Words: ['hello', 'world', 'hello'] sorted -> ['hello', 'hello', 'world']

3. Count unique words

   'hello' appears 2 times, 'world' appears 1 time

| Word  | Count |
|-------|-------|
| hello | 2     |
| world | 1     |
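The same trace can be reproduced without creating a file by piping echo into the pipeline:

```bash
echo 'hello world hello' \
  | tr -cs '[:alnum:]' '\n' \
  | sort \
  | uniq -c \
  | sort -nr \
  | awk '{print $1, $2}'
# 2 hello
# 1 world
```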
💡

Why This Works

Step 1: Splitting words

The tr -cs '[:alnum:]' '\n' command replaces every character that is not a letter or digit with a newline (-c complements the set), and -s squeezes runs of newlines into one, so each word lands on its own line with no blank lines in between.

Step 2: Sorting words

Sorting groups identical words together so uniq -c can count consecutive duplicates.

Step 3: Counting and sorting frequency

uniq -c counts occurrences, and sort -nr sorts the counts in descending order to show the most frequent words first.
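One caveat the pipeline leaves open is case: 'Hello' and 'hello' are counted as different words. If case-insensitive counts are wanted, a lowercase-folding tr stage can be added first (a sketch; the sample input is illustrative):

```bash
echo 'Hello hello HELLO world' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs '[:alnum:]' '\n' \
  | sort | uniq -c | sort -nr \
  | awk '{print $1, $2}'
# 3 hello
# 1 world
```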

🔄

Alternative Approaches

awk script
```bash
awk '{for(i=1;i<=NF;i++) freq[$i]++} END {for(word in freq) print freq[word], word}' filename | sort -nr
```
Counts words in a single awk pass without chaining several commands; simpler, but awk's default field splitting is whitespace-only, so punctuation stays attached to words.
grep and sort
```bash
grep -oE '\w+' filename | sort | uniq -c | sort -nr
```
Extracts words with grep; concise, but \w also matches underscores, and the -o and -E options aren't guaranteed on every grep.
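If external tools beyond tr and sort are to be avoided, counting can also be done in the shell itself. This is a sketch assuming bash 4+ (associative arrays); the filename-argument handling mirrors the main script:

```bash
#!/bin/bash
# Sketch: count words with a bash 4+ associative array.
declare -A freq

# Split the file into words and tally each one.
while read -r word; do
  [ -n "$word" ] && (( freq[$word] += 1 ))
done < <(tr -cs '[:alnum:]' '\n' < "$1")

# Print "count word" pairs, most frequent first.
for word in "${!freq[@]}"; do
  printf '%s %s\n' "${freq[$word]}" "$word"
done | sort -nr
```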

Complexity: O(n log n) time, O(n) space

Time Complexity

Sorting the words dominates time at O(n log n), where n is the number of words.

Space Complexity

Extra space is needed to store all words and counts, so O(n) space is used.

Which Approach is Fastest?

The tr | sort | uniq pipeline is simple and reliable. The awk approach counts in a single O(n) pass and only sorts the unique words, so it can be faster on very large files, but its whitespace-only splitting is less precise.

| Approach           | Time       | Space | Best For                                  |
|--------------------|------------|-------|-------------------------------------------|
| tr + sort + uniq   | O(n log n) | O(n)  | General use, reliable word splitting      |
| awk counting       | O(n)       | O(n)  | Direct counting, simpler code             |
| grep + sort + uniq | O(n log n) | O(n)  | Quick extraction if grep supports options |
💡
Use tr -cs '[:alnum:]' '\n' to split text into words cleanly in Bash.
⚠️
If word boundaries aren't normalized first, punctuation stuck to words (e.g. 'word,' vs 'word') skews the counts.