Extracting unique words from all my blog post titles

Had an idea to extract all the unique words from my blog post titles and sort and rank them by frequency. I used MySQL, sed, tr, grep, cat and a little bash script hacked together to do this.

Here’s the top 10 unique words in my blog post titles.

Occurrences	Word
150	Windows
46	Server
32	Cisco
26	Command
25	Microsoft
22	SQL
20	Explorer
19	Linux
18	Internet
18	Error

Here’s how I got to this…

SQL Query

select id,post_title from wp_posts where post_type='post' and post_status='publish'

Bash Script

The script splits each word into a new line and also removes any non-alphanumeric characters sh split.sh > single-words.txt

#!/bin/bash

cat post-titles.csv | while read line
do
    for word in $line
    do
        echo $word | tr -cd '[:alnum:]\n'
    done
done

Cleanup and Sorting

Remove empty lines

sed -i '/^$/d' single-words.txt

Prepare stopwords

wget https://gist.githubusercontent.com/sebleier/554280/raw/ -O stopwords.txt

Remove stopwords from list I have so far.

cat single-words.txt | grep -v -Fix -f stopwords.txt|sort -rn|uniq -c|sort -rn|head -15

And that’s a wrap.

SQL Query

Bash Script

Cleanup and Sorting

Leave a comment Cancel reply