Extracting unique words from all my blog post titles

Had an idea to extract all the unique words from my blog post titles and sort and rank them by frequency. I used MySQL, sed, tr, grep, cat and a little bash script hacked together to do this.

Here’s the top 10 unique words in my blog post titles.

OccurrencesWord
150Windows
46Server
32Cisco
26Command
25Microsoft
22SQL
20Explorer
19Linux
18Internet
18Error

Here’s how I got to this…

SQL Query

select id,post_title from wp_posts where post_type='post' and post_status='publish'

Bash Script

The script splits each word into a new line and also removes any non-alphanumeric characters sh split.sh > single-words.txt

#!/bin/bash

cat post-titles.csv | while read line
do
    for word in $line
    do
        echo $word | tr -cd '[:alnum:]\n'
    done
done

Cleanup and Sorting

Remove empty lines

sed -i '/^$/d' single-words.txt

Prepare stopwords

wget https://gist.githubusercontent.com/sebleier/554280/raw/ -O stopwords.txt

Remove stopwords from list I have so far.

cat single-words.txt | grep -v -Fix -f stopwords.txt|sort -rn|uniq -c|sort -rn|head -15

And that’s a wrap.

Published by

Rich

Just another IT guy.

Leave a Reply

Your email address will not be published. Required fields are marked *