Had an idea to extract all the unique words from my blog post titles and sort and rank them by frequency. I used MySQL, sed, tr, grep, cat and a little bash script hacked together to do this.
Here’s the top 10 unique words in my blog post titles.
Occurrences | Word |
---|---|
150 | Windows |
46 | Server |
32 | Cisco |
26 | Command |
25 | Microsoft |
22 | SQL |
20 | Explorer |
19 | Linux |
18 | Internet |
18 | Error |
Here’s how I got to this…
SQL Query
select id,post_title from wp_posts where post_type='post' and post_status='publish'
Bash Script
The script splits each word into a new line and also removes any non-alphanumeric characters sh split.sh > single-words.txt
#!/bin/bash cat post-titles.csv | while read line do for word in $line do echo $word | tr -cd '[:alnum:]\n' done done
Cleanup and Sorting
Remove empty lines
sed -i '/^$/d' single-words.txt
Prepare stopwords
wget https://gist.githubusercontent.com/sebleier/554280/raw/ -O stopwords.txt
Remove stopwords from list I have so far.
cat single-words.txt | grep -v -Fix -f stopwords.txt|sort -rn|uniq -c|sort -rn|head -15
And that’s a wrap.