Had an idea to extract all the unique words from my blog post titles and sort and rank them by frequency. I used MySQL, sed, tr, grep, cat and a little bash script hacked together to do this.
Here’s the top 10 unique words in my blog post titles.
| Occurrences | Word |
|---|---|
| 150 | Windows |
| 46 | Server |
| 32 | Cisco |
| 26 | Command |
| 25 | Microsoft |
| 22 | SQL |
| 20 | Explorer |
| 19 | Linux |
| 18 | Internet |
| 18 | Error |
Here’s how I got to this…
SQL Query
select id,post_title from wp_posts where post_type='post' and post_status='publish'
Bash Script
The script splits each word into a new line and also removes any non-alphanumeric characters sh split.sh > single-words.txt
#!/bin/bash
cat post-titles.csv | while read line
do
for word in $line
do
echo $word | tr -cd '[:alnum:]\n'
done
done
Cleanup and Sorting
Remove empty lines
sed -i '/^$/d' single-words.txt
Prepare stopwords
wget https://gist.githubusercontent.com/sebleier/554280/raw/ -O stopwords.txt
Remove stopwords from list I have so far.
cat single-words.txt | grep -v -Fix -f stopwords.txt|sort -rn|uniq -c|sort -rn|head -15
And that’s a wrap.