Pennsylvania COVID-19 Data Scraping with Python

Over the last few weeks I’ve been using Python to scrape the Pennsylvania Department of Health’s Coronavirus page. Over time the page has evolved and even split into sub-pages that contain tables of cases, deaths and other statistics.

I’ve decided to put my Python script on GitHub for public consumption. I originally wrote the script to send me alerts when the reported numbers changed. The website was not updated at a set time during the day, so I wanted a cron job to check it and alert me of new updates.
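Here is a minimal sketch of that change check, kept separate from the page scraping so it can be tried offline. The banner wording, the regular expression and the function names `extract_stamp` and `is_new_update` are my assumptions for illustration; the real page text may be worded differently.

```python
import re
from pathlib import Path

# Assumed banner wording, e.g. "... updated at 12:00 p.m. on 3/28/2020".
STAMP_RE = re.compile(
    r"\bat\s+(\d{1,2}:\d{2}\s*[ap]\.m\.\s+on\s+\d{1,2}/\d{1,2}/\d{4})"
)


def extract_stamp(banner_text):
    """Return the 'time on date' part of the banner, or None if it is missing."""
    match = STAMP_RE.search(banner_text)
    return match.group(1) if match else None


def is_new_update(stamp, state_file):
    """Compare stamp with the value saved in state_file, then save the new one.

    Returns True when the stamp differs from the saved value, or when
    nothing has been saved yet.
    """
    path = Path(state_file)
    previous = path.read_text().strip() if path.exists() else None
    if stamp == previous:
        return False
    path.write_text(stamp)
    return True
```

A cron job can then run the script as often as needed and only alert when `is_new_update()` returns True.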

Below you’ll find the main script. Note that the code below may be out of date, so please check my GitHub repository for the latest version. The code changes almost daily because the Pennsylvania Dept. of Health seems to keep changing the structure of the webpage, or adding new data to it, which throws off my code.

Python Script

import pandas as pd
from bs4 import BeautifulSoup
import requests
import os
import re
url = r'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx'
html_content = requests.get(url).text
lastupdatedfile = "lastupdated.txt"
soup = BeautifulSoup(html_content, "lxml")
stats = soup.find("span",attrs={"class": "ms-rteStyle-Quote"})
#TODO: byte / string issue here on updatecheck
updatecheck = stats.text[stats.text.find("at "):][3:]
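if os.path.isfile(lastupdatedfile):
	# Read the timestamp stored by the previous run.
	with open(lastupdatedfile) as f:
		lastupdate = f.read()
	if lastupdate == updatecheck:
		print("Skipping check, no new update.")
		# Exit normally; os._exit() would skip cleanup and output flushing.
		raise SystemExit(0)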
	else:
		print("***UPDATE***\nOld: {}".format(lastupdate))
		print("New: {}".format(updatecheck))
tables = pd.read_html(html_content, header=0)
df = tables[3]
totalCounties = 67
print("Pennsylvania Data ({})".format(updatecheck))
deathsTotal = int(df["Deaths"].sum())
casesTotal  = int(df["Number of Cases"].sum())
mortalityPercent = round((deathsTotal / casesTotal) * 100,2)
reportingTotal = int(df["County"].count())
reportingCases = df["Number of Cases"]
reportingCasesPct = round((reportingCases.count() / totalCounties) * 100,2)
# Count the counties that report at least one death.
reportingDeaths = int((df["Deaths"] > 0).sum())
reportingDeathsPct = round((reportingDeaths / reportingTotal) * 100,2)
print("Total Cases: {}".format(casesTotal))
print("Total Deaths: {}".format(deathsTotal))
print("Mortality Rate(%): {}".format(mortalityPercent))
print("Counties Reporting Cases: {}".format(reportingCases.count()))
print("Counties Reporting Cases(%): {}".format(reportingCasesPct))
print("Counties Reporting Deaths: {}".format(reportingDeaths))
print("Counties Reporting Deaths(% of counties reporting cases): {}".format(reportingDeathsPct))
# Remember this timestamp for the next run.
with open(lastupdatedfile, "w") as f:
	f.write(updatecheck)
# Add some notification stuff here...
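The notification step is left open in the script above. One possible way to fill it in is a plain-text e-mail, sketched below. The function name `build_alert` and both addresses are placeholders I made up for illustration, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def build_alert(stamp, cases_total, deaths_total,
                sender="alerts@example.com", recipient="me@example.com"):
    """Build (but do not send) a plain-text alert for a new update."""
    msg = EmailMessage()
    msg["Subject"] = "PA COVID-19 update: {}".format(stamp)
    msg["From"] = sender          # placeholder address
    msg["To"] = recipient         # placeholder address
    msg.set_content(
        "Total Cases: {}\nTotal Deaths: {}\n".format(cases_total, deaths_total)
    )
    return msg


# Sending is left to the caller, for example through a local mail relay
# (assumes one is listening on localhost):
#   import smtplib
#   with smtplib.SMTP("localhost") as smtp:
#       smtp.send_message(build_alert(updatecheck, casesTotal, deathsTotal))
```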

Output Example

Pennsylvania Data (12:00 p.m. on 3/28/2020)
Total Cases: 2751
Total Deaths: 34
Mortality Rate(%): 1.24
Counties Reporting Cases: 56
Counties Reporting Cases(%): 83.58
Counties Reporting Deaths: 13
Counties Reporting Deaths(% of counties reporting cases): 23.21
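The percentages in the example output can be checked by hand. The short sketch below repeats the script's rounding with the numbers shown above; the helper name `pct` is mine. Note that the last figure is relative to the 56 counties reporting cases, not to all 67 counties.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals as in the script."""
    return round((part / whole) * 100, 2)


total_counties = 67          # totalCounties in the script
cases, deaths = 2751, 34     # totals from the example output
counties_with_cases = 56
counties_with_deaths = 13

print(pct(deaths, cases))                              # 1.24
print(pct(counties_with_cases, total_counties))        # 83.58
print(pct(counties_with_deaths, counties_with_cases))  # 23.21
```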

Login to WordPress from Python

I’ve been trying to learn some Python and have been tinkering with the requests module. Here is how I am able to log into a webpage, such as WordPress.

import requests

url = "https://techish.net/wp-login.php"
redirect_to = "https://techish.net/wp-admin/"

with requests.Session() as session:
    # Post the login form; the session stores the cookies WordPress sets.
    login = session.post(url, data={
        'log': 'admin',
        'pwd': 'password',
        'redirect_to': redirect_to
    }, allow_redirects=True)
    # The session sends the stored cookies automatically.
    admin_page = session.get(redirect_to)
    print(admin_page.text)
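Printing the page does not say whether the login actually worked. A rough check, sketched below, is to look for the cookie WordPress sets after a successful login, whose name starts with `wordpress_logged_in` by default. The helper name `looks_logged_in` is mine, and a site that overrides the default cookie name would need a different prefix.

```python
def looks_logged_in(cookie_names):
    """True if any cookie name starts with WordPress's default logged-in prefix."""
    return any(name.startswith("wordpress_logged_in") for name in cookie_names)


# With a requests session, the names could come from the session's cookie jar:
#   looks_logged_in(session.cookies.keys())
```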

3% of Government Websites Still Remain Unpatched Against the OpenSSL "Heartbleed Bug"

Yesterday, I collected over 1,200 .GOV TLD domains and ran checks against them. Of those, 58 were affected by the OpenSSL bug known as Heartbleed. This morning, upon checking again, 39 of those 58 remain unpatched.

During my testing I inadvertently obtained login credentials for one particular .GOV website, as illustrated in the screenshot below.

Heartbleed affected .GOV website showing user credentials

I collected the .GOV domains from http://www.data.gov/. I wrote a simple bash loop over this list and passed each domain to a proof-of-concept “check” tool to determine whether the site was unpatched. The tool I used is https://gist.github.com/takeshixx/10107280 (Python).
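I used a bash loop, but the same idea can be sketched in Python. This is not my original script: the function name `run_checker`, the `checker_cmd` argument and the `marker` string are all assumptions for illustration, since how a given check tool is invoked and what it prints depend on the tool.

```python
import subprocess


def run_checker(domains, checker_cmd, marker, timeout=30):
    """Run checker_cmd once per domain; return domains whose output contains marker.

    checker_cmd is the command as a list (the domain is appended to it) and
    marker is whatever text the tool prints for an affected host. Both are
    assumptions to adapt to the actual tool.
    """
    flagged = []
    for domain in domains:
        try:
            proc = subprocess.run(
                checker_cmd + [domain],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # slow or unreachable hosts are skipped, not flagged
        if marker in proc.stdout:
            flagged.append(domain)
    return flagged
```

Running the same list again the next day and comparing the two results gives the "still unpatched" count.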

Check for configuration errors with FAM/Gamin Library

Popup in Outlook and webmail:

Your IMAP server wants to alert you to the following: filesystem notification initialization error — contact your mail administrator (check for configuration errors with the FAM/Gamin library)

I have a Courier IMAP+SASL+Maildrop+Postfix+MySQL setup.
I don’t know what the root problem is; I just know things were working until I recently updated the system (which inevitably broke something).

root@node1:# apt-cache search fam gamin
gamin - File and directory monitoring system
libgamin-dev - Development files for the gamin client library
libgamin0 - Client library for the gamin file and directory monitoring system
python-gamin - Python binding for the gamin client library
kdelibs4c2a - core libraries and binaries for all KDE applications
root@node1:# apt-get install gamin
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  sensible-mda tnef sendmail-cf sendmail-base daemon libnet-cidr-lite-perl
  clamav-daemon
Use 'apt-get autoremove' to remove them.
The following extra packages will be installed:
  libgamin0 libglib2.0-0 libglib2.0-data shared-mime-info
The following packages will be REMOVED:
  libfam0
The following NEW packages will be installed:
  gamin libgamin0 libglib2.0-0 libglib2.0-data shared-mime-info
0 upgraded, 5 newly installed, 1 to remove and 32 not upgraded.
Need to get 3072 kB of archives.
After this operation, 10.5 MB of additional disk space will be used.
Do you want to continue [Y/n]? y
Get:1 http://ftp.us.debian.org/debian/ squeeze/main libglib2.0-0 amd64 2.24.2-1 [1122 kB]
Get:2 http://ftp.us.debian.org/debian/ squeeze/main libgamin0 amd64 0.1.10-2+b1 [42.3 kB]
Get:3 http://ftp.us.debian.org/debian/ squeeze/main gamin amd64 0.1.10-2+b1 [72.9 kB]
Get:4 http://ftp.us.debian.org/debian/ squeeze/main libglib2.0-data all 2.24.2-1 [994 kB]
Get:5 http://ftp.us.debian.org/debian/ squeeze/main shared-mime-info amd64 0.71-4 [841 kB]
Fetched 3072 kB in 2s (1085 kB/s)
dpkg: libfam0: dependency problems, but removing anyway as you requested:
 courier-base depends on libfam0.
 courier-imap depends on libfam0.
(Reading database ... 31853 files and directories currently installed.)
Removing libfam0 ...
Selecting previously deselected package libglib2.0-0.
(Reading database ... 31845 files and directories currently installed.)
Unpacking libglib2.0-0 (from .../libglib2.0-0_2.24.2-1_amd64.deb) ...
Selecting previously deselected package libgamin0.
Unpacking libgamin0 (from .../libgamin0_0.1.10-2+b1_amd64.deb) ...
Selecting previously deselected package gamin.
Unpacking gamin (from .../gamin_0.1.10-2+b1_amd64.deb) ...
Selecting previously deselected package libglib2.0-data.
Unpacking libglib2.0-data (from .../libglib2.0-data_2.24.2-1_all.deb) ...
Selecting previously deselected package shared-mime-info.
Unpacking shared-mime-info (from .../shared-mime-info_0.71-4_amd64.deb) ...
Processing triggers for man-db ...
Setting up libglib2.0-0 (2.24.2-1) ...
Setting up libglib2.0-data (2.24.2-1) ...
Setting up shared-mime-info (0.71-4) ...
Setting up gamin (0.1.10-2+b1) ...
Setting up libgamin0 (0.1.10-2+b1) ...
root@node1:# /etc/init.d/courier-imap restart
Stopping Courier IMAP server: imapd.
Starting Courier IMAP server: imapd.
root@node1:# /etc/init.d/courier-imap
courier-imap      courier-imap-ssl
root@node1:# /etc/init.d/courier-imap-ssl restart
Stopping Courier IMAP-SSL server: imapd-ssl.
Starting Courier IMAP-SSL server: imapd-ssl.
root@node1:# dpkg -l | grep -i 'libfam\|gamin'
ii  gamin                               0.1.10-2+b1                  File and directory monitoring system
rc  libfam0                             2.7.0-17                     Client library to control the FAM daemon
ii  libgamin0                           0.1.10-2+b1                  Client library for the gamin file and directory monitoring system
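In the `dpkg -l` listing above, the first column is the package state: `ii` means installed, and `rc` means the package was removed but its configuration files remain. The small sketch below splits such rows into fields. The helper name `parse_dpkg_line` is mine, and it assumes the four-column layout shown above (newer dpkg versions add an architecture column).

```python
STATUS_MEANING = {
    "ii": "installed",
    "rc": "removed, configuration files remain",
}


def parse_dpkg_line(line):
    """Split one `dpkg -l` row into (status, package, version, description)."""
    status, package, version, description = line.split(None, 3)
    return status, package, version, description.strip()


rows = [
    "ii  gamin      0.1.10-2+b1  File and directory monitoring system",
    "rc  libfam0    2.7.0-17     Client library to control the FAM daemon",
]
for row in rows:
    status, package, version, _ = parse_dpkg_line(row)
    print(package, version, STATUS_MEANING.get(status, status))
```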

libfam0:

Description: Client library to control the FAM daemon
FAM monitors files and directories, notifying interested applications
of changes.
.
This package provides a shared library to allow programs to connect to
the FAM daemon and ask for files to be monitored.
Homepage: http://oss.sgi.com/projects/fam/

gamin:

Description: File and directory monitoring system
Gamin is a file and directory monitoring system which allows
applications to detect when a file or a directory has been added,
removed or modified by somebody else.
.
It can be used by desktops like KDE, GNOME or Xfce to have their
virtual file systems keep track of changes to files and directories.
For example, if a file manager displays a directory to the user, and
the user removes one of the files via the command-line, gamin will
notify the file manager of this change so that it can update the
directory display.
.
Gamin has been designed as a drop-in replacement for FAM with security
and maintainability in mind and can use Linux’s advanced inotify
service when available.

All I know is that things work again now that gamin has replaced libfam0. I’ll dig into this some other day. In with the new, out with the old.