Pennsylvania COVID-19 Data Scraping with Python


Over the last few weeks I’ve been using Python to scrape the Pennsylvania Department of Health’s Coronavirus page. Over time the page has evolved and even split into sub-pages that contain tabular data on cases, deaths, and other statistics.

I’ve decided to put my Python script on GitHub for public consumption. I originally wrote the script to alert me when the reported numbers changed. The website was not updated at a set time each day, so I wanted a cron job that would check it and alert me to new updates.

Below you’ll find the main script. Note that the code below may be out of date, so please check my GitHub repository for the latest version. The code changes almost daily because the Pennsylvania Dept. of Health keeps changing the structure of the webpage, or adding new data to it, which breaks my code.
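
One way to make the script less sensitive to these layout changes is to pick the county table by its column headers instead of by its position on the page. The helper below is only a sketch and is not taken from the repository: `find_table` is a hypothetical name, and it assumes each table exposes a `columns` attribute, as the DataFrames returned by `pd.read_html` do.

```python
def find_table(tables, required_columns):
    """Return the first table whose header contains every required column.

    `tables` is any iterable of objects with a `columns` attribute,
    for example the list of DataFrames returned by pandas.read_html.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(table.columns)):
            return table
    # Nothing matched: most likely the page layout or headers changed.
    raise LookupError("No table has columns: {}".format(sorted(required)))
```

In the script this could stand in for the positional lookup, for example `df = find_table(tables, ["County", "Number of Cases", "Deaths"])`. If the headers themselves are renamed, the lookup fails with a clear error rather than silently reading the wrong table.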

Python Script

import os
import sys
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx"
lastupdatedfile = "lastupdated.txt"
totalCounties = 67  # Pennsylvania has 67 counties

# Download the page and stop on HTTP errors instead of parsing an error page.
response = requests.get(url, timeout=30)
response.raise_for_status()
html_content = response.text

# The "last updated" note sits in a span with this style class.
soup = BeautifulSoup(html_content, "lxml")
stats = soup.find("span", attrs={"class": "ms-rteStyle-Quote"})
if stats is None:
	sys.exit("Update timestamp not found; the page layout may have changed.")

# Keep everything after "at ", e.g. "12:00 p.m. on 3/28/2020".
# TODO: byte / string issue here on updatecheck
updatecheck = stats.text.partition("at ")[2].strip()

# Compare against the timestamp saved by the previous run.
if os.path.isfile(lastupdatedfile):
	with open(lastupdatedfile, encoding="utf-8") as f:
		lastupdate = f.read().strip()
	if lastupdate == updatecheck:
		print("Skipping check, no new update.")
		# sys.exit flushes stdout; os._exit can drop buffered output under cron.
		sys.exit(0)
	print("***UPDATE***\nOld: {}".format(lastupdate))
	print("New: {}".format(updatecheck))

# The county table was the fourth table on the page when this was written.
# Newer pandas versions expect a file-like object rather than a raw HTML string.
tables = pd.read_html(StringIO(html_content), header=0)
df = tables[3]

print("Pennsylvania Data ({})".format(updatecheck))
deathsTotal = int(df["Deaths"].sum())
casesTotal = int(df["Number of Cases"].sum())
mortalityPercent = round((deathsTotal / casesTotal) * 100, 2)
reportingTotal = int(df["County"].count())
# count() tallies the rows that have a value in the column.
reportingCases = int(df["Number of Cases"].count())
reportingCasesPct = round((reportingCases / totalCounties) * 100, 2)
# Counties with at least one death.
reportingDeaths = int((df["Deaths"] > 0).sum())
reportingDeathsPct = round((reportingDeaths / reportingTotal) * 100, 2)

print("Total Cases: {}".format(casesTotal))
print("Total Deaths: {}".format(deathsTotal))
print("Mortality Rate(%): {}".format(mortalityPercent))
print("Counties Reporting Cases: {}".format(reportingCases))
print("Counties Reporting Cases(%): {}".format(reportingCasesPct))
print("Counties Reporting Deaths: {}".format(reportingDeaths))
print("Counties Reporting Deaths(% of counties reporting cases): {}".format(reportingDeathsPct))

# Remember this timestamp so the next run can detect a change.
with open(lastupdatedfile, "w", encoding="utf-8") as f:
	f.write(updatecheck)
# Add some notification stuff here...
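
The notification step is left open in the script. As one possible shape for it, here is a minimal sketch that builds an email alert with the standard library. The names `build_alert` and `send_alert`, the subject wording, and the `localhost:25` mail relay are all assumptions for illustration, not part of the original script.

```python
import smtplib
from email.message import EmailMessage


def build_alert(old_stamp, new_stamp, summary_lines, sender, recipient):
    """Build an email describing the change (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "PA COVID-19 data updated ({})".format(new_stamp)
    msg["From"] = sender
    msg["To"] = recipient
    body = "Old: {}\nNew: {}\n\n{}".format(
        old_stamp, new_stamp, "\n".join(summary_lines)
    )
    msg.set_content(body)
    return msg


def send_alert(msg, host="localhost", port=25):
    """Send the message through an SMTP relay (host and port are assumed)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Keeping the message construction separate from the sending makes it possible to check the alert text without a mail server. The script would call `send_alert(build_alert(...))` only after it has detected a new timestamp.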

Output Example

Pennsylvania Data (12:00 p.m. on 3/28/2020)
Total Cases: 2751
Total Deaths: 34
Mortality Rate(%): 1.24
Counties Reporting Cases: 56
Counties Reporting Cases(%): 83.58
Counties Reporting Deaths: 13
Counties Reporting Deaths(% of counties reporting cases): 23.21
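
The percentages in this sample can be reproduced by hand, which also shows which denominator each one uses. The snippet below is a small check using only the figures printed above; the helper name `pct` is introduced here for illustration.

```python
# Figures copied from the sample output above.
total_counties = 67
cases_total = 2751
deaths_total = 34
counties_with_cases = 56
counties_with_deaths = 13


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(part / whole * 100, 2)


# Deaths as a share of all reported cases.
mortality_pct = pct(deaths_total, cases_total)
# Counties with cases as a share of all 67 counties.
reporting_cases_pct = pct(counties_with_cases, total_counties)
# Counties with deaths as a share of the counties that report cases.
reporting_deaths_pct = pct(counties_with_deaths, counties_with_cases)

print(mortality_pct, reporting_cases_pct, reporting_deaths_pct)
# prints: 1.24 83.58 23.21
```

Note that the last figure is relative to the 56 counties reporting cases, not to all 67 counties.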
