Over the last few weeks I’ve been using Python to scrape the Pennsylvania Department of Health’s Coronavirus page. Over time the page has evolved and even split into sub-pages that contain tables of cases, deaths and other statistics.
I’ve decided to put my Python script on GitHub for public consumption. When I first created the script, I used it to send me alerts whenever the reported numbers changed. There was no set time of day when the website was updated, so I wanted a cron job to check the site and alert me to new updates.
Below you’ll find the main script. Note that the code below may be out of date, so please check my GitHub repository for the latest version. The code changes almost daily, because the Pennsylvania Dept. of Health seems to keep changing the structure of the webpage or adding new data to it, which throws off my code.
Python Script
import os
import re
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = r'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx'
html_content = requests.get(url).text
lastupdatedfile = "lastupdated.txt"

# The "last updated" timestamp lives in a styled <span> on the page
soup = BeautifulSoup(html_content, "lxml")
stats = soup.find("span", attrs={"class": "ms-rteStyle-Quote"})

#TODO: byte / string issue here on updatecheck
# Keep everything after the "at " that precedes the timestamp
updatecheck = stats.text[stats.text.find("at "):][3:]

# Compare against the timestamp saved by the previous run
if os.path.isfile(lastupdatedfile):
    with open(lastupdatedfile) as f:
        lastupdate = f.read()
    if lastupdate == updatecheck:
        print("Skipping check, no new update.")
        sys.exit(0)
    else:
        print("***UPDATE***\nOld: {}".format(lastupdate))
        print("New: {}".format(updatecheck))

# The county table is currently the fourth table on the page
tables = pd.read_html(html_content, header=0)
df = tables[3]
totalCounties = 67

print("Pennsylvania Data ({})".format(updatecheck))

deathsTotal = int(df["Deaths"].sum())
casesTotal = int(df["Number of Cases"].sum())
mortalityPercent = round((deathsTotal / casesTotal) * 100, 2)

reportingTotal = int(df["County"].count())
reportingCases = df["Number of Cases"]
reportingCasesPct = round((reportingCases.count() / totalCounties) * 100, 2)

# Count the counties with at least one reported death
reportingDeaths = int((df["Deaths"] > 0).sum())
reportingDeathsPct = round((reportingDeaths / reportingTotal) * 100, 2)

print("Total Cases: {}".format(casesTotal))
print("Total Deaths: {}".format(deathsTotal))
print("Mortality Rate(%): {}".format(mortalityPercent))
print("Counties Reporting Cases: {}".format(reportingCases.count()))
print("Counties Reporting Cases(%): {}".format(reportingCasesPct))
print("Counties Reporting Deaths: {}".format(reportingDeaths))
print("Counties Reporting Deaths(% of counties reporting cases): {}".format(reportingDeathsPct))

# Remember this timestamp so the next run can skip unchanged data
with open(lastupdatedfile, "w") as f:
    f.write(updatecheck)

# Add some notification stuff here...
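Because the page layout keeps changing, picking the county table by position (tables[3]) is one of the more fragile parts of the script. One possible way to soften that, sketched here and not taken from the repository, is to select the table by its column names. The helper name find_county_table and the REQUIRED_COLUMNS set are made up for this illustration; the column names themselves are the ones the script already uses.

```python
import pandas as pd

# Column names the main script relies on; everything else here is hypothetical.
REQUIRED_COLUMNS = {"County", "Number of Cases", "Deaths"}


def find_county_table(tables):
    """Return the first DataFrame that contains every required column."""
    for table in tables:
        if REQUIRED_COLUMNS.issubset(table.columns):
            return table
    raise ValueError(
        "No table with columns: {}".format(sorted(REQUIRED_COLUMNS))
    )


if __name__ == "__main__":
    # Stand-in tables instead of a live request, just to show the lookup.
    demo_tables = [
        pd.DataFrame({"Negative": [100], "Positive": [5]}),
        pd.DataFrame(
            {"County": ["Adams"], "Number of Cases": [5], "Deaths": [0]}
        ),
    ]
    print(find_county_table(demo_tables))
```

In the main script this would replace the tables[3] lookup with df = find_county_table(tables). It still breaks if the department renames a column, but it then fails with a clear error instead of silently reading the wrong table.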
Output Example
Pennsylvania Data (12:00 p.m. on 3/28/2020)
Total Cases: 2751
Total Deaths: 34
Mortality Rate(%): 1.24
Counties Reporting Cases: 56
Counties Reporting Cases(%): 83.58
Counties Reporting Deaths: 13
Counties Reporting Deaths(% of counties reporting cases): 23.21
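The script ends with a placeholder for notifications. As a rough sketch of what could go there, and not the code from the repository, the following uses Python's standard smtplib and email modules to turn the numbers into an email. The function names build_alert and send_alert, the addresses, and the SMTP host and port are all assumptions for illustration; a real setup would likely need authentication and TLS.

```python
import smtplib
from email.message import EmailMessage


def build_alert(updatecheck, cases_total, deaths_total, sender, recipient):
    """Build an email summarising the latest numbers (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "PA COVID-19 update ({})".format(updatecheck)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "Total Cases: {}\nTotal Deaths: {}\n".format(cases_total, deaths_total)
    )
    return msg


def send_alert(msg, host="localhost", port=25):
    """Send the message through an SMTP server; host and port are assumed."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Placeholder addresses; nothing is sent unless send_alert() is called.
    alert = build_alert(
        "12:00 p.m. on 3/28/2020", 2751, 34,
        "alerts@example.com", "me@example.com",
    )
    print(alert)
```

Calling send_alert(build_alert(...)) right after the timestamp is written would tie the alert to the same "new update" branch that the cron job already detects.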