Connecting to socrata datasets from R
The socrata data platform allows governments to make data easily available, both for their own use and publicly.
The easiest way to get data that is hosted on a socrata platform into R for analysis is through the RSocrata package. This post walks through how to do this
First, we need to source some data. In this example, we are going to use data from the state of Connecticut’s COVID-19 portal. This will allow us to analyze daily cases, hospitalization trends and death rates progression across the state.
To do this, we need to find the source of the data tables. Connecticut makes this easy having provided a data portal here. We can navigate through to the county-level data to view the information. To get the link to the data, we can select the API button and get a link to the required JSON information
This gives us all the information we need to extract data and analyze this data set. To start with we will need to download devtools and install the RSocrata package from github. At this point we are also going to include a few other libraries to help with the analysis
#install.packages(“devtools”)
#devtools::install_github(“Chicago/RSocrata”)
library(RSocrata)
library(ggplot2)
library(lubridate)
library(scales)
library(dplyr)
Once we have everything installed and ready to go, we can load and cleanup the dataset. Note that the data is downloaded in character format and needs converting to numbers for calculations and charting. The JSON link from the online dataset is used here.
ct.county.data=read.socrata("https://data.ct.gov/resource/bfnu-rgqt.json")
ct.county.data$cases=as.numeric(ct.county.data$cases)
ct.county.data$hospitalization=as.numeric(ct.county.data$hospitalization)
ct.county.data$deaths=as.numeric(ct.county.data$deaths)
This data is structured as running-totals of cases and deaths, along with the current daily hospitalization rates. To clean up this data, we can uses the mutate function from dply to create a difference between 2 values within each group (groups in this case are counties)
ct.county.data=ct.county.data%>%
group_by(county)%>%
mutate(cases_daily=cases-lag(cases, default=0),
deaths_daily=deaths-lag(deaths, default=0))%>%
select(dateupdated, county, hospitalization, cases_daily, deaths_daily)%>%
gather("metric", "value", -c(dateupdated, county))
In addition, this data has been converted into ‘long’ format, to make the visualization of the data slightly easier. Using ggplot we can quickly chart this data to see the latest trends
county.chart=ggplot(data=ct.county.data)+
geom_line(aes(x=dateupdated, y=value, color=county))+
facet_grid(metric~., scales="free")+
scale_color_brewer(palette="Dark2")+
scale_y_continuous(labels=comma)+
theme_minimal()+
labs(title="",
subtitle="daily COVID-19 cases in Connecticut by county",
x="",
y="",
caption="")
from this data, through May 12th 2020, we can see the declining hospitalization trends, but also the high degree of variability in daily cases that have been caused by the delivery of sample-results (In this view, the case data is mapped to the sample result date, not the date of the test)