How to connect to Eurostat with R
Eurostat is part of the European Commission and is responsible for harmonizing statistical methods across member states. This covers countries in the EU and EFTA, as well as candidate countries. It provides statistical information to the institutions of the EU, as well as to the public.
The data collected covers a range of population, social, economic, industrial, environmental and geographical data. The most important statistical data are made available by press release and disseminated through their databases at 11am on the day of release.
The overall database is available on their website here. In addition, access is available through SDMX and JSON requests, but the handy R package eurostat has been developed as part of the rOpenGov initiative and is available on CRAN.
install.packages('eurostat')
library(eurostat)
While the Eurostat database can be navigated through its tree structure, this can be a bit cumbersome.
The easier approach, from within R, is to search the Eurostat database for tables, datasets or folders containing a given keyword.
population_tables = search_eurostat("population", type = "table")
population_datasets = search_eurostat("population", type = "dataset")
population_folders = search_eurostat("population", type = "folder")
This will return every entry that has a reference to, in our example, population, along with its code, the date of last update, the date of the last change to the structural data, and the start and end dates of the information available. A few searches will help you refine the results and find the information you are looking for, but take a bit of care, as the search returns only exact matches for a given string.
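Because of the exact matching, it can help to search the table of contents more loosely. The following is a minimal sketch in base R, not part of the eurostat package: `search_toc` is a hypothetical helper, and the small `toc` data frame is an illustrative stand-in for what `get_eurostat_toc()` returns (assumed here to have at least `title`, `code` and `type` columns).

```r
# Illustrative stand-in for the table of contents; in practice this would
# come from eurostat::get_eurostat_toc(). The last row is made up.
toc <- data.frame(
  title = c("Population density by NUTS 3 region",
            "Population on 1 January by broad age group, sex and NUTS 3 region",
            "Some unrelated economic indicator"),
  code  = c("demo_r_d3dens", "demo_r_pjanaggr3", "example_code"),
  type  = c("dataset", "dataset", "table"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose title matches the keyword,
# ignoring upper/lower case. The keyword is treated as a regular expression.
search_toc <- function(toc, keyword, type = NULL) {
  hit <- grepl(keyword, toc$title, ignore.case = TRUE)
  if (!is.null(type)) hit <- hit & toc$type == type
  toc[hit, , drop = FALSE]
}

# "nuts 3" in lower case still finds both regional datasets
search_toc(toc, "nuts 3", type = "dataset")$code
```

Treating the keyword as a regular expression also allows patterns such as "NUTS ?3" to catch both spellings, at the cost of having to escape special characters.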
In this example, we are going to look at the distribution of population density within each country by NUTS3 region. More background information on the NUTS classification can be found here. A quick search for “NUTS 3” will identify ‘demo_r_d3dens’ as the code we need to query the database for the population density data, and we are also going to download the population metrics using the code ‘demo_r_pjanaggr3’.
population.datasets = search_eurostat("NUTS 3", type = "dataset")
pop.density.data=get_eurostat("demo_r_d3dens", type="code", time_format="num")
pop.data=get_eurostat("demo_r_pjanaggr3", type="code", time_format="num")
The data is starting to become useful, but it is keyed only by NUTS code, with no further information that would help us process it. It includes the NUTS3 data, but also the higher-level NUTS2, NUTS1 and country-level data that we need to filter out. Thankfully, the NUTS naming convention is hierarchical and we can break the geo codes apart quite easily. For our purposes, we are going to filter the data to look at 2018 across regions, drop the time-series data and combine the two data sets.
library(tidyr) #adding some packages to help data processing
library(dplyr)
pop.density.data.2018=pop.density.data%>%
  filter(time==2018)%>%
  select(geo, values)
colnames(pop.density.data.2018)=c("geo", "pop_density")
pop.data.2018=pop.data%>%
  filter(time==2018)%>%
  select(geo, values)%>%
  group_by(geo)%>% #in this data set we need to aggregate up the demographic population data to total geo
  summarize_all(sum)
colnames(pop.data.2018)=c("geo", "population")
population.data=merge(pop.density.data.2018, pop.data.2018,
                      by.x=c("geo"),
                      by.y=c("geo"),
                      all.x=TRUE, all.y=TRUE)
population.data$pop_density[is.na(population.data$pop_density)]=0 # replace missing densities with 0
population.data$population[is.na(population.data$population)]=0 # replace missing populations with 0
population.data$country=substr(population.data$geo, 1, 2) # country is the first 2 letters
population.data$code=substr(population.data$geo, 3, 5)
population.data$nuts1=substr(population.data$code, 1, 1) # NUTS1 level is the first character after the country
population.data$nuts2=substr(population.data$code, 1, 2) # NUTS2 level is the first 2 characters after the country
population.data$nuts2[nchar(population.data$nuts2)<2]=NA # if only one character then this is the NUTS1 level rollup
population.data$nuts3=substr(population.data$code, 1, 3) # NUTS3 level is the first 3 characters after the country
population.data$nuts3[nchar(population.data$nuts3)<3]=NA # if there are fewer than 3 characters after the country this is a NUTS1 or NUTS2 code
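The same hierarchical reading of the geo codes can be sketched as two small helpers. This is an illustration in base R rather than part of the workflow above; `nuts_level` and `nuts_parent` are hypothetical names, and the sketch assumes that a plain NUTS code is two country letters followed by up to three further characters.

```r
# Hypothetical helper: derive the NUTS level from the length of a geo code.
# Two letters identify the country (level 0); each extra character adds a level.
nuts_level <- function(geo) {
  level <- nchar(as.character(geo)) - 2L
  # anything outside levels 0-3 (e.g. an aggregate code) is not a plain NUTS code
  level[level < 0L | level > 3L] <- NA_integer_
  level
}

# Hypothetical helper: parent code at a given level, i.e. the country
# letters plus the first `level` characters after them.
nuts_parent <- function(geo, level) {
  substr(as.character(geo), 1L, 2L + level)
}

geo <- c("DE", "DE2", "DE21", "DE212", "EU27_2020")
nuts_level(geo)          # levels 0, 1, 2, 3 and NA for the aggregate code
nuts_parent("DE212", 2)  # the NUTS2 region that contains DE212
```

Filtering on `nuts_level(geo) == 3` would then be an alternative to testing the split columns for NA.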
This results in nicely structured data to do some analysis on. As an example, let’s compare the distribution of population density by population between Germany, Spain, France, the UK and Poland.
analysis.data=population.data%>%
  filter(country %in% c("DE", "ES", "FR", "PL", "UK"))%>% # only include selected countries
  filter(!is.na(nuts3))%>% # only include the lowest level of hierarchy for the NUTS3 data
  select(country, nuts3, population, pop_density )%>% #select only the facts that we need
  arrange(country, desc(pop_density))%>% # rank each country's regions by population density
  group_by(country)%>% #set the grouping variable as the country level
  mutate(cumsum_pop=cumsum(population)/sum(population)) #create a cumulative sum of % of population by country
library(ggplot2)
library(scales)
library(RColorBrewer)
chart=ggplot(data=analysis.data)+
  geom_line(aes(x=cumsum_pop, y=pop_density, color=country))+
  scale_color_brewer(palette="Dark2")+
  scale_y_continuous(labels=comma)+
  scale_x_continuous(labels=percent)+
  theme_minimal()+
  theme(legend.title = element_blank())+
  labs(title="Distribution of Population Density across Countries",
       subtitle="population per square km",
       x="",
       y="",
       caption="% of country's population")
From this we can see that France has some very densely populated regions in Paris, housing about 12% of the population, but density drops off very quickly. The UK has some regions with densities close to those of Paris, but does not see the drop-off in population density that France has. Germany has a wider distribution of fairly dense areas, but none as extreme as some parts of France or the UK.
As for Spain, we can see that a few densely populated cities house about 15% of the population, and then the country becomes much more sparsely populated.
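To make the reading of the x-axis concrete, here is a small sketch of how the cumulative population share is built, using made-up numbers for two hypothetical countries rather than Eurostat data. It uses base R only; the column names mirror those used in the analysis, but everything else is an assumption for illustration.

```r
# Made-up regions for two hypothetical countries (not Eurostat data).
toy <- data.frame(
  country     = c("AA", "AA", "AA", "BB", "BB"),
  population  = c(100, 300, 600, 500, 500),
  pop_density = c(50, 2000, 400, 120, 80)
)

# Rank regions from densest to sparsest within each country.
toy <- toy[order(toy$country, -toy$pop_density), ]

# Running share of each country's population, densest regions first.
toy$cumsum_pop <- ave(toy$population, toy$country,
                      FUN = function(p) cumsum(p) / sum(p))
toy
```

In this toy data the densest region of country AA holds 30% of its population, and the two densest regions together hold 90%, which is how a point on each line of the chart should be read.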