Thursday 7 May 2020

Highways England traffic flow data and Covid-19

Introduction

I collected and ingested 15-minute Road Flow reports from the Highways England API for around 400 sites across England. The data spans from 2019 to April 2020, i.e. it covers the period before and including the beginning of the Covid-19 pandemic. The idea was to see what new patterns emerged in the new situation.

Data collection and ingestion

I used Python to collect the data. After collection, latitude, longitude, road names and a few more columns were added to make the dataset richer.

Apache Druid unified console showing the new columns in INFORMATION_SCHEMA
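The enrichment step described above can be sketched roughly as follows. This is a minimal illustration, not the original collection script: the field names (site_id, latitude, longitude, road_name, volume) and the sample values are assumptions made for the example.

```python
# Hypothetical field names and sample values; the real report and site
# schemas may differ.
def enrich_reports(reports, sites):
    """Attach site metadata (latitude, longitude, road name) to each report."""
    sites_by_id = {site["site_id"]: site for site in sites}
    enriched = []
    for report in reports:
        # Reports from unknown sites keep their data and get empty metadata.
        site = sites_by_id.get(report["site_id"], {})
        enriched.append({
            **report,
            "latitude": site.get("latitude"),
            "longitude": site.get("longitude"),
            "road_name": site.get("road_name"),
        })
    return enriched


sites = [{"site_id": 1, "latitude": 51.5, "longitude": -0.1, "road_name": "M25"}]
reports = [
    {"site_id": 1, "timestamp": "2019-03-01T00:00:00", "volume": 42},
    {"site_id": 2, "timestamp": "2019-03-01T00:00:00", "volume": 7},
]
enriched = enrich_reports(reports, sites)
```

Doing the join before ingestion means every row already carries its location and road name, so they can be used as dimensions without a lookup at query time.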


The data was ingested into Apache Druid using Druid's UI. As my set-up was a single micro machine, I ran into problems loading the larger data set. To address that, I first increased the JVM memory for the MiddleManager, Overlord and Historical processes to stop the garbage collector from falling into a vicious circle of exceeding the GC overhead limit. That was not enough, so I also had to partition the ingested data set by site and timespan.
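One way to partition the data by site and timespan is sketched below. It assumes monthly batches and hypothetical field names (site_id, timestamp); the batch size actually used for ingestion is not stated here, so treat the month granularity as an illustration only.

```python
from collections import defaultdict
from datetime import datetime


def partition_reports(reports):
    """Group reports into batches keyed by (site_id, year, month)."""
    batches = defaultdict(list)
    for report in reports:
        ts = datetime.fromisoformat(report["timestamp"])
        batches[(report["site_id"], ts.year, ts.month)].append(report)
    return dict(batches)


# Illustrative sample, not real measurements.
sample = [
    {"site_id": 1, "timestamp": "2019-03-01T00:00:00", "volume": 10},
    {"site_id": 1, "timestamp": "2019-03-01T00:15:00", "volume": 12},
    {"site_id": 1, "timestamp": "2019-04-01T00:00:00", "volume": 9},
    {"site_id": 2, "timestamp": "2019-03-01T00:00:00", "volume": 30},
]
batches = partition_reports(sample)
```

Each batch can then be written to its own file and loaded as a separate ingestion task, which keeps the memory needed by any single task small.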

Data quality

Once the reports were ingested into Druid, I decided to measure the data quality. The simplest check was whether every day has the same number of reports. I was looking for ninety-six reports a day per site, as they arrive at fifteen-minute intervals (four per hour over 24 hours). Starting from the traffic dataset, which holds the source data at its finest granularity, I used Druid's grouping capabilities to create a traffic_counts dataset. Druid rolled up the traffic dataset instantaneously and created a new metric based on the count of reports per day. The last step was to visualise the new dataset in Superset.

Y axis is day of year, X axis is sites
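The same completeness check can be sketched outside Druid in plain Python. This is an illustration of the idea rather than the roll-up that was actually run; the field names and the generated sample data are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

EXPECTED_PER_DAY = 24 * 60 // 15  # 96 reports at 15-minute intervals


def daily_report_counts(reports):
    """Count reports per (site_id, day); the day is the ISO date prefix."""
    counts = Counter()
    for report in reports:
        counts[(report["site_id"], report["timestamp"][:10])] += 1
    return counts


def incomplete_days(counts):
    """Return the (site_id, day) pairs whose count differs from 96.

    Days with no reports at all never appear in counts, so they have to be
    found separately (they are the white spaces in the heat map).
    """
    return {key: n for key, n in counts.items() if n != EXPECTED_PER_DAY}


def make_reports(site_id, day, n):
    """Generate n synthetic 15-minute reports for one site and day."""
    start = datetime.fromisoformat(day)
    return [
        {"site_id": site_id,
         "timestamp": (start + timedelta(minutes=15 * i)).isoformat()}
        for i in range(n)
    ]


reports = make_reports(1, "2019-03-01", 96) + make_reports(1, "2019-03-02", 90)
counts = daily_report_counts(reports)
problems = incomplete_days(counts)
```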

In the picture above, white spaces show missing measurements (the site did not deliver data). As we can see, some of these sites were not working for quite some time. However, the volume of missing data was not huge and did not stop me from generating the reports I intended. The dark turquoise colour represents the correct number of reports per day for a site, i.e. ninety-six. The most interesting features are the three horizontal lines. They show days of the year on which the number of reports is wrong for all sites at once. This may be caused by a faulty database export process, and it is something that Highways England should investigate.
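The horizontal lines could also be found programmatically by looking for days on which no site delivered the expected ninety-six reports. The sketch below assumes the per-day counts are available as a dictionary keyed by (site_id, day); the sample values are made up for illustration.

```python
from collections import defaultdict

EXPECTED_PER_DAY = 96


def days_wrong_for_all_sites(counts, all_sites):
    """Return the days on which every site has a count other than 96."""
    per_day = defaultdict(dict)
    for (site_id, day), n in counts.items():
        per_day[day][site_id] = n
    suspicious = []
    for day, by_site in sorted(per_day.items()):
        # A site with no entry for the day is treated as 0 reports.
        if all(by_site.get(site, 0) != EXPECTED_PER_DAY for site in all_sites):
            suspicious.append(day)
    return suspicious


# Illustrative counts, not real measurements.
counts = {
    (1, "2019-03-30"): 96, (2, "2019-03-30"): 96,
    (1, "2019-03-31"): 92, (2, "2019-03-31"): 92,
    (1, "2019-04-01"): 96, (2, "2019-04-01"): 80,
}
suspicious_days = days_wrong_for_all_sites(counts, all_sites=[1, 2])
```

A day that is wrong for one site only points at that site, while a day that is wrong everywhere points at something upstream of the sites, such as the export.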

First pattern found

During March 2019 the volume of vehicles using the four selected main roads was constant from week to week, so the pattern shown in the figure below was stable. People travelled mostly on Mondays and came back home on Fridays. The least travel was observed at weekends.

Average volume of vehicles in March 2019
For the four selected main roads, X axis is March, Y axis is volume per day
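A weekly pattern like this can be summarised by averaging the daily volumes per weekday. The sketch below assumes the daily totals are already available as a mapping from date to volume; the numbers are invented for the example and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_volume_by_weekday(daily_volumes):
    """Average the daily volume separately for each day of the week."""
    grouped = defaultdict(list)
    for day, volume in daily_volumes.items():
        grouped[WEEKDAYS[day.weekday()]].append(volume)
    return {name: sum(values) / len(values) for name, values in grouped.items()}


# Invented volumes for two Mondays and one Saturday in March 2019.
daily_volumes = {
    date(2019, 3, 4): 1000,
    date(2019, 3, 11): 1200,
    date(2019, 3, 9): 400,
}
weekday_profile = average_volume_by_weekday(daily_volumes)
```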

In 2020 the pattern is different. Early in March, people started to follow the government advice to reduce travel. But the deepest drop in travel can be observed on the 27th of March, when the countrywide lockdown began.

Average volume of vehicles in March 2020
For the four selected main roads, X axis is March, Y axis is volume per day
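To put a number on the difference between the two charts, the daily volumes of March 2019 and March 2020 can be compared day by day. This is a minimal sketch that assumes both series are keyed by day of month; the volumes are invented for the example.

```python
def percent_change_by_day(volumes_2019, volumes_2020):
    """Percentage change from 2019 to 2020 for days present in both years."""
    changes = {}
    for day in sorted(volumes_2019.keys() & volumes_2020.keys()):
        base = volumes_2019[day]
        if base:  # skip days with a zero baseline to avoid dividing by zero
            changes[day] = 100.0 * (volumes_2020[day] - base) / base
    return changes


# Invented volumes keyed by day of March.
march_2019 = {1: 1000, 27: 1000}
march_2020 = {1: 900, 27: 400}
changes = percent_change_by_day(march_2019, march_2020)
```

Note that the same calendar date falls on a different weekday in each year, so aligning the two series by weekday rather than by date may give a fairer comparison.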

What is next?

As of today, data for April 2020 has not been published yet. The next step could be to compare March and April 2020 to see how people follow the new rules. Another idea is to drill further into individual roads and see, for example, how traffic flows between specific junctions.