Time series data is massive! Learn how to organize and display it in a meaningful way to your users.
When working with time series data, the amount of data is overwhelming. You usually are dealing with millions of rows of data. To consume all this data, we’ll need a database to help us sort, filter, and group it so that it’s meaningful. Once we have the data in the correct format, we’ll need to graph it, so it’s easily digestible.
Top JS Graphing Libraries for Time Series Data
There are a lot of graphing libraries out there. Pick one based on your specific requirements.
Here are a few libraries that are highly rated for different purposes:
After looking at a few of these and trying to plot time series data, I came across C3. It is compelling and customizable. We use it for data visualization within Hyprcubd’s console.
Now that we have a graphing library picked out, we need some data. Any data here will do with only one requirement – time! I’ve chosen the NYC Taxi and Limousine Commission (TLC) trip data since it’s free and easily accessible. You can find more about it here: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Using the High Volume For Hire Vehicle data for January 2020, there are over 20 million rows. This amount of data is pretty standard when dealing with time series data sets. We are going to take the first 1M rows for use in our test database.
Here is our data model:
To get some insights out of this data, we need to put this into a database. A database will give us lots of options for grouping, filtering, sorting, and aggregating. Let’s put this into MySQL for our example.
create table trip ( id int not null primary key auto_increment, license varchar(10), dispatch_loc varchar(10), pickup_time datetime, dropoff_time datetime, pickup_loc int, dropoff_loc int, sr_flag int, index (pickup_time) );
Loading the data is beyond the scope of this article. MySQL has a useful reference here on loading data from a CSV. Be warned – loading the full 20M rows will take a significant amount of time. Using the smallest MySQL instance size on DigitalOcean, I was able to load about 1M rows in about 10 minutes.
What answers do we want to figure out? Let’s start with some basic questions.
- How many trips occurred in January?
- How many trips occur each day?
- What are the peak trip times?
- What is the market share of trips by service? (Uber, Lyft, etc.)
Since we aren’t working with the full data set, let’s start with the second question:
How many trips per day?
select date(pickup_time), count(1) from trip group by date(pickup_time);
This query truncates the time to the current date and counts for that day. This is relatively straightforward. Let’s go to the next question:
What are the peak hours for the first day?
select from_unixtime(t.t*3600) as hour, t.c as trips from ( select round(unix_timestamp(pickup_time)/3600) as t, count(1) as c from trip where day(pickup_time) = 1 group by round(unix_timestamp(pickup_time)/3600) ) as t;
MySQL does not have an easy way (that I know of) to extract out a specific time interval and group it by that interval. If anyone knows of a better way, please leave a comment below, and I’ll update it!
Here is the data returned:
Let’s graph it:
This looks pretty good! Since we aren’t working with the full dataset, we can’t draw any meaningful conclusions. With that said, the time series graph is looking pretty.
C3 has excellent features for time series data, such as the Flow API. The API allows you to stream data into the chart as it comes to your UI in real-time. This is very useful with time series databases.
What is the market share between the different services?
Although this is not a time series graph, it is useful to see the approximate rankings of each service. Uber receives the lion’s share of trips in NYC.
Time Series Database – Hyprcubd
How does a time series database differ when handling this data, and would it make your life easier? Absolutely!
First, the schema would have to change slightly:
create table trip ( time time, license string, dispatch_loc string, dropoff_time time, pickup_loc int, dropoff_loc int, sr_flag int );
Hyprcubd supports basic data types. One improvement is from varchar to string. Since we are a column-oriented database, a string can be any length, and it’s variable length encoded to maximize space.
The first time column is required. We are using pickup_datetime as time in this example.
How many trips per day?
select time, count(time) from trip timeseries 1d;
This query should look pretty familiar to most developers who know SQL. The only difference is the time series expression. This expression is unique to Hyprcubd. It tells the query engine to group the results by one day. This version is a slight improvement over the MySQL query.
What are the peak hours on the first day?
select time, count(time) from trip timeseries 1h;