Taming Transportation Data

The Problem

While the transportation logistics industry has collected more and more data on the movements of vehicles in the past decade, sophisticated analysis has become extremely difficult due to the sheer volume of data generated by sensors, vendors, and personnel.

Pilosa was built to handle massive and high cardinality data sets such as these. Moreover it accelerates queries of the data in order to improve data-based logistical decisions that in turn reduce costs while achieving high delivery standards.

Logistics optimization currently represents the largest cost reduction opportunity for the transportation industry.

The Data

As an example data set, and following a lead from Todd Schneider’s research in this area, the Pilosa team examined data collected by the New York City Taxi & Limousine Commission collected from January 2009 to June 2015. This data set includes detailed information on 1.3 billion individual taxi trips in New York City. Pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts are all captured to create an impressive illustration of the city’s transportation needs. At the end of this article we link to the sample data set so that you can play with it in Pilosa.

In the table below, we list the fields of the trip dataset (they are provided as-is):

Field Name Description Type
pickup_datetime The date and time when the meter was engaged. Floating Timestamp
dropoff_datetime The date and time when the meter was disengaged. Floating Timestamp
Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip Text
rate_code The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride Number
Pickup_longitude Longitude where the meter was engaged. Number
Pickup_latitude Latitude where the meter was engaged. Number
Dropoff_longitude Longitude where the meter was disengaged. Number
Dropoff_latitude Latitude where the meter was disengaged. Number
Passenger_count The number of passengers in the vehicle. This is a driver-entered value. Number
Trip_distance The elapsed trip distance in miles reported by the taximeter. Number
Fare_amount The time-and-distance fare calculated by the meter. Extra Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. Number
Extra - Number
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered rate in use. Number
Tip_amount Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. Number
Tolls_amount Total amount of all tolls paid in trip. Number
Ehail_fee - Number
Improvement_surcharge $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. Number
Total_amount The total amount charged to passengers. Does not include cash tips. Number
Payment_type A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip Number
Trip_type A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch Number


Pilosa is a distributed bitmap index that sits on top of a data store. The key to understanding and then using Pilosa is converting data such that it is represented in ones and zeros. This dramatically reduces the size as well as accelerates query times.

For example, timestamps are important information, but we tend to be interested in individual components of a timestamp, especially when analyzing data with cyclic trends. Timestamp components are stored as groups of bitmaps, known as “frames”. We create one frame for the day of the week, as illustrated in the following table. Along with similar frames for year, month, and time of day, this accelerates queries that ask questions about rides belonging to any logical combination of these time groups.

Pickup Day Ride 1 Ride 2
Monday 1 0
Tuesday 0 0
Wednesday 0 0
Thursday 0 1
Friday 0 0
Saturday 0 0
Sunday 0 0

This table represents a small portion of one frame, showing whether individual bits are set.

This is the ultimate data model:

Frame Ride 1 Ride 2
cab_type 0 (Yellow) 2 (Uber)
dist_miles 4 22
total_amount_dollars 18 27
passenger_count 1 2
drop_grid_id 9424 8893
drop_year 2013 2013
drop_month 8 9
drop_day 0 (Monday) 3 (Thursday)
drop_time 14 15
pickup_grid_id 9366 7136
pickup_year 2013 2013
pickup_month 8 9
pickup_day 0 3
pickup_time 13 13
average_speed 30 17
duration 14 44

In this table, the integers are IDs that indicate which single bit is set in the corresponding frame for a given ride.

Importing the data into Pilosa

After deciding on a data model we created the index and its associated frames using the Pilosa API. The API enables you to do bulk uploads, or to write streaming data into an index, as in this case. Check out the Pilosa Development Kit for in-depth examples of data imports.

In this case we were able to import the 1.3B records into Pilosa in approximately 90 minutes. Condensing the 500GB data set with this model gives us a 40GB representation.

Analyzing the data with Pilosa

Because each data point includes pickup/dropoff times and total distance travelled, it’s easy to determine the average speed of the trip. As an example, we use this as a first order approximation of congestion. We created a frame representing average speed, with a spacing of 1 mph.

In order to answer questions about congestion, we needed to first determine what speeds constitute slow traffic. One of the basic queries in Pilosa is the TopN function, and we used that to get a list of all the different average speeds. By performing a count on each we built a histogram of how many rides fall into each speed bucket, and decided from there which buckets deviate enough from the norm to constitute congestion.

Here are a few examples of typical requests in the Pilosa Query Language that ships with the Pilosa Community Edition. For brevity, <slow speeds> for example represents a list of bitmaps in the speed frame.

  1. TopN(Union(<slow speeds>), frame="pickup_loc"

    This gives us all the pickup locations with the most slow rides. As with all TopN calls, the results are ordered by count, which means the number of slow rides here.

  2. TopN(Intersect(Union(<slow speeds>), <pickup location A>)), frame="dropoff_loc")

    This gives us all the dropoff locations associated with slow rides originating at a specific pickup location, which might be as specific as a city block, or as broad as a borough.

  3. TopN(Intersect(Union(<slow speeds>), Union(<pickup locations>)), frame="dropoff_loc")

    This gives us all the dropoff locations associated with slow rides originating at any one of a group of pickup locations.

  4. TopN(Intersect(Union(<slow speeds>), Union(<pickup locations>), Intersect(<rush hour times>)), frame="dropoff_loc")

    This gives us all the dropoff locations associated with slow rides that occurred during rush hour, originating at any one of a group of pickup locations.

What is remarkable is that Pilosa is able to perform logical operations by querying over 1 billion data points, in under a second. The core data model, while extremely simple, is flexible enough to support interesting queries with simple techniques like scalar bucketing, and minimal preprocessing.

With a minor extension to the scalar bucketing described above, we can use Pilosa for more advanced analysis. Representing locations as grid cells allows us to use the full power of Pilosa on two-dimensional data. For example, running a TopN query on billions of rides’ grid cells can produce a heatmap representation, for logical queries like the above examples, of any complexity. This technique results in a large number of grid cells, but this is no problem - Pilosa was built to support extremely high cardinality data.


Data TopN

References: Oracle. Improving Logistics & Transportation Performance with Big Data: Architect’s Guide and Reference Architecture Introduction. February 2015.

Paul A, Saravanan V, Thangaiah PRJ. “Data Mining Analytics to Minimize Logistics Cost.” International Journal of Advances in Science and Technology. 2(3). 2011. p89-107.

New York City Taxi & Limousine Commission. TLC Trip Record Data. Updated March 13, 2017.

Get even more from Pilosa

Stay updated on our progress and upcoming events.