While the transportation logistics industry has collected more and more data on the movements of vehicles in the past decade, sophisticated analysis has become extremely difficult due to the sheer volume of data generated by sensors, vendors, and personnel.
Pilosa was built to handle massive and high cardinality data sets such as these. Moreover it accelerates queries of the data in order to improve data-based logistical decisions that in turn reduce costs while achieving high delivery standards.
Logistics optimization currently represents the largest cost reduction opportunity for the transportation industry.
As an example data set, and following a lead from Todd Schneider’s research in this area, the Pilosa team examined data collected by the New York City Taxi & Limousine Commission collected from January 2009 to June 2015. This data set includes detailed information on 1.3 billion individual taxi trips in New York City. Pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts are all captured to create an impressive illustration of the city’s transportation needs. At the end of this article we link to the sample data set so that you can play with it in Pilosa.
In the table below, we list the fields of the trip dataset (they are provided as-is):
|pickup_datetime||The date and time when the meter was engaged.||Floating Timestamp|
|dropoff_datetime||The date and time when the meter was disengaged.||Floating Timestamp|
|Store_and_fwd_flag||This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip||Text|
|rate_code||The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride||Number|
|Pickup_longitude||Longitude where the meter was engaged.||Number|
|Pickup_latitude||Latitude where the meter was engaged.||Number|
|Dropoff_longitude||Longitude where the meter was disengaged.||Number|
|Dropoff_latitude||Latitude where the meter was disengaged.||Number|
|Passenger_count||The number of passengers in the vehicle. This is a driver-entered value.||Number|
|Trip_distance||The elapsed trip distance in miles reported by the taximeter.||Number|
|Fare_amount||The time-and-distance fare calculated by the meter. Extra Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.||Number|
|MTA_tax||$0.50 MTA tax that is automatically triggered based on the metered rate in use.||Number|
|Tip_amount||Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.||Number|
|Tolls_amount||Total amount of all tolls paid in trip.||Number|
|Improvement_surcharge||$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.||Number|
|Total_amount||The total amount charged to passengers. Does not include cash tips.||Number|
|Payment_type||A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip||Number|
|Trip_type||A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch||Number|
Pilosa is a distributed bitmap index that sits on top of a data store. The key to understanding and then using Pilosa is converting data such that it is represented in ones and zeros. This dramatically reduces the size as well as accelerates query times.
For example, timestamps are important information, but we tend to be interested in individual components of a timestamp, especially when analyzing data with cyclic trends. Timestamp components are stored as groups of bitmaps, known as “frames”. We create one frame for the day of the week, as illustrated in the following table. Along with similar frames for year, month, and time of day, this accelerates queries that ask questions about rides belonging to any logical combination of these time groups.
|Pickup Day||Ride 1||Ride 2||…|
This table represents a small portion of one frame, showing whether individual bits are set.
|Frame||Ride 1||Ride 2||…|
|cab_type||0 (Yellow)||2 (Uber)|
|drop_day||0 (Monday)||3 (Thursday)|
In this table, the integers are IDs that indicate which single bit is set in the corresponding frame for a given ride.
After deciding on a data model we created the index and its associated frames using the Pilosa API. The API enables you to do bulk uploads, or to write streaming data into an index, as in this case. Check out the Pilosa Development Kit for in-depth examples of data imports.
In this case we were able to import the 1.3B records into Pilosa in approximately 90 minutes. Condensing the 500GB data set with this model gives us a 40GB representation.
Because each data point includes pickup/dropoff times and total distance travelled, it’s easy to determine the average speed of the trip. As an example, we use this as a first order approximation of congestion. We created a frame representing average speed, with a spacing of 1 mph.
In order to answer questions about congestion, we needed to first determine what speeds constitute slow traffic. One of the basic queries in Pilosa is the TopN function, and we used that to get a list of all the different average speeds. By performing a count on each we built a histogram of how many rides fall into each speed bucket, and decided from there which buckets deviate enough from the norm to constitute congestion.
Here are a few examples of typical requests in the Pilosa Query Language that ships with the Pilosa Community Edition.
TopN(Union(<slow speeds>), frame="pickup_loc")
This gives us all the pickup locations with the most slow rides. As with all TopN calls, the results are ordered by count, which means the number of slow rides here.
TopN(Intersect(Union(<slow speeds>), <pickup location A>), frame="dropoff_loc")
This gives us all the dropoff locations associated with slow rides originating at a specific pickup location, which might be as specific as a city block, or as broad as a borough.
TopN(Intersect(Union(<slow speeds>), Union(<pickup locations>)), frame="dropoff_loc")
This gives us all the dropoff locations associated with slow rides originating at any one of a group of pickup locations.
TopN(Intersect(Union(<slow speeds>), Union(<pickup locations>), Intersect(<rush hour times>)), frame="dropoff_loc")
This gives us all the dropoff locations associated with slow rides that occurred during rush hour, originating at any one of a group of pickup locations.
What is remarkable is that Pilosa is able to perform logical operations by querying over 1 billion data points, in under a second. The core data model, while extremely simple, is flexible enough to support interesting queries with simple techniques like scalar bucketing, and minimal preprocessing.
With a minor extension to the scalar bucketing described above, we can use Pilosa for more advanced analysis. Representing locations as grid cells allows us to use the full power of Pilosa on two-dimensional data. For example, running a TopN query on billions of rides’ grid cells can produce a heatmap representation, for logical queries like the above examples, of any complexity. This technique results in a large number of grid cells, but this is no problem - Pilosa was built to support extremely high cardinality data.
References: Oracle. Improving Logistics & Transportation Performance with Big Data: Architect’s Guide and Reference Architecture Introduction. February 2015.
Paul A, Saravanan V, Thangaiah PRJ. “Data Mining Analytics to Minimize Logistics Cost.” International Journal of Advances in Science and Technology. 2(3). 2011. p89-107.
New York City Taxi & Limousine Commission. TLC Trip Record Data. Updated March 13, 2017.
Stay updated on our progress and upcoming events.