Pilosa was originally built for the very specific use case of arbitrary audience segmentation on hundreds of millions of people with tens of millions of attributes. As we’ve spun out a separate company around Pilosa, a natural first step was to test its efficacy on other types of problems.
While the user segmentation use case has a very high number of attributes, the data is extremely sparse—most attributes are only associated with a handful of people, and most individuals only have a few hundred or a few thousand attributes. Pilosa handles this type of data gracefully; in just milliseconds, one can choose any boolean combination of the millions of attributes and find the segment of users which satisfies that combination.
Image courtesy of Umbel
For audience segmentation, we’d happily pit Pilosa against anything out there on similar hardware. However, if we want Pilosa to be an index which serves as a general purpose query acceleration layer over other data stores, it will have to be effective at more than just segmentation queries, and more than just sparse, high cardinality data. In order to test Pilosa, we needed a dataset which had dense, lower cardinality data, and ideally one which has been explored by other solutions so that we could get a feel for how we might fare against them. The billion taxi ride dataset fit the bill perfectly. There are myriad blog posts analyzing the dataset with various technologies; here are just a few:
Of particular usefulness to us are the series of posts by Mark Litwintschik and the performance comparison table he compiled. Here is the table for reference:
We implemented the same four queries against Pilosa so that we could get some comparison of performance against other solutions.
Here is the English description of each query:
Now we have a dataset and a set of queries which are far outside Pilosa’s comfort zone, along with a long list of performance comparisons on different combinations of hardware and software.
If you’d like to learn more about how we modeled the dataset in Pilosa and structured the queries, please see our transportation use case post.
So instead of tens of millions of boolean attributes, we have just a few attributes with more complex types: integers, floating point, timestamp, etc. In short, this data is far more suitable to a relational database than the data for which Pilosa was designed.
We stood up a 3-node Pilosa cluster on AWS c4.8xlarge instances, with an additional c4.8xlarge to load the data. We used our open source pdk tool to load the data into Pilosa with the following arguments:
pdk taxi -b 2000000 -c 50 -p <pilosa-host-ip>:10101 -f <pdk_repo_location>/usecase/taxi/greenAndYellowUrls.txt
This took about 2 hours and 50 minutes, which includes downloading all of the csv files from S3, parsing them, and loading the data into Pilosa.
If we were to add our results to Mark’s table, it would look like the following:
Note that the hardware and software are different for each setup, so direct comparisons are difficult.
We should note that Pilosa “cheats” a bit on query 1; due to the way it stores data, Pilosa already has this result precomputed, so the query time is mostly network latency.
For the remainder of the queries, however, Pilosa does remarkably well—in some cases beating out exotic hardware such as multi-GPU setups. The 0.177s time on query 3 was particularly startling—performance was along the lines of 8 Nvidia Pascal Titan Xs. It looks like kdb+/q is beating us pretty soundly, but keep in mind that those Xeon Phi 7210s have 256 hardware threads per chip, as well as 16GB of memory on the package. This gives them performance and memory bandwidth closer to GPUs than CPUs. They’re also about $2400 a piece.
For us, these results are enough to validate spending more time optimizing Pilosa for uses outside of its original intent. We know that Pilosa’s internal bitmap compression format is not optimized for dense data, and more research has been done in this area with exciting results (e.g. roaring-run), so we have reason to believe that there is significant room for improvement in these numbers.
UPDATE: After implementing run-length encoding, and some other optimizations, Pilosa is significantly faster on many of these queries and beats every other system on Query 3!
Jaffee is a lead software engineer at Pilosa. When he’s not crunching a billion taxi rides, he enjoys jiu-jitsu, building mechanical keyboards, and spending time with family. Follow him on Twitter at @mattjaffee.
Alan is a software engineer at Pilosa. When he’s not mapping the universe, you can find him playing with laser cutters, building fidget spinners for his dog, or practicing his sick photography skills. Find him on Twitter @gsnark.
Stay updated on our progress and upcoming events.