The central component of Pilosa’s data model is a boolean matrix. Each cell in the matrix is a single bit - if the bit is set, it indicates that a relationship exists between that particular row and column.
Rows and columns can represent anything (they could even represent the same set of things - a bigraph). Pilosa can associate arbitrary key/value pairs (referred to as attributes) with rows and columns, but queries and storage are optimized around the core matrix.
Pilosa lays out data row-first, so the fastest queries are those that fetch all the set bits in one or more rows, or that combine multiple rows with an operation such as Intersect or Union. Pilosa also groups rows into fields, and can quickly retrieve the top rows in a field, sorted by the number of columns set in each row.
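As a rough illustration of why row-oriented operations are cheap, the sketch below models each row as a Python integer used as a bitmap, so that Intersect and Union become single bitwise operations. The helper names (`make_row`, `columns_of`) and the sample field are hypothetical, and this is not Pilosa's actual storage format.

```python
from typing import Dict, Iterable, List


def make_row(columns: Iterable[int]) -> int:
    """Build a row bitmap: bit c is set when (row, column c) is related."""
    bits = 0
    for c in columns:
        bits |= 1 << c
    return bits


def columns_of(row: int) -> List[int]:
    """List the column IDs whose bit is set in a row bitmap."""
    return [c for c in range(row.bit_length()) if (row >> c) & 1]


# A hypothetical field with three rows, keyed by row ID.
field: Dict[int, int] = {
    0: make_row([0, 2, 5]),
    1: make_row([2, 3, 5]),
    2: make_row([5]),
}

intersect = field[0] & field[1]  # columns set in both rows
union = field[0] | field[1]      # columns set in either row

print(columns_of(intersect))  # [2, 5]
print(columns_of(union))      # [0, 2, 3, 5]
```

Because each row is one contiguous bitmap, combining rows touches only the rows involved, regardless of how many other rows the field holds.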
Please note that Pilosa is most performant when row and column IDs are sequential starting from 0. You can deviate from this to some degree, but setting a bit with column ID 2^63 on a single-node cluster, for example, will not work well due to memory limitations.
Basic data model diagram
The purpose of the Index is to represent a data namespace. You cannot perform cross-index queries.
Column IDs are sequential, increasing integers and are common to all Fields within an Index. A single column often corresponds to a record in a relational table, although other configurations are possible, and sometimes preferable.
Row IDs are sequential, increasing integers namespaced to each Field within an Index.
Fields are used to segment rows within an index, for example to define different functional groups. A Pilosa field might correspond to a single field in a relational table, where each row in a standard Pilosa field represents a single possible value of the relational field. Similarly, an integer field could represent all possible integer values of a relational field.
The Pilosa index is a flexible structure; it can represent any sort of high-cardinality binary matrix. We have explored a number of modeling patterns in Pilosa use cases; one accessible example is a direct analogy to the relational model, summarized here.
|Relational|Pilosa|
|---|---|
|Database|N/A (internal: Holder)|
|Value (int)|Field.Value (see BSI)|
Row(Member=0) selects all entities with a bit set in row 0 of the Member field. We could just as well use row 1 to store this, in which case we would use Row(Member=1), which looks a bit more intuitive.
In the relational model, joins are often necessary. Because Pilosa supports extremely high cardinality in both rows and columns, many types of joins are accomplished with basic Pilosa queries across multiple fields. For example, this SQL join:
select AVG(p.Age) from People p inner join PersonCar pc on pc.PersonID=p.ID inner join Cars c on pc.CarID=c.ID where c.Make = 'Ford'
can be accomplished with a Pilosa query like this (note that Sum returns a JSON object containing both the sum and the count, from which the average is easily computed):
This is one major component of Pilosa’s ability to combine relationships from multiple data stores.
Ranked Fields maintain a sorted cache of column counts by Row ID (yielding the top rows by columns with a bit set in each). This cache facilitates the TopN query. The cache size defaults to 50,000 and can be set at Field creation.
Ranked field diagram
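The idea behind a ranked lookup can be sketched as follows: given a count of set columns per row, return the rows with the highest counts. This is only an illustration; the `top_n` function and its tie-breaking rule are assumptions, and it recomputes counts on each call rather than maintaining a sorted cache as a Ranked Field does.

```python
import heapq
from typing import Dict, List, Set, Tuple


def top_n(rows: Dict[int, Set[int]], n: int) -> List[Tuple[int, int]]:
    """Return up to n (row_id, count) pairs, highest count first."""
    counts = ((row_id, len(cols)) for row_id, cols in rows.items())
    # Order by count descending; break ties by lower row ID (an assumption).
    return heapq.nsmallest(n, counts, key=lambda rc: (-rc[1], rc[0]))


# Hypothetical field: row ID -> set of column IDs with a bit set.
rows = {
    10: {0, 1, 2},
    11: {4},
    12: {0, 3, 5, 7},
    13: {1, 2},
}

print(top_n(rows, 2))  # [(12, 4), (10, 3)]
```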
The LRU cache maintains the most recently accessed Rows.
LRU field diagram
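A least-recently-used policy can be sketched with an ordered dictionary: each access moves a row to the "most recent" end, and the oldest entry is evicted once capacity is exceeded. The class name `LRURowCache` and its methods are hypothetical and only illustrate the eviction behaviour, not Pilosa's internal cache.

```python
from collections import OrderedDict
from typing import List


class LRURowCache:
    """Keep the most recently accessed rows, up to a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._rows: "OrderedDict[int, int]" = OrderedDict()

    def touch(self, row_id: int, count: int) -> None:
        """Record an access to a row and evict the oldest row if full."""
        self._rows[row_id] = count
        self._rows.move_to_end(row_id)      # mark as most recently used
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # drop least recently used

    def row_ids(self) -> List[int]:
        """Row IDs from least to most recently accessed."""
        return list(self._rows)


cache = LRURowCache(capacity=2)
cache.touch(1, 5)
cache.touch(2, 3)
cache.touch(1, 6)  # row 1 becomes the most recent entry
cache.touch(3, 1)  # evicts row 2, the least recently used
print(cache.row_ids())  # [1, 3]
```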
Setting a time quantum on a field creates extra views which allow Range queries down to the time interval specified. For example, if the time quantum is set to YMD, Range queries down to the granularity of a day are supported.
Attributes are arbitrary key/value pairs that can be associated with either rows or columns. This metadata is stored in a separate BoltDB data structure.
Column-level attributes are common across an index. That is, each column attribute applies to all bits in the corresponding column, across all fields in an index. Row attributes apply to all bits in the corresponding row.
Indexes are segmented into groups of columns called shards (previously known as slices). Each shard contains a fixed number of columns, which is the ShardWidth. ShardWidth is a constant that can only be modified at compile time, and only before ingesting data. The default value is 2^20.
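Since every shard holds a fixed number of columns, the shard that owns a column follows from integer division. The sketch below assumes a ShardWidth of 2^20 columns; the function names are hypothetical.

```python
SHARD_WIDTH = 1 << 20  # assumed shard width: 1,048,576 columns per shard


def shard_of(column_id: int) -> int:
    """Index of the shard that contains the given column."""
    return column_id // SHARD_WIDTH


def offset_in_shard(column_id: int) -> int:
    """Position of the column within its shard."""
    return column_id % SHARD_WIDTH


print(shard_of(0))                             # 0
print(shard_of(SHARD_WIDTH - 1))               # 0
print(shard_of(SHARD_WIDTH))                   # 1
print(offset_in_shard(3 * SHARD_WIDTH + 7))    # 7
```

This also shows why sparse, very large column IDs are costly: a single bit at a huge column ID still lands in a shard with a very high index.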
Query operations run in parallel, and they are evenly distributed across a cluster via a consistent hash algorithm.
Views represent the various data layouts within a Field. The primary View is called Standard, and it contains the typical Row and Column data. Time-based Views are automatically generated for each time quantum. Views are internally managed by Pilosa, and never exposed directly via the API.
The standard View contains the same Row/Column format as the input data.
If a Field has a time quantum, then Views are generated for each of the defined time segments. For example, for a field with a time quantum of YMD, the following Set() queries will result in the data described in the diagram below:
Set(3, A=8, 2017-05-18T00:00)
Set(3, A=8, 2017-05-19T00:00)
Time quantum field diagram
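To make the effect of a time quantum concrete, the sketch below derives one view per quantum unit from a timestamp. The view naming scheme (`standard_2017`, `standard_201705`, and so on) and the function `views_by_time` are assumptions made for illustration; views are managed internally and are not part of the public API.

```python
from datetime import datetime
from typing import List

# strftime pattern for each quantum unit (assumed naming scheme).
_FORMATS = {"Y": "%Y", "M": "%Y%m", "D": "%Y%m%d", "H": "%Y%m%d%H"}


def views_by_time(name: str, t: datetime, quantum: str) -> List[str]:
    """Return one view name per unit in the quantum, coarsest first."""
    return [f"{name}_{t.strftime(_FORMATS[unit])}" for unit in quantum]


first = views_by_time("standard", datetime(2017, 5, 18), "YMD")
second = views_by_time("standard", datetime(2017, 5, 19), "YMD")

print(first)   # ['standard_2017', 'standard_201705', 'standard_20170518']
print(second)  # ['standard_2017', 'standard_201705', 'standard_20170519']
```

Under this scheme, the two Set() queries above share the year and month views and differ only in the day view, which is what lets a Range query narrow down to a single day.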
Bit-Sliced Indexing (BSI) is the storage method Pilosa uses to represent multi-bit integers in a bitmap index. Integers are stored as n-bit, range-encoded bit-sliced indexes of base-2, along with an additional row indicating “not null”. This means that a 16-bit integer will require 17 rows: one for each 0-bit of the 16 bit-slice components (the 1-bit does not need to be stored because with range-encoding the highest bit position is always 1) and one for the non-null row. Pilosa can evaluate Sum queries on these BSI integers. The result of a Sum query includes a count, which can be used to compute an average with no other overhead.
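Computing an average from a Sum result is a single division. The sketch below assumes the result is a JSON object with the keys `value` (the sum) and `count`; those key names, the sample payload, and the function `average_from_sum` are assumptions for illustration.

```python
import json


def average_from_sum(result_json: str) -> float:
    """Compute the mean from a Sum result holding a sum and a count."""
    result = json.loads(result_json)
    if result["count"] == 0:
        raise ValueError("no non-null values to average")
    return result["value"] / result["count"]


# Hypothetical Sum result: three non-null values totalling 90.
print(average_from_sum('{"value": 90, "count": 3}'))  # 30.0
```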
Internally Pilosa stores each BSI field as a view. The rows of the view contain the base-2 representations of the integer values. Pilosa manages the base-2 offset and translation that efficiently packs the integer value within the minimum set of rows.
For example, the following Set() queries executed against BSI fields will result in the data described in the diagram below:
Set(1, A=1)
Set(2, A=2)
Set(3, A=3)
Set(4, A=7)
Set(2, B=1)
Set(3, B=6)
BSI field diagram
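The example above can be worked through with a simplified bit-sliced layout. Note that this sketch uses plain base-2 bit slices plus a not-null row, not the range-encoded variant described earlier, and the function names `bsi_rows` and `bsi_sum` are hypothetical. It shows how a sum and a count fall out of per-row bit counts without reading individual values back.

```python
from typing import Dict, List, Set, Tuple


def bsi_rows(values: Dict[int, int], bit_depth: int) -> Tuple[List[Set[int]], Set[int]]:
    """Split column -> integer values into one row per bit, plus a not-null row."""
    slices: List[Set[int]] = [set() for _ in range(bit_depth)]
    not_null: Set[int] = set()
    for column, value in values.items():
        if not 0 <= value < (1 << bit_depth):
            raise ValueError(f"value {value} does not fit in {bit_depth} bits")
        not_null.add(column)
        for i in range(bit_depth):
            if (value >> i) & 1:
                slices[i].add(column)  # bit i of this column's value is 1
    return slices, not_null


def bsi_sum(slices: List[Set[int]], not_null: Set[int]) -> Tuple[int, int]:
    """Return (sum, count): each set bit in slice i contributes 2**i."""
    total = sum(len(columns) << i for i, columns in enumerate(slices))
    return total, len(not_null)


field_a = {1: 1, 2: 2, 3: 3, 4: 7}  # Set(1, A=1) ... Set(4, A=7)
field_b = {2: 1, 3: 6}              # Set(2, B=1), Set(3, B=6)

slices_a, not_null_a = bsi_rows(field_a, bit_depth=3)
slices_b, not_null_b = bsi_rows(field_b, bit_depth=3)

print(slices_a)                       # [{1, 3, 4}, {2, 3, 4}, {4}]
print(bsi_sum(slices_a, not_null_a))  # (13, 4)
print(bsi_sum(slices_b, not_null_b))  # (7, 2)
```

For field A, the three bit rows hold 3, 3, and 1 set bits, giving 3·1 + 3·2 + 1·4 = 13, which matches 1 + 2 + 3 + 7.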
Check out this blog post for some more details about BSI in Pilosa.