Administration Guide

Installing in production

Hardware

Pilosa is a standalone, compiled Go application, so there is no need to worry about running or configuring a Java VM. Pilosa can run on very small machines and works well with even a medium-sized dataset on a personal laptop. If you are reading this section, you are likely ready to deploy a cluster of Pilosa servers to handle very large datasets or high-velocity data. These are guidelines for running a cluster; specific needs may differ.

Memory

Pilosa holds all row/column bitmap data in main memory. While this data is compressed more than in a typical database, available memory is a primary concern. In a production environment, we recommend hardware with a large amount of memory (at least 64GB). Prefer a small number of hosts with lots of memory each over a larger number with less memory; larger clusters tend to be less efficient overall due to increased inter-node communication.

CPUs

Pilosa is a concurrent application written in Go and can take full advantage of multicore machines. The main unit of parallelism is the slice, so a single query will only use as many cores as there are slices stored on that host. Multiple concurrent queries can still take advantage of additional cores, so tuning in this area depends on the expected workload.

Disk

Even though the main dataset is held in memory, Pilosa writes data to disk frequently. We recommend SSDs, especially if you have a write-heavy application.

Network

Pilosa is designed to be a distributed application with data replicated across the cluster. As such, every write and read needs to communicate with several nodes, so fast internode communication is essential. If you are using a service like AWS, we recommend that all nodes exist in the same region and availability zone. The inherent latency of spreading a Pilosa cluster across physical regions is not usually worth the redundancy protection. Since Pilosa is designed to be an indexing service, there should already be a system of record, or the ability to rebuild a cluster quickly from backups.

Overview

While Pilosa has substantial system requirements, it is not a best practice to set up a cluster with the fewest, largest machines available. You want an evenly distributed load across several nodes so the cluster can easily recover from a single node failure and has the resource capacity to handle a missing node until it is repaired or replaced. Nor is it advisable to use many small machines: internode network traffic will become a bottleneck. You can always add nodes later, but that does require some downtime.

Open File Limits

Pilosa requires a large number of open files to support its memory-mapped file storage system. Most operating systems put limits on the maximum number of files that may be opened concurrently by a process. On Linux systems, this limit is managed by the ulimit shell builtin. Pilosa will automatically attempt to raise the limit to 262144 during startup, but this may fail due to access limitations. If you see errors related to open file limits when starting Pilosa, raise the limit with ulimit -n 262144 in the shell that starts Pilosa; raising the hard limit may require root privileges.
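
For example, these standard shell commands check the current limits and raise the soft limit for the session that will launch Pilosa:

ulimit -Sn       # show the current soft limit on open files
ulimit -Hn       # show the current hard limit on open files
ulimit -n 262144 # raise the limit for this shell, then start Pilosa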

On Mac OS X, ulimit does not behave predictably. This blog post contains information about setting open file limits in OS X.

Importing and Exporting Data

Importing

The import API expects a CSV file of RowID,ColumnID pairs.

When importing large datasets, remember that it is much faster to pre-sort the data by RowID and then by ColumnID, in ascending order. You can use the --sort flag to do that. Also, avoid querying Pilosa until the import is complete; otherwise you may see inconsistent results.

pilosa import --sort -i project -f stargazer project-stargazer.csv
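
For illustration, the first few lines of such a file might look like the following (the IDs are hypothetical):

1,10
1,20
2,10
2,41
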
Importing Field Values

If you are using BSI Range-Encoding field values, you can import field values for a single frame and single field using --field. The CSV file should be in the format ColumnID,Value.

pilosa import -i project -f stargazer --field star_count project-stargazer-counts.csv
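
Again for illustration, a hypothetical input file for the star_count field might look like:

10,5
20,3
41,8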

Note that you must first create a frame with range encoding enabled and a field defined. See Create Frame for more details.

Exporting

Exporting data to CSV can be performed on a live instance of Pilosa. You need to specify the Index, Frame, and View (the default is standard). The API also expects the slice number, but the pilosa export subcommand will export all slices within a frame. The data will be in the CSV format RowID,ColumnID, sorted by ColumnID.

curl "http://localhost:10101/export?index=repository&frame=stargazer&slice=0&view=standard" \
     --header "Accept: text/csv"

Versioning

Pilosa follows Semantic Versioning.

MAJOR.MINOR.PATCH: the MAJOR version increments for incompatible API changes, the MINOR version for functionality added in a backwards-compatible manner, and the PATCH version for backwards-compatible bug fixes.

PQL versioning

The Pilosa server should support PQL versioning using HTTP headers. On each request, the client should send a Content-Type header and an Accept header, and the server should respond with a Content-Type header that matches the client's Accept header. The server should also respond with a Warning header if a PQL version is in a deprecation period, or with an HTTP 400 error if a PQL version is no longer supported.
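
As an illustrative sketch of that header exchange (the media type strings and the example query are assumptions, not confirmed values; consult the API reference for the exact names):

curl "http://localhost:10101/index/project/query" \
     --header "Content-Type: application/vnd.pilosa.pql.v1" \
     --header "Accept: application/vnd.pilosa.pql.v1" \
     --data 'Bitmap(frame="stargazer", rowID=1)'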

Upgrading

When upgrading, upgrade clients first, followed by the server, for all minor- and patch-level changes.

Backup/restore

Pilosa continuously writes the in-memory bitmap data out to disk. This data is organized by Index->Frame->View->Fragment into numbered slice files. These data files can be routinely backed up to restore nodes in a cluster.
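
For example, a routine backup can simply archive each node's data directory. This is a sketch assuming the default data directory of ~/.pilosa; substitute your configured data directory if it differs:

tar czf pilosa-backup.tar.gz -C ~ .pilosa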

Depending on the size of your data, you have two options for restoring a node.

Note: both options will only work when the replication factor is >= 2.

Using Index Sync

For a small dataset, you can rely on the periodic anti-entropy sync process to replicate existing data back to the node.

Copying data files manually

For larger datasets, and to make this process faster, you can copy the relevant data files from the other nodes to the new one before startup.
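
As a sketch of the manual copy, run something like the following on the new node before starting Pilosa, assuming the default data directory of ~/.pilosa on both machines (the hostname healthy-node is hypothetical):

rsync -av healthy-node:.pilosa/ ~/.pilosa/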

Diagnostics

Each Pilosa cluster is configured by default to share anonymous usage details with Pilosa Corp. These metrics allow us to understand how Pilosa is used by the community and to improve the technology to suit your needs. Diagnostics are sent to Pilosa every hour. Each metric is detailed below, along with opt-out instructions.

You can opt out of diagnostics reporting by setting the command-line option --metric.diagnostics=false, the PILOSA_METRIC_DIAGNOSTICS environment variable, or the diagnostics option in the [metric] section of the TOML configuration file.
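
For example, these are three equivalent ways to disable diagnostics (the option names come from above; verify the exact syntax against your version):

pilosa server --metric.diagnostics=false

PILOSA_METRIC_DIAGNOSTICS=false pilosa server

Or in the TOML configuration file:

[metric]
  diagnostics = false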

Metrics

Pilosa can be configured to emit metrics pertaining to its internal processes in one of two formats: Expvar or StatsD. Metric recording is disabled by default and is controlled by the options in the [metric] section of the configuration.
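
As an illustrative sketch of enabling StatsD output (the service and host option names are assumptions based on the [metric] section used above; consult the configuration reference for your version):

[metric]
  service = "statsd"
  host = "127.0.0.1:8125"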

Tags

StatsD tags adhere to the DataDog format (key:value), and we tag the following:

Events

We currently track the following events: