You are viewing the documentation for Pilosa v0.10.

Tutorials

Some of our tutorials work better as standalone repos, since you can git clone the instructions, code, and data all at once. Officially supported tutorials are listed here.

Setting Up a Secure Cluster

Introduction

Pilosa supports encrypting communication between the nodes of a cluster, and between clients and nodes, using TLS. In this tutorial, we will set up a three-node Pilosa cluster running on a single computer. The same steps can be used for a multi-computer cluster, but that requires setting up firewalls and other platform-specific configuration, which is outside the scope of this tutorial.

This tutorial assumes that you are using a UNIX-like system, such as Linux or MacOS. Windows Subsystem for Linux (WSL) works equally well on Windows 10 systems.

Installing Pilosa and Creating the Directory Structure

If you haven’t already done so, install Pilosa server on your computer. For Linux and WSL (Windows Subsystem for Linux), use the Installing on Linux instructions. For MacOS, use the Installing on MacOS instructions. We do not provide precompiled releases for other platforms, but you can always compile Pilosa yourself from source. See Build from Source.

After installing Pilosa, you may have to add it to your $PATH. Check that you can run Pilosa from the command line:

pilosa --help

Let’s create a directory for the tutorial to put all of our files and switch to that directory:

mkdir $HOME/pilosa-tls-tutorial && cd $_

Creating the TLS Certificate and Gossip Key

Securing a Pilosa cluster consists of securing the communication between nodes using TLS and Gossip encryption. Pilosa Enterprise additionally supports authentication and other security features, but those are not covered in this tutorial.

The first step is acquiring an SSL certificate. You can buy a commercial certificate or obtain a Let’s Encrypt certificate, but we will use a self-signed certificate for practical reasons. Using self-signed certificates is not recommended in production, since it makes man-in-the-middle attacks easy.

The following command creates a 2048-bit self-signed wildcard certificate for *.pilosa.local which expires in 10 years.

openssl req -x509 -newkey rsa:2048 -keyout pilosa.local.key -out pilosa.local.crt -days 3650 -nodes -subj "/C=US/ST=Texas/L=Austin/O=Pilosa/OU=Com/CN=*.pilosa.local"

The command above creates two files in the current directory: pilosa.local.key (the private key) and pilosa.local.crt (the certificate).

Having created the SSL certificate, we can now create the gossip encryption key. The gossip encryption key file must be exactly 16, 24, or 32 bytes long, which selects AES-128, AES-192, or AES-256 encryption respectively. Reading random bytes from the cryptographically secure /dev/random serves our purpose well:

head -c 32 /dev/random > pilosa.local.gossip32

We now should have pilosa.local.gossip32 in the current directory with 32 random bytes.
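
Because the key length decides which AES variant is used, it can be worth checking the file size before starting the cluster. The following is a minimal sketch; gossip_key_cipher is a hypothetical helper name, not part of Pilosa, and the demonstration uses a throwaway key read from /dev/urandom rather than the tutorial's real key file.

```shell
# Hypothetical helper: map a gossip key file's size to the AES variant
# that the tutorial says the size selects.
gossip_key_cipher() {
  # wc -c prints the byte count; tr strips the padding some platforms add
  size=$(wc -c < "$1" | tr -d ' ')
  case "$size" in
    16) echo "AES-128" ;;
    24) echo "AES-192" ;;
    32) echo "AES-256" ;;
    *)  echo "invalid key size: $size bytes" >&2; return 1 ;;
  esac
}

# Demonstration on a throwaway key (not the tutorial's real key file)
demo_key=$(mktemp)
head -c 32 /dev/urandom > "$demo_key"
gossip_key_cipher "$demo_key"   # prints AES-256
rm -f "$demo_key"
```

To check the key created above, run gossip_key_cipher pilosa.local.gossip32 in the tutorial directory.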

Creating the Configuration Files

Pilosa supports passing configuration items using the command line, environment variables or a configuration file. We will use the last option in this tutorial and create three configuration files for our three nodes.

One of the nodes in the cluster must be chosen as the coordinator. We choose the first node as the coordinator in this tutorial. The coordinator is only important during cluster resizing operations, and otherwise acts like any other node in the cluster. In the future, the coordinator will be chosen transparently by distributed consensus, and this option will be deprecated.

Create node1.config.toml in the project directory and paste the following in it:

# node1.config.toml

data-dir = "node1_data"
bind = "https://01.pilosa.local:10501"

[cluster]
coordinator = true

[tls]
certificate = "pilosa.local.crt"
key = "pilosa.local.key"
skip-verify = true

[gossip]
seeds = ["01.pilosa.local:15000"]
port = 15000
key = "pilosa.local.gossip32"

Create node2.config.toml in the project directory and paste the following in it:

# node2.config.toml

data-dir = "node2_data"
bind = "https://02.pilosa.local:10502"

[tls]
certificate = "pilosa.local.crt"
key = "pilosa.local.key"
skip-verify = true

[gossip]
seeds = ["01.pilosa.local:15000"]
port = 16000
key = "pilosa.local.gossip32"

Create node3.config.toml in the project directory and paste the following in it:

# node3.config.toml

data-dir = "node3_data"
bind = "https://03.pilosa.local:10503"

[tls]
certificate = "pilosa.local.crt"
key = "pilosa.local.key"
skip-verify = true

[gossip]
seeds = ["01.pilosa.local:15000"]
port = 17000
key = "pilosa.local.gossip32"
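
The three files differ only in the node number, the ports, and the coordinator setting, so they could also be generated by a small script instead of being pasted by hand. This is a sketch under that assumption; write_node_config is a hypothetical helper, and the demonstration writes into a scratch directory so that existing files are not overwritten.

```shell
# Hypothetical helper: write nodeN.config.toml (N = 1, 2, or 3) into a directory.
write_node_config() {
  n=$1
  dir=${2:-.}
  file="$dir/node$n.config.toml"
  {
    printf '# node%s.config.toml\n\n' "$n"
    printf 'data-dir = "node%s_data"\n' "$n"
    printf 'bind = "https://0%s.pilosa.local:1050%s"\n\n' "$n" "$n"
    # Only the first node is marked as the coordinator
    if [ "$n" -eq 1 ]; then
      printf '[cluster]\ncoordinator = true\n\n'
    fi
    printf '[tls]\ncertificate = "pilosa.local.crt"\nkey = "pilosa.local.key"\nskip-verify = true\n\n'
    printf '[gossip]\nseeds = ["01.pilosa.local:15000"]\n'
    # Gossip ports follow the tutorial's pattern: 15000, 16000, 17000
    printf 'port = %s\n' $((14000 + n * 1000))
    printf 'key = "pilosa.local.gossip32"\n'
  } > "$file"
}

# Demonstration in a scratch directory
out_dir=$(mktemp -d)
for n in 1 2 3; do
  write_node_config "$n" "$out_dir"
done
grep -h '^port' "$out_dir"/node*.config.toml
```

Running the loop with the project directory as the second argument would produce the same three files shown above.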

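Here is some explanation of the configuration items. data-dir is the directory where the node stores its data, and bind is the address the node listens on; the https scheme enables TLS. Setting coordinator = true in the [cluster] section marks the node as the cluster coordinator. In the [tls] section, certificate and key are the paths to the certificate and private key files created earlier, and skip-verify = true turns off certificate verification, which is needed because the certificate is self-signed. In the [gossip] section, seeds lists the gossip address of the node used to join the cluster, port is the gossip port of this node, and key is the path to the gossip encryption key file.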

Final Touches Before Running the Cluster

Before running the cluster, let’s make sure that 01.pilosa.local, 02.pilosa.local and 03.pilosa.local resolve to an IP address. If you are running the cluster on your computer, it is adequate to add them to your /etc/hosts. Below is one of the many ways of doing that (mind the >>):

sudo sh -c 'printf "\n127.0.0.1 01.pilosa.local 02.pilosa.local 03.pilosa.local\n" >> /etc/hosts'

Ensure we can access the hosts in the cluster:

ping -c 1 01.pilosa.local
ping -c 1 02.pilosa.local
ping -c 1 03.pilosa.local

If any of the commands above return ping: unknown host, make sure your /etc/hosts contains the failed hostname.
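
If ping is unavailable or blocked, the hosts file can also be inspected directly. The sketch below lists the hostnames that have no entry in a given hosts file; missing_hosts is a hypothetical helper, and the demonstration runs against a sample file rather than the real /etc/hosts.

```shell
# Hypothetical helper: print each given hostname that has no entry
# in the given hosts file.
missing_hosts() {
  hosts_file=$1
  shift
  for name in "$@"; do
    awk -v n="$name" '
      { sub(/#.*/, "") }                                    # ignore comments
      { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }  # field 1 is the IP
      END { exit !found }
    ' "$hosts_file" || echo "$name"
  done
}

# Demonstration against a sample file
sample=$(mktemp)
printf '127.0.0.1 localhost\n127.0.0.1 01.pilosa.local 02.pilosa.local\n' > "$sample"
missing_hosts "$sample" 01.pilosa.local 02.pilosa.local 03.pilosa.local   # prints 03.pilosa.local
rm -f "$sample"
```

Passing /etc/hosts as the first argument checks the real file; no output means all three names are present.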

Running the Cluster

Let’s open three terminal windows and run each node in its window. This will enable us to better observe what’s happening on which node.

Switch to the first terminal window, change to the project directory and start the first node:

cd $HOME/pilosa-tls-tutorial
pilosa server -c node1.config.toml

Switch to the second terminal window, change to the project directory and start the second node:

cd $HOME/pilosa-tls-tutorial
pilosa server -c node2.config.toml

Switch to the third terminal window, change to the project directory and start the third node:

cd $HOME/pilosa-tls-tutorial
pilosa server -c node3.config.toml

Let’s ensure that all three Pilosa servers are running and they are connected:

curl -k --ipv4 https://01.pilosa.local:10501/status

The -k flag tells curl not to verify the certificate the server provides, and --ipv4 works around an issue on MacOS where curl requests take a long time if the address resolves to 127.0.0.1. You can leave --ipv4 out on Linux and WSL.

All nodes should be in the NORMAL state:

{"state":"NORMAL","nodes":[{"id":"98ebd177-c082-4c54-8d48-7e7c75857b52","uri":{"scheme":"https","host":"02.pilosa.local","port":10502},"isCoordinator":false},{"id":"a33dc0d6-c35f-4559-984a-e582bf032a21","uri":{"scheme":"https","host":"03.pilosa.local","port":10503},"isCoordinator":false},{"id":"e24ac014-ee2f-4cb0-b565-74df6c551f0a","uri":{"scheme":"https","host":"01.pilosa.local","port":10501},"isCoordinator":true}]}
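
The status output is dense, so a short summary can be easier to read. The sketch below prints the cluster state and the number of nodes using only sed and grep; cluster_summary is a hypothetical helper, and it assumes the response has the shape shown above, with one isCoordinator key per node.

```shell
# Hypothetical helper: read a /status response on stdin and print
# "<state> <node count>".
cluster_summary() {
  status=$(cat)
  state=$(printf '%s\n' "$status" | sed -E 's/.*"state":"([^"]*)".*/\1/')
  # Each node object carries exactly one "isCoordinator" key in the sample output
  nodes=$(printf '%s\n' "$status" | grep -o '"isCoordinator"' | wc -l | tr -d ' ')
  echo "$state $nodes"
}

# Demonstration on a shortened copy of the sample response
echo '{"state":"NORMAL","nodes":[{"id":"a","isCoordinator":false},{"id":"b","isCoordinator":false},{"id":"c","isCoordinator":true}]}' |
  cluster_summary   # prints NORMAL 3
```

Against the running cluster, the curl command above can be piped into cluster_summary after adding -s to silence the progress meter.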

Running Queries

Having confirmed that our cluster is running OK, let’s run a few queries. But before that, we need to create an index and a frame:

curl -k --ipv4 https://01.pilosa.local:10501/index/sample-index -d ''
{}

This will create index sample-index with default options. Let’s create the frame now:

curl -k --ipv4 https://01.pilosa.local:10501/index/sample-index/frame/sample-frame -d ''
{}

We just created frame sample-frame with default options.

Let’s run a SetBit query:

curl -k --ipv4 https://01.pilosa.local:10501/index/sample-index/query -d 'SetBit(frame="sample-frame", row=1, col=100)'
{"results":[true]}

Confirm that the bit was indeed set:

curl -k --ipv4 https://01.pilosa.local:10501/index/sample-index/query -d 'Bitmap(frame="sample-frame", row=1)'
{"results":[{"attrs":{},"bits":[100]}]}

The same response should be returned when querying other nodes in the cluster:

curl -k --ipv4 https://02.pilosa.local:10502/index/sample-index/query -d 'Bitmap(frame="sample-frame", row=1)'
{"results":[{"attrs":{},"bits":[100]}]}
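
To compare the responses of all nodes at once, they can be collected one per line and checked for equality. This is a sketch only; responses_agree is a hypothetical helper, and the demonstration uses two copies of the sample response instead of a live cluster.

```shell
# Hypothetical helper: succeed when every line read from stdin is identical.
responses_agree() {
  # sort -u collapses identical lines; exactly one remaining line means agreement
  [ "$(sort -u | wc -l | tr -d ' ')" -eq 1 ]
}

# Demonstration with two copies of the sample response
printf '%s\n' \
  '{"results":[{"attrs":{},"bits":[100]}]}' \
  '{"results":[{"attrs":{},"bits":[100]}]}' |
  responses_agree && echo "nodes agree"

# Against the running cluster, a loop like this should feed it one line per node:
#   for n in 1 2 3; do
#     echo "$(curl -sk --ipv4 "https://0$n.pilosa.local:1050$n/index/sample-index/query" \
#       -d 'Bitmap(frame="sample-frame", row=1)')"
#   done | responses_agree && echo "nodes agree"
```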

What’s Next?

Check out our Administration Guide to learn more about making the most of your Pilosa cluster and Configuration Documentation to see the available options to configure Pilosa.

Setting Up a Docker Cluster

In this tutorial, we will be setting up a 2-node Pilosa cluster using Docker containers.

Running a Docker Cluster on a Single Server

The instructions below require Docker 1.13 or later.

Let’s first be sure that the Pilosa image is up to date:

docker pull pilosa/pilosa:latest

Then, create a virtual network to attach our containers. We are going to name our network pilosanet:

docker network create pilosanet

Let’s run the first Pilosa node and attach it to that virtual network. We set the first node as the cluster coordinator, use its address as the gossip seed, and set the server address to pilosa1:

docker run -it --rm --name pilosa1 -p 10101:10101 --network=pilosanet pilosa/pilosa:latest server --bind pilosa1 --cluster.coordinator=true --gossip.seeds=pilosa1:14000

Let’s run the second Pilosa node and attach it to the virtual network as well. Note that we set the address of the gossip seed to the address of the first node, and that we map the container’s port 10101 to port 10102 on the host so that the second node can be queried directly:

docker run -it --rm --name pilosa2 -p 10102:10101 --network=pilosanet pilosa/pilosa:latest server --bind pilosa2 --gossip.seeds=pilosa1:14000

Let’s test that the nodes in the cluster connected with each other:

curl localhost:10101/status
{"state":"NORMAL","nodes":[{"id":"2e8332d0-1fee-44dd-a359-e0d6ecbcefc1","uri":{"scheme":"http","host":"pilosa1","port":10101},"isCoordinator":true},{"id":"8c0dbcdc-9503-4265-8ad2-ba85a4bb10fa","uri":{"scheme":"http","host":"pilosa2","port":10101},"isCoordinator":false}]}

And similarly for the second node:

curl localhost:10102/status
{"state":"NORMAL","nodes":[{"id":"2e8332d0-1fee-44dd-a359-e0d6ecbcefc1","uri":{"scheme":"http","host":"pilosa1","port":10101},"isCoordinator":true},{"id":"8c0dbcdc-9503-4265-8ad2-ba85a4bb10fa","uri":{"scheme":"http","host":"pilosa2","port":10101},"isCoordinator":false}]}

The corresponding Docker Compose file is below:

version: '2'
services: 
  pilosa1:
    image: pilosa/pilosa:latest
    ports:
      - "10101:10101"
    environment:
      - PILOSA_CLUSTER_COORDINATOR=true
      - PILOSA_GOSSIP_SEEDS=pilosa1:14000
    networks:
      - pilosanet
    entrypoint:
      - /pilosa
      - server
      - --bind
      - "pilosa1:10101"
  pilosa2:
    image: pilosa/pilosa:latest
    environment:
      - PILOSA_GOSSIP_SEEDS=pilosa1:14000
    networks:
      - pilosanet
    entrypoint:
      - /pilosa
      - server
      - --bind
      - "pilosa2:10101"
networks: 
  pilosanet:

Running a Docker Swarm

It is very easy to run a Pilosa cluster on different servers using Docker Swarm mode. All we have to do is create an overlay network instead of the bridge network.

The instructions in this section require Docker 17.06 or later. Although it is possible to run a Docker swarm on MacOS or Windows, it is easiest to run it on Linux. So we assume you are trying these instructions on Linux, probably in the cloud.

We are going to use two servers: the manager node runs in the first server and a worker node in the second server.

Docker nodes require some ports to be accessible from outside. Before carrying on, make sure the following ports are open on all nodes: TCP/2377, TCP/7946, UDP/7946, UDP/4789.

Let’s initialize the swarm first. Run the following on the manager:

docker swarm init --advertise-addr=IP-ADDRESS

Virtual machines running on the cloud usually have at least two network interfaces: the external interface and the internal interface. Use the IP of the external interface.

The output of the command above should be similar to:

To add a worker to this swarm, run the following command:

    docker swarm join --token SOME-TOKEN MANAGER-IP-ADDRESS:2377

Let’s make the worker node join the swarm. Copy and paste the command above into a shell on the worker, replacing the token and IP address with the correct values. You may need to add the --advertise-addr=WORKER-EXTERNAL-IP-ADDRESS parameter if the worker has more than one network interface:

docker swarm join --token SOME-TOKEN MANAGER-IP-ADDRESS:2377

Run the following on the manager to check that the worker joined the swarm:

docker node ls

Which should output:

ID            HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
MANAGER-ID *  swarm1    Ready   Active        Leader          18.05.0-ce
WORKER-ID     swarm2    Ready   Active                        18.05.0-ce
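
When more workers are added, counting the rows by eye becomes tedious. The sketch below counts the nodes whose status reads Ready in a listing like the one above; count_ready_nodes is a hypothetical helper that reads the table from stdin, and the demonstration uses the sample rows rather than a live swarm.

```shell
# Hypothetical helper: count the rows of a `docker node ls` table
# whose STATUS column reads Ready.
count_ready_nodes() {
  # Skip the header line, then count rows containing Ready as a whole word
  awk 'NR > 1 && /(^|[[:space:]])Ready([[:space:]]|$)/ { n++ } END { print n + 0 }'
}

# Demonstration on the sample listing
printf '%s\n' \
  'ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION' \
  'MANAGER-ID * swarm1 Ready Active Leader 18.05.0-ce' \
  'WORKER-ID swarm2 Ready Active 18.05.0-ce' |
  count_ready_nodes   # prints 2
```

On the manager, the real listing can be piped in with docker node ls | count_ready_nodes.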

If you have created the pilosanet network before, delete it before carrying on, otherwise skip to the next step:

docker network rm pilosanet

Let’s create the pilosanet network, but with overlay type this time. We should also make this network attachable in order to be able to attach containers to it. Run the following on the manager:

docker network create -d overlay pilosanet --attachable

We can now create the Pilosa containers. Let’s start the coordinator node first. Run the following on one of the servers:

docker run -it --rm --name pilosa1 --network=pilosanet pilosa/pilosa:latest server --bind pilosa1 --cluster.coordinator=true --gossip.seeds=pilosa1:14000

And the following on the other server:

docker run -it --rm --name pilosa2 --network=pilosanet pilosa/pilosa:latest server --bind pilosa2 --gossip.seeds=pilosa1:14000

These were the same commands we used in the previous section except the port mapping! Let’s run another container on the same virtual network to read the status from the coordinator:

docker run -it --rm --network=pilosanet --name shell alpine wget -q -O- pilosa1:10101/status
{"state":"NORMAL","nodes":[{"id":"3e3b0abd-1945-441a-a01f-5a28272972f5","uri":{"scheme":"http","host":"pilosa1","port":10101},"isCoordinator":true},{"id":"71ed27cc-9443-4f41-88fb-1c22f92bf695","uri":{"scheme":"http","host":"pilosa2","port":10101},"isCoordinator":false}]}

You can add as many worker nodes as you need to both the swarm and the Pilosa cluster using the steps above.

What’s Next?

Check out our Administration Guide to learn more about making the most of your Pilosa cluster and Configuration Documentation to see the available options to configure Pilosa.

Refer to the Docker documentation to see your options for running Docker containers. Networking with overlay networks gives a detailed overview of Docker swarm mode and overlay networks.

Using Integer Field Values

Introduction

Pilosa can store integer values associated with the columns in an index, and those values are used to support Range, Min, Max, and Sum queries. In this tutorial we will show how to set up integer fields, populate those fields with data, and query the fields. The example index we’re going to create will represent fictional patients at a medical facility and various bits of information about those patients.

First, create an index called patients:

curl localhost:10101/index/patients \
     -X POST 
{}

In addition to storing rows of bits, a frame can also contain fields that store integer values. The next step creates three fields (age, weight, tcells) in the measurements frame.

curl localhost:10101/index/patients/frame/measurements \
     -X POST \
     -d '{"options":{
              "fields": [
                  {"name": "age", "type": "int", "min": 0, "max": 120},
                  {"name": "weight", "type": "int", "min": 0, "max": 500},
                  {"name": "tcells", "type": "int", "min": 0, "max": 2000}
              ]
         }}'
{}

If you need to, you can add fields to an existing frame by posting to the Create Field endpoint.

Next, let’s populate our fields with data. There are two ways to get data into fields: use the SetFieldValue() PQL function to set fields individually, or use the pilosa import command to import many values at once. First, let’s set some field data using PQL.

This query sets the age, weight, and t-cell count for the patient with ID 1 in our system:

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'SetFieldValue(col=1, frame="measurements", age=34, weight=128, tcells=1145)'
{"results":[null]}

In the case where we need to load a lot of data at once, we can use the pilosa import command. This method lets us import data into Pilosa from a CSV file.

Assuming we have a file called ages.csv that is structured like this:

1,34
2,57
3,19
4,40
5,32
6,71
7,28
8,33
9,63

where the first column of the CSV represents the patient ID and the second column represents the patient’s age, then we can import the data into our age field by running this command:

pilosa import -i patients -f measurements --field age ages.csv
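
Before importing a larger file, it can help to confirm that every line has the expected shape. The sketch below checks that each line holds exactly two non-negative integers and that the age does not exceed 120, the max declared for the age field; validate_ages_csv is a hypothetical helper, and how pilosa import itself treats malformed or out-of-range lines is not covered here.

```shell
# Hypothetical helper: report lines that are not "<id>,<age>" with an
# integer id and an integer age between 0 and 120.
validate_ages_csv() {
  awk -F, '
    NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $2 + 0 > 120 {
      printf "line %d is invalid: %s\n", NR, $0
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Demonstration on a scratch copy of the sample data
csv=$(mktemp)
printf '%s\n' 1,34 2,57 3,19 4,40 5,32 6,71 7,28 8,33 9,63 > "$csv"
validate_ages_csv "$csv" && echo "all lines look valid"
rm -f "$csv"
```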

Now that we have some data in our index, let’s run a few queries to demonstrate how to use that data.

To find all patients over the age of 40, run a Range query against the age field.

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Range(frame="measurements", age > 40)'
{"results":[{"attrs":{},"bits":[2,6,9]}]}

You can find a list of supported range operators in the Range Query documentation.

To find the average age of all patients, run a Sum query:

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Sum(frame="measurements", field="age")'
{"results":[{"sum":377,"count":9}]}

The results you get from the Sum query contain the sum of all values as well as the count of columns with a value. To get the average you can just divide sum by count.
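
The division can be done on the command line. This is a minimal sketch that extracts sum and count from a response of the shape shown above using sed and awk; average_from_sum is a hypothetical helper name.

```shell
# Hypothetical helper: read a Sum() response on stdin and print the average
# rounded to two decimal places.
average_from_sum() {
  sed -E 's/.*"sum":(-?[0-9]+),"count":([0-9]+).*/\1 \2/' |
    awk '{ if ($2 > 0) printf "%.2f\n", $1 / $2; else print "n/a" }'
}

echo '{"results":[{"sum":377,"count":9}]}' | average_from_sum   # prints 41.89
```

The helper prints n/a when count is zero, so an empty result does not cause a division by zero.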

You can also provide a filter to the Sum() function to find the average age of all patients over 40.

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Sum(Range(frame="measurements", age > 40), frame="measurements", field="age")'
{"results":[{"sum":191,"count":3}]}

Notice in this case that the count is only 3 because of the age > 40 filter applied to the query.

To find the minimum age of all patients, run a Min query:

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Min(frame="measurements", field="age")'
{"results":[{"min":19,"count":1}]}

The results you get from the Min query contain the min of all values as well as the count of columns with that value.

You can also provide a filter to the Min() function to find the minimum age of all patients over 40.

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Min(Range(frame="measurements", age > 40), frame="measurements", field="age")'
{"results":[{"min":57,"count":1}]}

To find the maximum age of all patients, run a Max query:

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Max(frame="measurements", field="age")'
{"results":[{"max":71,"count":1}]}

The results you get from the Max query contain the max of all values as well as the count of columns with that value.

You can also provide a filter to the Max() function to find the maximum age of all patients under 40.

curl localhost:10101/index/patients/query \
     -X POST \
     -d 'Max(Range(frame="measurements", age < 40), frame="measurements", field="age")'
{"results":[{"max":34,"count":1}]}
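
The unfiltered results above can be cross-checked against the source data without querying Pilosa. The sketch below computes the same sum, count, min, and max directly from a CSV in the ages.csv format; age_stats is a hypothetical helper, and the demonstration uses a scratch copy of the sample data.

```shell
# Hypothetical helper: print sum, count, min, and max of the second CSV column.
age_stats() {
  awk -F, '
    NR == 1 { min = max = $2 + 0 }
    {
      age = $2 + 0
      sum += age
      if (age < min) min = age
      if (age > max) max = age
    }
    END { printf "sum=%d count=%d min=%d max=%d\n", sum, NR, min, max }
  ' "$1"
}

# Demonstration on a scratch copy of the sample data
csv=$(mktemp)
printf '%s\n' 1,34 2,57 3,19 4,40 5,32 6,71 7,28 8,33 9,63 > "$csv"
age_stats "$csv"   # prints sum=377 count=9 min=19 max=71
rm -f "$csv"
```

To mirror a filtered query, filter the file first, for example with awk -F, '$2 > 40' ages.csv > over40.csv, and run age_stats on the result.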

Storing Row and Column Attributes

Introduction

Pilosa can store arbitrary values associated with any row or column. In Pilosa, these are referred to as attributes, and they can be of type string, integer, boolean, or float. In this tutorial we will store some attribute data and then run some queries that return that data.

First, create an index called books to use for this tutorial:

curl localhost:10101/index/books \
     -X POST
{}

Next, create a frame in the books index called members which will represent library members who have read books.

curl localhost:10101/index/books/frame/members \
     -X POST \
     -d '{}'
{}

Now, let’s add some books to our index.

curl localhost:10101/index/books/query \
     -X POST \
     -d 'SetColumnAttrs(col=1, name="To Kill a Mockingbird", year=1960)
         SetColumnAttrs(col=2, name="No Name in the Street", year=1972)
         SetColumnAttrs(col=3, name="The Tipping Point", year=2000)
         SetColumnAttrs(col=4, name="Out Stealing Horses", year=2003)
         SetColumnAttrs(col=5, name="The Forever War", year=2008)'
{"results":[null,null,null,null,null]}

And add some members.

curl localhost:10101/index/books/query \
     -X POST \
     -d 'SetRowAttrs(frame="members", row=10001, fullName="John Smith")
         SetRowAttrs(frame="members", row=10002, fullName="Sue Perkins")
         SetRowAttrs(frame="members", row=10003, fullName="Jennifer Hawks")
         SetRowAttrs(frame="members", row=10004, fullName="Pedro Vazquez")
         SetRowAttrs(frame="members", row=10005, fullName="Pat Washington")'
{"results":[null,null,null,null,null]}

At this point we can query one of the member records by querying that row.

curl localhost:10101/index/books/query \
     -X POST \
     -d 'Bitmap(frame="members", row=10002)'
{"results":[{"attrs":{"fullName":"Sue Perkins"},"bits":[]}]}

Now let’s add some data to the matrix such that each pair represents a member who has read that book.

curl localhost:10101/index/books/query \
     -X POST \
     -d 'SetBit(frame="members", row=10001, col=3)
         SetBit(frame="members", row=10001, col=5)
         SetBit(frame="members", row=10002, col=1)
         SetBit(frame="members", row=10002, col=2)
         SetBit(frame="members", row=10002, col=4)
         SetBit(frame="members", row=10003, col=3)
         SetBit(frame="members", row=10004, col=4)
         SetBit(frame="members", row=10004, col=5)
         SetBit(frame="members", row=10005, col=1)
         SetBit(frame="members", row=10005, col=2)
         SetBit(frame="members", row=10005, col=3)
         SetBit(frame="members", row=10005, col=4)
         SetBit(frame="members", row=10005, col=5)'
{"results":[true,true,true,true,true,true,true,true,true,true,true,true,true]}

Now pull the record for Sue Perkins again.

curl localhost:10101/index/books/query \
     -X POST \
     -d 'Bitmap(frame="members", row=10002)'
{"results":[{"attrs":{"fullName":"Sue Perkins"},"bits":[1,2,4]}]}

Notice that the result set now contains a list of integers in the bits attribute. These integers match the column IDs of the books that Sue has read.

In order to retrieve the attribute information that we stored for each book, we need to add a URL parameter columnAttrs=true to the query.

curl localhost:10101/index/books/query?columnAttrs=true \
     -X POST \
     -d 'Bitmap(frame="members", row=10002)'
{
  "results":[{"attrs":{"fullName":"Sue Perkins"},"bits":[1,2,4]}],
  "columnAttrs":[
    {"id":1,"attrs":{"name":"To Kill a Mockingbird","year":1960}},
    {"id":2,"attrs":{"name":"No Name in the Street","year":1972}},
    {"id":4,"attrs":{"name":"Out Stealing Horses","year":2003}}
  ]
}

The book attributes are included in the result set at the columnAttrs attribute.

Finally, if we want to find out which books were read by both Sue and Pedro, we just perform an Intersect query on those two members:

curl localhost:10101/index/books/query?columnAttrs=true \
     -X POST \
     -d 'Intersect(Bitmap(frame="members", row=10002), Bitmap(frame="members", row=10004))'
{
  "results":[{"attrs":{},"bits":[4]}],
  "columnAttrs":[
    {"id":4,"attrs":{"name":"Out Stealing Horses","year":2003}}
  ]
}

Notice that we don’t get row attributes on a complex query, but we still get the column attributes—in this case book information.
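
To see what Intersect computes, the same result can be reproduced locally from the bits lists of the two rows. This is an illustration only, not how Pilosa evaluates the query; extract_bits and intersect_bits are hypothetical helpers, and Pedro's list (4,5) is taken from the SetBit calls above.

```shell
# Hypothetical helper: pull the comma-separated column IDs out of a
# single-result query response read from stdin.
extract_bits() {
  sed -E 's/.*"bits":\[([^]]*)\].*/\1/'
}

# Hypothetical helper: print the IDs present in both comma-separated lists,
# in the order they appear in the second list.
intersect_bits() {
  printf '%s\n%s\n' "$1" "$2" | awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) seen[$i] = 1; next }
    {
      out = ""
      for (i = 1; i <= NF; i++)
        if ($i in seen) out = out (out == "" ? "" : ",") $i
      print out
    }'
}

sue=$(echo '{"results":[{"attrs":{"fullName":"Sue Perkins"},"bits":[1,2,4]}]}' | extract_bits)
intersect_bits "$sue" "4,5"   # prints 4
```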

