This is a follow-on post to multi-cloud benchmarks.
Previously, we ran a number of Pilosa-specific benchmarks across a variety of host configurations on AWS, Azure, and Oracle cloud. Our broad-strokes conclusions were that AWS was the fastest, Oracle was the most cost effective, and that Azure, while not far behind, didn't stand out. If you work on Azure (or any other cloud, incidentally) and want to dig into this, we'd love to hear from you.
We’ve got a few exciting updates in this second edition:
c5.9xlarge benchmarks on Amazon Linux, to be equitable with the Oracle benchmarks.
If you'll recall, we had a suite of queries that we ran against Pilosa clusters configured in each cloud, as well as a set of microbenchmarks that ran on a single instance of each cluster.
So, without further ado, here is the full set of configurations that we've benchmarked against. These are the aggregate numbers across each cluster, not for a single instance of the given type.
|Cloud|Instance Type|Num|OS|Cost/Hr|CPUs|Mem (GB)|NVME SSDs|
"CPUs" here is the number of logical cores as reported by /proc/cpuinfo; usually that means hyperthreads, though for Azure's F16 it does mean physical cores.
There are a few nuances to note here. Oracle and Amazon provide custom Linux distributions (both based on CentOS), and we've run some of the configurations on both Ubuntu and CentOS. Azure and GCP didn't seem to have hand-curated Linux derivatives, but they did have official Ubuntu images, so we used those.
We created a custom-36-73728 instance on GCP to be equivalent to AWS's c5.9xlarge; we were even able to specify that we want Skylake-class CPUs. The base price for this custom instance is about $1.51/hr, which is almost exactly the same as AWS's c5.9xlarge at $1.53/hr. However, if we run the instance for more than 25% of a month, we start getting discounted automatically. Long story short, if we keep the instance running for a full month, the effective price is $1.06/hr, a 30% discount! We look at both the full price and the discounted price in our cost/performance comparisons; the discounted price appears as a separate instance type in the charts.
Note that while all the providers have some form of “reserved” pricing where you can commit in advance to a year or more of usage for a steep discount, Google is the only one I’m aware of with any kind of totally automatic discounting.
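To make the discount concrete, here is a small sketch of how GCP's sustained-use billing arrives at that effective rate. It assumes the tier structure published at the time (each successive quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate); the function and its name are ours, not part of any GCP API:

```go
package main

import "fmt"

// Sustained-use tiers: each successive quarter of the month is billed
// at a decreasing fraction of the base rate.
var tierRates = []float64{1.00, 0.80, 0.60, 0.40}

// effectiveHourly returns the average hourly price when an instance
// runs for fraction `frac` (0..1) of the month at base price `base`.
func effectiveHourly(base, frac float64) float64 {
	total, used := 0.0, 0.0
	for _, r := range tierRates {
		q := frac - used
		if q > 0.25 {
			q = 0.25
		}
		if q <= 0 {
			break
		}
		total += q * r * base // cost accrued in this tier
		used += q
	}
	return total / used
}

func main() {
	fmt.Printf("full month: $%.2f/hr\n", effectiveHourly(1.51, 1.0)) // ≈ $1.06
	fmt.Printf("half month: $%.2f/hr\n", effectiveHourly(1.51, 0.5)) // ≈ $1.36
}
```

Running the custom instance for a full month averages the four tiers to 70% of base, which is exactly the 30% discount described above.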
Let's look first at IntersectionCount, which is a simple, single-threaded benchmark with no I/O:
Immediately, we can see that Oracle Linux provided a big boost over Ubuntu for the Oracle bare metal instances. BM.HPC2.36 has dethroned c5.9xlarge as the champion of CPU performance, which is pretty surprising, as the AWS instance has a faster processor on paper. Is virtualized vs. bare metal the culprit? Or perhaps differences in the memory subsystems give OCI the edge here.
Now what about basic disk I/O?
Very interesting! The bare metal HPC instance using Oracle Linux with 1 SSD outperforms the 2 SSD VM instances (running Ubuntu) both on Oracle and AWS. The non-SSD Oracle and AWS instances also show marked improvement running their respective official OS images instead of Ubuntu.
Let’s look at the concurrent import benchmark which tests CPU, Memory, and I/O across multiple cores.
So it seems like the AWS c5.9xlarge holds up a little better under these mixed/concurrent conditions. There are quite a few variations on the concurrent import in the raw results, and AWS does quite well in all of them. I suspect that high EBS bandwidth and having multiple SSDs in the r5d case have something to do with this. Possibly Oracle's DenseIO2.16 would have fared a bit better if we'd run it with Oracle Linux.
Let’s look at raw performance for the queries - this time around, I’ve posted all of the charts for your perusal, and I’ll just provide some commentary on a few things:
We see all four clouds making appearances in the top three of these query benchmarks, which have no disk I/O component. AWS wins 6 of 8, and Azure comes in first or second in 6. Oracle has 4 top-three appearances with 1 win, and GCP has two third-place finishes and some very close fourths. To be fair, we've only tested one GCP configuration, so it has fewer chances to win.
And now the cost/performance in dollars per megaquery:
GCP shows its true colors! With that automatic discounting, Google’s cloud is extremely cost effective when you’re running instances the majority of the time for periods of 1 month or more.
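For readers unfamiliar with the metric, dollars per megaquery is just the hourly cluster cost divided by query throughput, scaled to one million queries. A rough sketch (the 2ms latency below is a hypothetical number for illustration, not one of our measured results):

```go
package main

import "fmt"

// dollarsPerMegaquery converts an hourly cluster cost and a mean
// per-query latency into the cost of running one million queries
// back to back on that cluster.
func dollarsPerMegaquery(costPerHour, latencySeconds float64) float64 {
	queriesPerHour := 3600.0 / latencySeconds
	return costPerHour / queriesPerHour * 1e6
}

func main() {
	// c5.9xlarge at $1.53/hr with a hypothetical 2ms query latency:
	fmt.Printf("$%.2f per megaquery\n", dollarsPerMegaquery(1.53, 0.002)) // ≈ $0.85
}
```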
Another thing to note is the huge difference that Oracle Linux makes for the BM.HPC2.36 instance type. In most cases it's way ahead of the version running Ubuntu, except for that one GroupBy query where the Ubuntu version gets third and the Oracle Linux version is way down in 10th. Weird.
On OCI, the VM.Standard2.16 instance type runs on the same hardware as the bare metal Standard2 instances. We'd previously been confused when Pilosa seemed to perform better running in a 3-node cluster of VM.Standard2.16 than running on a single BM.Standard2.52, which has slightly more horsepower in both CPU and memory than all 3 VMs combined. One theory was that the three VMs were allocated on different physical hosts and had access to more memory bandwidth in aggregate than a single BM.Standard2.52. To test this, we ran a large suite of memory bandwidth benchmarks across several configurations using the really excellent bandwidth benchmarking tool by Zack Smith.
We ran these with varying amounts of concurrency by running multiple instances of the bandwidth program in parallel and then summing the results, as directed by the documentation. However, there doesn't seem to be any mechanism for ensuring that the same tests are running simultaneously in each instance. Looking at the output, they seem to stay mostly in sync, but one might take the results at higher concurrency levels with a grain of salt. What follows are charts of a small subset of the results; there are more in the raw results, and way, way more if you run the entire suite yourself (one run is 1,500 tests, and we ran them all at four different concurrency levels):
You can really watch the BM.Standard2.52 pull away from the VM.Standard2.16 at higher concurrencies.
1 MB writes follow the same pattern.
More of the same. There are lots of other combinations, but the story is pretty similar all over.
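The run-in-parallel-and-sum methodology can be sketched in Go, with goroutines standing in for the separate bandwidth processes. This is a simplified analogue that only measures sequential copy throughput (the real tool covers many access patterns), and it has the same caveat: the workers are not strictly synchronized.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const bufSize = 64 << 20 // 64 MB per buffer, large enough to defeat caches

// copyBandwidth measures sequential-copy throughput (GB/s) for one worker.
func copyBandwidth(iters int) float64 {
	src := make([]byte, bufSize)
	dst := make([]byte, bufSize)
	start := time.Now()
	for i := 0; i < iters; i++ {
		copy(dst, src)
	}
	sec := time.Since(start).Seconds()
	return float64(iters) * bufSize / sec / 1e9
}

func main() {
	for _, workers := range []int{1, 2, 4, 8} {
		results := make([]float64, workers)
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(w int) {
				defer wg.Done()
				results[w] = copyBandwidth(4)
			}(w)
		}
		wg.Wait()
		// Sum the per-worker results, as the bandwidth docs direct.
		total := 0.0
		for _, r := range results {
			total += r
		}
		fmt.Printf("%d workers: %.1f GB/s aggregate\n", workers, total)
	}
}
```

On a machine with plenty of memory channels, the aggregate figure should keep climbing with worker count; on a bandwidth-starved one, it flattens out quickly.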
Essentially, it doesn't appear that the performance difference between the VM.Standard2.16 cluster and the BM.Standard2.52 machine is due to memory bandwidth constraints. At high concurrency, we often see the BM.Standard2.52 having approximately triple the bandwidth of the VM.Standard2.16 system. So, for now, this mystery remains unsolved. My next theory would be that something in the Go runtime (perhaps the scheduler) suffers performance degradation at high core counts and is more efficient on a machine with 32 logical cores than on one with 104. This is pure speculation, however.
Using official Linux images helps pretty consistently on Oracle. The story is a lot more mixed on Amazon. I would love to know what sort of specific tuning is responsible for this, though I’m sure there are myriad kernel parameters that one might tweak to get the most out of a specific hardware configuration in a multi-tenant virtualized environment.
Google's automatic discounting is a significant advantage, though in my estimation, Oracle still wins on overall cost effectiveness. The GCP instances never overtake OCI by much in the $/MQ department, and the Oracle instances have over 3x (!!) the memory.
Amazon still takes the best overall raw performance, though Azure and OCI do pop up. Testing OCI's DenseIO instances with Oracle Linux and figuring out how to get NVME SSDs on Azure and GCP would likely make for a more equitable all-around comparison. It's worth noting that even without NVME, the AWS instances on EBS still do pretty well.
I’d still like to do some more low level benchmarking (like the memory bandwidth stuff) to get baseline performance of each aspect of each configuration’s hardware.
More importantly though, I think, I’d like to do more repeated runs on the same configurations and see what kind of consistency we’re getting. Some of the results presented here are difficult to explain, but repeated runs can yield significant variation.
Even without taking multi-tenancy into account, there are lots of factors that contribute to inconsistency. The language runtime comes immediately to mind - especially with garbage collection, but there are also OS tasks and other user-level programs potentially taking resources and cluttering up the CPU cache. Anything doing I/O is subject to the vagaries of external hardware, which is only exacerbated in the case of network mounted storage.
Banner Photo by Victor Rodriguez on Unsplash