InfluxDB vs. Cassandra for timeseries data (influxdata.com)
61 points by rar_ram on Sept 8, 2016 | hide | past | favorite | 32 comments


The linked article is an obviously bullshit benchmark that makes influxdb look good and cassandra look bad (by, surprise, the influxdb folks).

I'm far from a cassandra fanboy, but this really is just dishonest marketing. Not sure that works when your product is open source and the target audience is developers.

Some thoughts:

- The reason why cassandra uses so much more space to store the same data is that they've set up the cassandra table schema in such a way that cassandra needs to write the series ID string for each sample (while influxdb only needs to write the values). You easily get a 10-100x blowup just from that. There is no superior "compression" technology here but just an apples-to-oranges comparison.

- Then, comparing the queries is even worse, because they are testing a kind of query (aggregation) that cassandra does not support. To still get a benchmark where they're much faster, they just wrote some code that retrieves all the data from cassandra into a process and then executes the query within their own process. If anything, they're benchmarking one query tool they've written against another one of their own tools.

- Also, unless I missed something, the article doesn't say what kind of cluster they actually ran this on, or even whether they ran both tests on the same hardware. There definitely are cassandra clusters handling more than 100k writes/sec in production right now. So I guess they picked a peculiar configuration in which they outperform cassandra in terms of write ops (given a good distribution of keys, cassandra is more or less linearly scalable in this dimension).

- A better target to benchmark against would probably be http://opentsdb.net/ or http://prometheus.io/ - both seem to have somewhat similar semantics to InfluxDB (which cassandra and elasticsearch do not)
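To make the space-blowup point in the first bullet concrete, here's a back-of-envelope sketch (all names and sizes are hypothetical, not measured from either database) of how repeating a series-ID string per row inflates storage relative to the bare timestamp/value payload:

```python
# Hypothetical per-row cost: 8-byte timestamp + 8-byte float value,
# optionally plus a repeated series-ID string (per-row metadata ignored).
def bytes_per_sample(series_id_len, repeat_id_per_row):
    timestamp = 8  # uint64 timestamp
    value = 8      # float64 value
    return timestamp + value + (series_id_len if repeat_id_per_row else 0)

# A made-up but realistically sized series key:
series_id = "cpu,host=server-0042,region=us-west-2,datacenter=us-west-2a"

with_id = bytes_per_sample(len(series_id), True)      # 75 bytes/row
without_id = bytes_per_sample(len(series_id), False)  # 16 bytes/row
print(with_id / without_id)  # ~4.7x before any compression even runs
```

Longer keys or richer tag sets quickly push the factor into the 10-100x range mentioned above.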

DISC: I also work on a distributed database product (https://eventql.io) but it's neither a direct competitor to Cassandra nor InfluxDB nor any of the other products I've mentioned. I hope the comment doesn't come across as too harsh. The article raised some very big (and harsh) claims so I think it's fair to respond in tone.


I don't understand this benchmark at all. It cites the performance of a 1000-node cluster, but then shows 100k inserts per second in Cassandra. Later follow-up comments say the test was on a single machine. Without seeing the schema, 100k inserts/sec is reasonable for a single machine. For 1000 machines it would mean there is a pretty massive configuration issue.

If you are going to benchmark a distributed system, you really need to set up more than 1 server.

(Disclaimer - work at Datastax)


This confused me, too.

I think what they meant with "1000 nodes" is that the dataset they're using for the benchmark is synthetic monitoring data (where the thing being monitored are servers).

And the way they generated the synthetic data set is by having 1000 imaginary servers produce one sample per second (i.e. a script that writes out 1000 * duration_in_sec fake samples -- I believe this is the code that does it: https://github.com/influxdata/influxdb-comparisons/tree/mast...)
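A minimal sketch of that workload shape (not the actual influxdata generator; the series names and values here are made up):

```python
import random

def generate(num_hosts=1000, duration_sec=60, start_ts=1473333600):
    """Yield one fake CPU sample per host per second."""
    for t in range(duration_sec):
        for host in range(num_hosts):
            yield {
                "series": f"cpu,host=server-{host:04d}",
                "timestamp": start_ts + t,
                "value": random.random() * 100.0,
            }

samples = list(generate(num_hosts=1000, duration_sec=2))
print(len(samples))  # 1000 hosts * 2 seconds = 2000 samples
```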


Makes sense.

Posting 1 node benchmarks of distributed databases seems suboptimal.


Does it ever make sense to use Cassandra on a single node for anything but dev/test?

I am under the impression that Cassandra's performance comes from its distribution capabilities.


It does not make sense to only use 1 node. It's not designed to be a fast 1 node DB.

In fact for most dev I use 3 nodes on my laptop, and most of our "unit" tests are multi-node as well (closer to integration tests by most measures).


The tests were run on the same hardware, a single server. Bare metal, not VMs. InfluxDB writes the series string with everything. We tried to imitate what you'd need to do to get close to similar functionality doing time series like InfluxDB does in Cassandra.

If you're just going to write a bunch of uint64 keys with float64 values, of course Cassandra will get much faster. It would be trivial to make a time series database that outperforms InfluxDB with those limitations as well.

The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance.

Again, the point is that if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.


> The point of the comparison is that InfluxDB gives you a ton of functionality out of the box and has great performance. [...] if you want to do time series on Cassandra, you're going to write a bunch of the code yourself.

Fair enough. I'm sure InfluxDB is very good/fast at timeseries data (although I have to admit I haven't actually tried it out so far). Still, if that was your point, consider removing these statements from the blog:

> InfluxDB outperformed Cassandra by 4.5x when it came to data ingestion.

> InfluxDB outperformed Cassandra by delivering 10.8x better compression.

> InfluxDB outperformed Cassandra by delivering up to 168x better query performance.

Removing them would help make the point without putting the reader on the defensive (the statements are clearly not based on a fair comparison of the two products and will not hold under most conditions). Just my two cents.


Maybe, but we get asked all the time about Cassandra vs. us, both in terms of feature set and performance. And a performance comparison only makes sense for our potential users if we try to replicate the features on Cassandra.


Hasn't that work already been done? Cyanite and KairosDB both plug in to the broader Graphite ecosystem (more or less) and use Cassandra as a data store.

Time series data has also been a particular focus in the Cassandra community. DTCS was too complicated, so they came up with the easier and faster TWCS. I don't think this is on you, but I'd love to see a comparison with the latest stable 3.x and a multiple node cluster.


We'll be doing comparisons against Kairos and OpenTSDB in the coming months. We just get asked about Cassandra specifically quite a bit.


If you're testing those, it would be nice if you could test and make a comparison with the cassandra-based Blueflood as well.

https://github.com/rackerlabs/blueflood/wiki


If you want to test Cassandra, please test at least 9 nodes and have someone with Cassandra setup experience configure your cluster.


Thanks for the analysis of their benchmark. I wanted to see the details for myself, but it required creating an account on their page.

> There is no superior "compression" technology

Isn't it feasible to employ special encodings for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could store 1473333629, +1, +2 (where +1 and +2 fit in one byte each). And many metrics have similarly adjacent values, like averages and counters.
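A minimal sketch of one variant of this idea (consecutive deltas rather than offsets from the base value; the small deltas could then be packed into one byte each instead of eight):

```python
def delta_encode(timestamps):
    """Store the first timestamp in full, then each difference to its predecessor."""
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(encoded):
    """Rebuild the original timestamps by summing the deltas back up."""
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out

ts = [1473333629, 1473333630, 1473333631, 1473333635]
enc = delta_encode(ts)
print(enc)                      # [1473333629, 1, 1, 4]
assert delta_decode(enc) == ts  # round-trips losslessly
```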


Yes, the delta encoding scheme you described (along with other coding schemes such as bitpacking, varints, RLE, or combinations thereof) is frequently employed in columnar storage formats and databases. Columnar storage is basically a generalization that lets you apply these optimizations to all kinds of data (not just timeseries). One popular open-source implementation of columnar storage that I am not affiliated with is https://parquet.apache.org/.

(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)

My point about no "superior compression technology here" was specific to the linked benchmark. I.e. the lack of this potential optimization in cassandra does not appear to be the reason for the space blowup in the benchmark, but rather that they're duplicating the series ID for each sample.


A commercial DB that (also) does this is HP Vertica. They tout a 4:1 to 5:1 compression ratio on average; due to the nature of the data the firm I work for stores in it, we get quite a bit better than that. Delta encoding is just one of maybe 5 different schemes it can use for a given column.


Just so sad that Vertica is proprietary so we can't see how they did it! ;)

On a serious note: Please check out EventQL [0] some time. It's very similar to Vertica in some ways and completely open-source. It's a new project (beta) and not nearly as mature as Vertica yet though (still a long way to go).

[0] https://eventql.io/


Facebook does this (and quite a few other tricks) for storing time-series data in Gorilla (an in-memory TSDB; paper: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf), getting down to 1.37 bytes per sample.

Prometheus implemented the Gorilla bits (see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chu...) and reports getting down to 1.28 bytes per sample on some workloads, though at a cost of increased query latencies.
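A simplified sketch of the timestamp half of the Gorilla scheme (delta-of-delta encoding with the bit-bucket thresholds from the paper; the XOR compression of float values is omitted here):

```python
def dod_bits(timestamps):
    """Estimate bits needed to encode each timestamp after the first two,
    using Gorilla-style delta-of-delta buckets."""
    total = 0
    prev_delta = timestamps[1] - timestamps[0]
    for a, b in zip(timestamps[1:], timestamps[2:]):
        dod = (b - a) - prev_delta
        prev_delta = b - a
        if dod == 0:
            total += 1   # single '0' bit
        elif -63 <= dod <= 64:
            total += 9   # '10' prefix + 7-bit value
        elif -255 <= dod <= 256:
            total += 12  # '110' prefix + 9-bit value
        elif -2047 <= dod <= 2048:
            total += 16  # '1110' prefix + 12-bit value
        else:
            total += 36  # '1111' prefix + 32-bit value
    return total

# A perfectly regular one-sample-per-second series costs 1 bit per timestamp:
ts = list(range(1473333600, 1473333600 + 100))
print(dod_bits(ts))  # 98 samples after the first two, 1 bit each -> 98
```

Regular scrape intervals make delta-of-delta zero almost everywhere, which is where the sub-2-bytes-per-sample numbers come from.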


The conclusion isn't entirely surprising, "we from X say that engine X is better than engine Y" but there are many companies that have monitoring stacks built on top of Cassandra, like SignalFX. They have a presentation or two on the topic too that might be interesting: http://www.slideshare.net/planetcassandra/signalfx-making-ca...

Ultimately this benchmark will be heavily influenced by the code written to "emulate" the InfluxDB parts on top of Cassandra and how much of that code puts Cassandra at a disadvantage. I'd like to hear from some people that have built such solutions on top of Cassandra what they think about the benchmark and see how that benchmark would evolve.


From using InfluxDB (up to v0.10 I think it was), it's a great database but performance REALLY depends on the cardinality of your data.

I can't stress it enough, calculate your cardinality before switching over to it. If your cardinality looks good, InfluxDB is a perfect, logical choice. I really enjoyed it and it is dirt simple to figure out. We had a junior dev just out of college with little experience set it up and get a high level of proficiency in a matter of hours.

Edit: I should point out, I was doing about 10 million records on my db (hosted on a Mac Mini in development!) a day with a 2 week sliding window. I was pushing the data from InfluxDB into custom D3 visualizations. I would cache certain queries in Redis, so I wasn't always hitting InfluxDB with each read request.


We're working on the cardinality problem. It will be resolved in an upcoming release: we're moving the index over to a disk-based format that will hopefully not sacrifice lookup performance.


Can you explain the cardinality problem in a bit more detail? It's come up more than once in this thread.


https://docs.influxdata.com/influxdb/v1.0/concepts/glossary/...

You want to keep the number of distinct values you are indexing/tagging on low. In my situation, I was tracking what amounted to connections between nodes in a very large tree. I had a lot of distinct pairs, which means I had high cardinality. As the cardinality increases, a query that used to take a millisecond could take a couple of seconds.
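One way to estimate this before committing to a schema (a rough upper bound; the true series cardinality is the number of distinct tag combinations that actually occur, which can be much lower):

```python
def cardinality_upper_bound(tag_values):
    """Upper bound on series cardinality: the product of the number of
    distinct values per tag key."""
    n = 1
    for values in tag_values.values():
        n *= len(set(values))
    return n

# Made-up tag sets for illustration:
tags = {
    "host": [f"server-{i}" for i in range(100)],
    "region": ["us-east", "us-west", "eu-central"],
    "endpoint": ["/login", "/search", "/checkout", "/api"],
}
print(cardinality_upper_bound(tags))  # 100 * 3 * 4 = 1200 potential series
```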


So InfluxDB v1.0 has issues when the cardinality of the "primary key" (or candidate keys) gets high?

At what level of keys or tags did you start to see query performance become problematic?


Good to hear! I have a project coming up soon that I want to use it on.


Just looking at the domain, it's easy to guess which one will win...


Has anyone successfully compiled their benchmark code? https://github.com/influxdata/influxdb-comparisons

I added code to the data generator to work with Timely (https://nationalsecurityagency.github.io/timely/) but can't get it compiled.

Also, it seemed that ingest and query were separate stages. Queries should be run while ingest is running to get real-world performance, but I understand it is more difficult to test this way.


It would be interesting to compare memory requirements. I chose InfluxDB because it had 10 times lower memory usage. The dataset was small (a couple of million datapoints)... but still.


That only works when you have one series with a lot of observations. If you have many series with fewer observations (say 50k per series), InfluxDB uses absurd amounts of memory. I had to switch back to Cassandra because I constantly ran out of memory.


We're working on solving the high cardinality problem. Hopefully soon


How much memory are we talking about? How often did you execute queries on the data?

I'm asking because my first impression of Influxdb involved lots of memory gobbling.


Not sure why this blog post from July made it to the front page now.

Though 1.0 GA is being released today.



