

Cassandra writes in depth by Andrzej Ludwikowski


Get the Cassandra low-down!

Software Journeyman at SoftwareMill, Andrzej Ludwikowski gives us the catch-up we need on the scalable database, Cassandra. Have you used it in your programming?

What would be your Cassandra top tip?

 

Cassandra is a super fast and scalable database. In the right context, this statement is more or less true. Of course, the context is how we use the database, and believe me, if you have never used a distributed database before, this will be a completely different experience. Many people argue that Cassandra is actually not that fast when it comes to reads. In many cases that's true, and in many cases it is only the effect of misunderstanding the tool. It is certainly hard to compete with Cassandra on write performance, so let's take a deep dive into the write path.

There is no point in describing this concept again, so if you are not familiar with Cassandra's architecture, just check the resources below; otherwise, skip to the next section.

  1. https://www.youtube.com/watch?v=d9NvnMcTVdQ
  2. https://wiki.apache.org/cassandra/WritePathForUsers
  3. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
 
 

Why are writes so fast?

What if I told you that when you get the response from Cassandra that your write operation was successful, your data may have been stored only in memory for up to 10 seconds (the default setting) before the commit log was synced to disk? Keep in mind that it is the commit log that guarantees durability in case of a node restart or failure. Don't believe me? Just read the documentation carefully.

With quite a small load for Cassandra, like 10,000 writes/s, you can end up with 100,000 business operations stored, once again, only in memory, just like in Redis.
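That 10-second window comes from the commit log sync settings in cassandra.yaml. A sketch of the relevant options, with the defaults as documented for the Cassandra 3.x line (check the defaults for your own version):

```yaml
# Periodic mode (the default): writes are acknowledged right away,
# and the commit log is fsynced to disk only every 10 seconds.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Batch mode alternative: acknowledge writes only after the commit log
# has been fsynced, trading write latency for durability.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2
```

If that window is too risky for your use case, batch mode narrows it at the cost of slower writes.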

Before you start panicking and moving all your data to a good old RDBMS, hold on. Cassandra is a distributed DB. You should never launch it in production with a single-node setup. If you have a reasonable number of nodes in different racks/availability zones, you are pretty much fine, but still, you can never be 100% sure that you won't lose any data. This is how it works; nothing comes for free, especially in terms of performance. To be honest, if you think that your RDBMS is safer when it comes to durability, you are just lying to yourself. Data durability and replication could fill many other blog posts, but our main focus for today is the Cassandra write path.

 

Sequential updates

Let’s examine the simplest possible scenario — a few updates for the same primary key, one after another.

CREATE TABLE orders (
  id int,
  status text,
  country text,
  PRIMARY KEY (id)
);

INSERT INTO orders (id, status, country) VALUES ( 1, 'added', 'PL');
INSERT INTO orders (id, status, country) VALUES ( 1, 'verified', 'PL');
INSERT INTO orders (id, status, country) VALUES ( 1, 'shipped', 'PL');

How would this be persisted in the SSTable? You can verify it thanks to the sstabledump utility. Don't forget to flush the data to disk before using it.

./sstabledump ~/.ccm/test/node1/data0/my_keyspace/orders-8b2e7e2022d111e9abc5abfbe212c0f3/mc-1-big-Data.db
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 38,
        "liveness_info" : { "tstamp" : "2019-01-28T07:51:53.493550Z" },
        "cells" : [
          { "name" : "country", "value" : "PL" },
          { "name" : "status", "value" : "shipped" }
        ]
      }
    ]
  }
]

Nothing special, last write wins, everything as expected.
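The last-write-wins rule is simply a per-cell comparison of write timestamps. A minimal Python sketch of the idea (an illustration of the merge rule only, not Cassandra's actual code):

```python
def merge_writes(writes):
    """writes: list of (timestamp, {column: value}) for one partition key.
    Returns the winning value per column (highest timestamp wins)."""
    cells = {}  # column -> (timestamp, value)
    for tstamp, columns in writes:
        for column, value in columns.items():
            if column not in cells or tstamp > cells[column][0]:
                cells[column] = (tstamp, value)
    return {column: value for column, (_, value) in cells.items()}

# The three sequential inserts for order 1 from above:
writes = [
    (1, {"status": "added", "country": "PL"}),
    (2, {"status": "verified", "country": "PL"}),
    (3, {"status": "shipped", "country": "PL"}),
]
print(merge_writes(writes))  # {'status': 'shipped', 'country': 'PL'}
```

Note that the comparison is per cell, not per row: each column independently keeps its most recent write.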

 

Delayed updates

How would this behave if we updated the order status after a week? To simulate this, let's flush after each insert statement. I also tweaked the default compaction strategy a little to make it less aggressive:

ALTER TABLE orders WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 6 };

INSERT INTO orders (id, status, country) VALUES ( 2, 'added', 'UK');
#flush
INSERT INTO orders (id, status, country) VALUES ( 2, 'verified', 'UK');
#flush
INSERT INTO orders (id, status, country) VALUES ( 2, 'shipped', 'UK');
#flush

Now we have 4 SSTables:

-rw-rw-r-- 1 andrzej andrzej   44 sty 28 09:21 mc-1-big-Data.db
-rw-rw-r-- 1 andrzej andrzej   41 sty 28 09:22 mc-2-big-Data.db
-rw-rw-r-- 1 andrzej andrzej   44 sty 28 09:22 mc-3-big-Data.db
-rw-rw-r-- 1 andrzej andrzej   43 sty 28 09:22 mc-4-big-Data.db

and our updates for the second order are spread across 3 of them:

nodetool getsstables my_keyspace orders '2'

.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-4-big-Data.db
.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-3-big-Data.db
.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-2-big-Data.db

In detail:

./sstabledump mc-2-big-Data.db
[
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 33,
        "liveness_info" : { "tstamp" : "2019-01-28T08:22:05.237814Z" },
        "cells" : [
          { "name" : "country", "value" : "UK" },
          { "name" : "status", "value" : "added" }
        ]
      }
    ]
  }
]

./sstabledump mc-3-big-Data.db
[
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 36,
        "liveness_info" : { "tstamp" : "2019-01-28T08:22:15.525845Z" },
        "cells" : [
          { "name" : "country", "value" : "UK" },
          { "name" : "status", "value" : "verified" }
        ]
      }
    ]
  }
]

./sstabledump mc-4-big-Data.db
[
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 35,
        "liveness_info" : { "tstamp" : "2019-01-28T08:22:25.269595Z" },
        "cells" : [
          { "name" : "country", "value" : "UK" },
          { "name" : "status", "value" : "shipped" }
        ]
      }
    ]
  }
]

In other words, to provide the result for such a query:

cqlsh:my_keyspace> SELECT * FROM orders WHERE id = 2;

 id | country | status
----+---------+---------
  2 |      UK | shipped

Cassandra needs to merge information about the status column from 3 different files. Of course, the more SSTables Cassandra needs to read to answer a query, the slower the overall query performance.
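The read path for such a partition can be sketched as: collect the row fragments from every SSTable that contains the partition, then merge them, newest timestamp winning. The in-memory "SSTables" below are hypothetical stand-ins, just to illustrate why touching 3 files is more work than touching one:

```python
# Hypothetical in-memory stand-ins for the three flushed SSTables above.
sstables = [
    {"2": {"tstamp": 1, "cells": {"country": "UK", "status": "added"}}},
    {"2": {"tstamp": 2, "cells": {"country": "UK", "status": "verified"}}},
    {"2": {"tstamp": 3, "cells": {"country": "UK", "status": "shipped"}}},
]

def read_row(key):
    # Every fragment lookup stands in for a disk seek on a real node.
    fragments = [t[key] for t in sstables if key in t]
    result = {}
    for frag in sorted(fragments, key=lambda f: f["tstamp"]):
        result.update(frag["cells"])  # newer fragments overwrite older cells
    return result, len(fragments)

row, files_touched = read_row("2")
print(row, files_touched)  # {'country': 'UK', 'status': 'shipped'} 3
```

Had the partition lived in a single SSTable, the same answer would cost one seek instead of three.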
 

Idempotent writes

Idempotent writes are one of my favourite Cassandra features. Why? Long story short: with idempotent writes you can achieve effectively-once delivery semantics, and in a scalable distributed system this is a crucial attribute.

Are idempotent writes handled differently than any other write operation?

INSERT INTO orders (id, status, country) VALUES ( 3, 'added', 'IT');
#flush
INSERT INTO orders (id, status, country) VALUES ( 3, 'added', 'IT');
#flush
INSERT INTO orders (id, status, country) VALUES ( 3, 'added', 'IT');
#flush

Of course not:

nodetool getsstables my_keyspace orders '3'

.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-8-big-Data.db
.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-7-big-Data.db
.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-6-big-Data.db

In this case, the exact same information is duplicated across 3 SSTable files. Is it a problem? In most cases not, because retries of the same statement usually occur within a short time range, so they are merged in memory before being flushed to disk. On the other hand, if you need to reprocess some old data, you trade this data duplication (at least until the next compaction) for write idempotency, and you don't have to care about duplicated writes.
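The convergence property can be shown with a toy model: replaying the same timestamped write any number of times leaves the table in exactly the state a single successful write would have (a simplified Python sketch, not Cassandra code):

```python
def apply_write(table, key, columns, tstamp):
    """Apply one timestamped write to an in-memory 'table'.
    Equal-or-newer timestamps overwrite, so replays are harmless."""
    row = table.setdefault(key, {})
    for col, val in columns.items():
        prev = row.get(col)
        if prev is None or tstamp >= prev[0]:
            row[col] = (tstamp, val)

# A client that timed out retries the exact same INSERT three times...
retried = {}
write = ("3", {"status": "added", "country": "IT"}, 100)
for _ in range(3):
    apply_write(retried, *write)

# ...and ends up with the same state as a single successful write.
once = {}
apply_write(once, *write)
print(retried == once)  # True
```

This is why fixed, client-supplied timestamps matter for retries: the replayed write carries the same timestamp, so it can never shadow anything newer.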

 

Tombstones

Cassandra tombstones are pretty famous; you can find dozens of articles about the issues they cause. I won't go into the details here, but will show a short summary of the 2 types of tombstones instead.
 

Cell level

cqlsh:my_keyspace> INSERT INTO orders (id, status, country) VALUES ( 4, 'added', 'PL');
#flush
cqlsh:my_keyspace> DELETE country FROM orders WHERE id = 4;
#flush

nodetool getsstables my_keyspace orders '4'
.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-11-big-Data.db
.../my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-10-big-Data.db

What’s in 'mc-11-big-Data.db'?

./sstabledump ~/.ccm/test/node1/data0/my_keyspace/orders-90eefa7022d511e9abc5abfbe212c0f3/mc-11-big-Data.db
[
  {
    "partition" : {
      "key" : [ "4" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 24,
        "cells" : [
          { "name" : "country", "deletion_info" : { "local_delete_time" : "2019-01-28T10:41:25Z" },
            "tstamp" : "2019-01-28T10:41:25.125459Z"
          }
        ]
      }
    ]
  }
]

After compacting:

{
  "partition" : {
    "key" : [ "4" ],
    "position" : 94
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 134,
      "liveness_info" : { "tstamp" : "2019-01-28T10:38:58.742212Z" },
      "cells" : [
        { "name" : "country", "deletion_info" : { "local_delete_time" : "2019-01-28T10:41:25Z" },
          "tstamp" : "2019-01-28T10:41:25.125459Z"
        },
        { "name" : "status", "value" : "added" }
      ]
    }
  ]
}

 

Partition level

cqlsh:my_keyspace> INSERT INTO orders (id, status, country) VALUES ( 5, 'added', 'PL');
#flush
cqlsh:my_keyspace> DELETE FROM orders WHERE id = 5;
#flush

After compacting:

{
  "partition" : {
    "key" : [ "5" ],
    "position" : 0,
    "deletion_info" : { "marked_deleted" : "2019-01-28T10:52:55.690805Z", "local_delete_time" : "2019-01-28T10:52:55Z" }
  },
  "rows" : [ ]
},

Depending on your table schema, you can also observe row-level tombstones and range tombstones. I encourage you to check this by trying them out on your own.

What is important to remember about tombstones is that instead of removing data, we store even more information, usually in different SSTables, and this data is removed (or comes back as a zombie) only after gc_grace_seconds (10 days by default).
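That life cycle can be summed up in a small decision function (a simplified Python model of what compaction keeps for one cell; real compaction also takes repairs and overlapping SSTables into account):

```python
GC_GRACE_SECONDS = 864_000  # 10 days, the default

def compact_cell(value_ts, tombstone_ts, now):
    """Decide what compaction keeps for a single cell:
    its live value, its tombstone, or nothing at all."""
    if tombstone_ts is None or value_ts > tombstone_ts:
        return "value"       # the write is newer than the delete
    if now - tombstone_ts < GC_GRACE_SECONDS:
        return "tombstone"   # the delete still has to reach all replicas
    return None              # grace period over: drop value and tombstone

delete_ts = 1_000_000
print(compact_cell(999_000, delete_ts, delete_ts + 3_600))      # tombstone
print(compact_cell(999_000, delete_ts, delete_ts + 2_000_000))  # None
```

The zombie scenario falls out of this model: if a replica that missed the delete is repaired only after the tombstone has been dropped, its old value is newer than nothing, so the "deleted" data wins and comes back.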

 

Summary

The main takeaway from this post should be that your read performance problems start with your writes. Although Cassandra uses very fancy mechanisms to optimize the read path, the rule of thumb is simple: keep your partition in a single SSTable! If that's not possible, try to minimize the spread ratio. Disk seeks are among the slowest operations on the read path. Before you start tuning the Cassandra JVM configuration, check different compaction strategies. Validate your configuration and schema design with the SSTable utilities. And of course, don't ignore a good monitoring system that can signal potential problems before they hit a critical level.

What is the best load type for Cassandra? No updates, no deletes, just append-only operations. If you are thinking about Event Sourcing: bingo! An event store with 'SizeTieredCompactionStrategy' is a perfect match for this database.

 
This article was written by Andrzej Ludwikowski and posted originally on SoftwareMill Blog.