Simulating Cassandra nodes in local environments, closer to developers.

20 minute read

There is a lot of information about Cassandra spread across the internet, including its official website, where we can even find features that are not documented yet. We will not explain Cassandra's installation process here; you can find it on:

So, what I want to explain are the steps to simulate the behaviour of a cluster with 2 Cassandra nodes, mainly from the developer's point of view. But before doing that, let me tell you something you should know about Cassandra's infrastructure:

A Cluster: 1–n Datacenters

A Datacenter: 1–n Nodes

A Node: an instance of the Cassandra database

One of the most important features of the structure described above is that it has no single point of failure: data can go to any Node in a given DataCenter in a Cluster and is replicated around the ring.

Before any data is persisted to Cassandra, it must first be written to a commit log and then saved to a memtable. When you start the Cassandra database, any mutations in the commit log are applied to the memtables; every Node in this architecture has the same structure. Configuring the commit log is truly important because it lets you optimize all mutation writes, so there are some properties you should look at before starting to operate Cassandra.

You can find these properties in the Cassandra configuration file ($CASSANDRA_HOME/conf/cassandra.yaml):

  • commitlog_segment_size_in_mb: [32]
  • commitlog_sync: [periodic]
  • commitlog_sync_batch_window_in_ms
  • commitlog_sync_period_in_ms: [10000]
  • commitlog_total_space_in_mb
  • memtable_cleanup_threshold
  • memtable_allocation_type

Cassandra's properties already have default values. They work properly in most cases, but if you need some tailored behavior you should modify the specific property manually. The values between brackets above are the defaults. Pay attention to commitlog_total_space_in_mb, because a small value will cause more flush activity on less-active column families. This is what the documentation says: the default value is the smaller of 8192 and 1/4 of the total space of the commit log volume.
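As an illustration, the commit log section of cassandra.yaml with the default values mentioned above would look roughly like this (only a sketch; uncomment and change values only if your workload needs it):

  commitlog_segment_size_in_mb: 32
  commitlog_sync: periodic
  commitlog_sync_period_in_ms: 10000
  # commitlog_total_space_in_mb: 8192   # cap on total commit log size; smaller values mean more flush activity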

We are going to simulate a cluster with 2 Cassandra nodes in a development environment:

  1. We will create a Cluster that contains a DataCenter with 2 Cassandra nodes. In a real environment, each Node should be on a different machine (virtual or not); for example, one Node (an instance of Cassandra) per Ubuntu machine, with its corresponding JDK and its Python installation if you are planning to use cqlsh. However, in our tests all nodes will be on the same machine, using the same installation of Java and Python.
  2. To simulate two Cassandra instances we have to download a Cassandra installation and unzip it into 2 different folders. When you download the installation, all the paths set in the Cassandra configuration file ($CASSANDRA_HOME/conf/cassandra.yaml) are commented out, and when Cassandra parses the file internally it assumes that the root of all paths is the Cassandra home directory. If we want paths that point somewhere else, we must change them manually in the cassandra.yaml file of each installation (a minimal sketch of this two-node setup is shown after this list).
  3. Modify the Cassandra configuration file ($CASSANDRA_HOME/conf/cassandra.yaml):
  • [Node1]. The first thing you should do is change the name of the cluster to which all our nodes belong: all nodes that belong to the same cluster must have the same cluster name in their configuration files. [cluster_name:]
  • [Node1]. Change the directory where the database's data files will be stored (not mandatory; see point 2 above about default values). [data_file_directories: $CASSANDRA_HOME/data/data]
  • [Node1]. Change the directory where the commit log will be stored (not mandatory; see point 2 above about default values). [commitlog_directory: $CASSANDRA_HOME/data/commitlog]
  • [Node1]. Set the seeds so that the nodes can find each other and learn the topology of the ring (not mandatory per point 2 about default values, but if you have more than one Node this value must be changed). [seed_provider:]
  • [Node1]. Change the directory where the caches will be stored (same note on default values). [saved_caches_directory: $CASSANDRA_HOME/data/saved_caches]
  • [Node2]. Same changes as [Node1].
  • [Node2]. Change the listening IP address of Node2 to one different from the default, because the default is already used by Node1. We must change this if we want multiple nodes to be able to communicate: this is the address or interface to bind to, and the one other Cassandra nodes are told to connect to. [listen_address:]
  • [Node2]. Change the IP of our network interface as well [rpc_address:]. I have to highlight several aspects:

– In our setup, listen_address = rpc_address.

– When setting rpc_address we have to define a network interface that allows the nodes to communicate. I will do it with the following command:

sudo ifconfig lo0 alias <new address>. We have tested it on OS X and CentOS 7.
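To tie points 2 and 3 together, here is a minimal sketch of the two-node setup. The folder names (node1, node2), the cluster name DevCluster, the tarball name and the 127.0.0.2 alias are only examples, not values from the original configuration:

  # unpack the same Cassandra distribution into two folders
  mkdir node1 node2
  tar -xzf apache-cassandra-bin.tar.gz -C node1 --strip-components=1
  tar -xzf apache-cassandra-bin.tar.gz -C node2 --strip-components=1

  # add a second loopback address for Node2 (OS X syntax, as above)
  sudo ifconfig lo0 alias 127.0.0.2

  # node1/conf/cassandra.yaml (relevant lines)
  cluster_name: 'DevCluster'
  listen_address: 127.0.0.1
  rpc_address: 127.0.0.1
  seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "127.0.0.1"

  # node2/conf/cassandra.yaml (relevant lines)
  cluster_name: 'DevCluster'
  listen_address: 127.0.0.2
  rpc_address: 127.0.0.2
  seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
            - seeds: "127.0.0.1"

Both nodes share the same cluster_name and the same seed (Node1), while each node binds to its own loopback address.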

In $CASSANDRA_HOME/bin/cassandra.in.sh I will define the configuration variables to be used. This file tells us which folders will be used while our Cassandra Node is running; they can be changed to any other path. Pay attention to the permissions of the files/folders that we change manually in this file.

The configuration variables are detailed in the Cassandra configuration file, including their default values.
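As a sketch, the per-node variables you would typically pin in cassandra.in.sh look like this (the /opt/cassandra/node1 path is only an example):

  # node1/bin/cassandra.in.sh
  CASSANDRA_HOME=/opt/cassandra/node1
  CASSANDRA_CONF=$CASSANDRA_HOME/conf

Each copy of the installation gets its own values, so the two nodes never read each other's configuration or data folders.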

I will put special emphasis on the following:

The partitioner is responsible for distributing the data (partitions) across the DataCenter / Cluster. For many developers coming from the world of relational databases this aspect is extremely important because, as we will see later, how our database is partitioned influences how well data manipulation operations can be optimized.

It has an optimal default value (org.apache.cassandra.dht.Murmur3Partitioner). The value of this property cannot be changed once the database has been created without reloading all the data again.
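For reference, the corresponding line in cassandra.yaml is simply:

  partitioner: org.apache.cassandra.dht.Murmur3Partitioner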

It would be quite interesting to see how much time we spend configuring our caches. We have the internal cache we implement in our applications (http://www.ehcache.org or https://redis.io), the cache for HTTP requests that reduces our traffic (http://www.squid-cache.org, https://varnish-cache.org, https://nginx.org), and we could go on with another article dedicated only to caches. In any case, caching in Cassandra is implemented very carefully, and there is a set of properties that allows us to tune it:

  • prepared_statements_cache_size_mb
  • key_cache_size_in_mb
  • key_cache_save_period
  • key_cache_keys_to_save
  • row_cache_class_name
  • row_cache_size_in_mb
  • row_cache_save_period
  • row_cache_keys_to_save
  • counter_cache_size_in_mb
  • counter_cache_save_period
  • counter_cache_keys_to_save

The above properties allow us to tune the Cassandra cache: its freshness and its size, depending on how dynamic or static our data is. My recommendation is to first configure the Cassandra cache in the most optimal way, then the application cache, and finally the request caches, in case our project invokes services or requests.
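As an illustration, a cache-tuning excerpt of cassandra.yaml could look like this (the numbers are only examples to show the shape of the configuration, not recommendations):

  key_cache_size_in_mb: 100          # cap the key cache at 100 MB
  key_cache_save_period: 14400       # persist the key cache to disk every 4 hours
  row_cache_size_in_mb: 0            # 0 disables the row cache
  counter_cache_size_in_mb: 50       # cap the counter cache at 50 MB
  counter_cache_save_period: 7200    # persist the counter cache every 2 hours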

Once the basic properties have been configured on our nodes, we can start creating the database schemas/tables. This point is pretty well documented on the product website (http://cassandra.apache.org/doc/latest/cql/ddl.html#).

Replication: it is specific to each KeySpace and determines which nodes are replicas for a given token range.

We can create our Keyspace when all Nodes are up (which would be the correct way to do it) or when only one Node is up. In the latter case, when Cassandra creates the keyspace and tries to replicate it, the operation throws an obvious error indicating that a Node may be down, but it creates the keyspace on the active node without any problems.
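A minimal keyspace for this two-node ring could be created from cqlsh like this (the keyspace name demo is only an example):

  CREATE KEYSPACE demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

With replication_factor 2, every row of the keyspace is replicated on both nodes of our ring.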

The proper choice of the partition key and clustering columns for a table is probably one of the most important aspects of data modeling in Cassandra, and it largely impacts which queries can be performed and how efficient they are.

Why is it so important?

[partition key] Write operations on rows that belong to the same partition are done atomically and in isolation, which is very different from what happens when we write across different partitions.

[partition key] Replicas of rows that belong to the same partition (the same partition key) will be located on the same nodes.

[clustering order] It defines the ordering of rows within each partition of the table.
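To make those three points concrete, here is a hypothetical table (all names are illustrative). sensor_id is the partition key, so all readings of one sensor live in the same partition and on the same replicas, while reading_ts is the clustering column that orders the rows inside each partition:

  CREATE TABLE demo.sensor_readings (
      sensor_id  uuid,
      reading_ts timestamp,
      value      double,
      PRIMARY KEY ((sensor_id), reading_ts)
  ) WITH CLUSTERING ORDER BY (reading_ts DESC);

A query such as SELECT * FROM demo.sensor_readings WHERE sensor_id = ? then hits a single partition and returns the readings already sorted by reading_ts.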

All the tests we have made here are closer to the developer's real life when working with Cassandra, simulating how to work with it in a development environment. You can request information about any node via the Cassandra nodetool, and run any DDL or DML command with cqlsh.
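For example (127.0.0.2 is the alias address used earlier for Node2):

  nodetool status               # ring overview: state, load and tokens of every node
  nodetool -h 127.0.0.2 info    # detailed information about a specific node
  cqlsh 127.0.0.2               # open a CQL shell against Node2 to run DDL or DML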