A Calcite adapter for Apache Cassandra



For those not familiar, Apache Calcite is a generic SQL query optimizer which can execute SQL queries over multiple backend data sources. This is a powerful concept because it allows complex queries to be executed over sources which provide much simpler interfaces from CSV files to MongoDB. Calcite is also leveraged as the cost-based-optimizer framework for the Apache Hive data warehouse.

Much of my PhD research has revolved around generating optimized schemas for NoSQL databases such as Apache Cassandra. (For a proof-of-concept tool, check out the NoSQL Schema Evaluator.) On discovering calcite, this seemed like a good fit with my work. One of the challenges with using NoSQL databases for complex queries is the necessity of working within the restrictions set by the query language. In previous work, I built a very simple query execution on top of Cassandra designed to execute a predefined set of query plans. Leveraging Calcite, it is possible to execute a very complete dialect of SQL on top of any defined data source (which calcite calls “adapters”).

Unfortunately, Calcite did not already have an adapter for Cassandra. Fortunately, writing an adapter is a fairly straightforward process, so I decided to take this on. The simplest possible implementation of an adapter provides a set of tables along with a scan operator to retrieve all the rows in the tables. While this is sufficient to enable Calcite to perform query execution, scanning a table in Cassandra is very inefficient. This is a result of the fact that partitions in a Cassandra table are commonly distributed across nodes via hash partitioning. While it is possible to retrieve all rows, they will be produced in a random order and the query will need to contact all nodes in the database. Assuming that the query the user wants to issue does not need to touch all rows in a table, it is possible to use filtering in the Cassandra Query Language (CQL) to push filtering down to Cassandra.

The current version of the adapter also supports exploiting the native sort order of Cassandra tables by clustering key. There is still a lot of work to be done, but an initial version of this adapter should be shipped in Calcite 1.7.0. Until the release, you’ll have to compile from source. A quick set of commands to get things running is below.

$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install

# You will need to create a JSON document which provides connection information
# An example can be found in ./cassandra/src/test/resources/model.json
$ ./sqlline
sqlline> !connect jdbc:calcite:model=path/to/cassandra/model.json admin admin

At this point you can write SQL queries which reference your Cassandra tables. Note that table names need to be quoted and there will likely be some failures with certain query patterns. You can view the proposed plan for a query by prefixing it with EXPLAIN PLAN FOR in the sqlline shell. This will show whether the query is able to exploit filtering or sorting directly in CQL. This is a long way from making Cassandra a viable data warehouse, but it may be helpful for performing occasional analytical queries without needing to write a significant amount of code.

Update: March 27, 2016

Calcite 1.7.0 has now been released which includes the Cassandra adapter. In addition to what was discussed above, the adapter also now automatically recognizes materialized views. Documentation is available on the Calcite website.