Archive for category Big Data

Retrieving UTF-8 values from Cassandra

I recently tried out Apache Cassandra which is a NoSQL solution that was initially developed by Facebook and designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure.

In order to populate the database, I used Apache Flume and Flume NG Apache Cassandra Sink which helped me to inject logs into it. But let’s focus on Cassandra here, I will write posts about Flume later on.

This is the Cassandra schema I was using (which is the one suggested by the Cassandra sink):

create keyspace logs with
   strategy_options = {datacenter1:1}

use logs;

create column family records with
   comparator = UTF8Type
   and gc_grace = 86400

After adding the data into the database, I wanted to fetch them to make sure everything went well.
I tried by three different ways:

  1. Cassandra CLI
    To return the first 100 rows (and all associated columns) from the records column family, I ran the following command:

    LIST records;

    However, the rows were looking like:

    RowKey: 6c6f67696b6465763a32653630306661362d633664652d346336612d386561352d323636326533353661616332
    => (column=data, value=39312e36362e3233392e323530202d202d205b32392f4465632f323031323a30393a31383a3433202d303730305d2022
    6f782f31372e3022, timestamp=1357169131235135)
    => (column=host, value=39312e36362e3233392e323530, timestamp=1357169131235134)
    => (column=src, value=6c6f67696b646576, timestamp=1357169131235133)
    => (column=ts, value=323031322d31322d32395431363a31383a34332e3030305a, timestamp=1357169131235132) 

    This behavior is clearly explained on the DataStax page Getting Started Using the Cassandra CLI:

    Cassandra stores all data internally as hex byte arrays by default. If you do not specify a default row key validation class, column comparator and column validation class when you define the column family, Cassandra CLI will expect input data for row keys, column names, and column values to be in hex format (and data will be returned in hex format).
    To pass and return data in human-readable format, you can pass a value through an encoding function. Available encodings are:
    * ascii
    * bytes
    * integer (a generic variable-length integer type)
    * lexicalUUID
    * long
    * utf8

    Which means that we need to specify the encoding in which column family data should be returned. We can do it for the entire client session using the following commands:

    ASSUME records KEYS AS utf8;
    ASSUME records COMPARATOR AS utf8;
    ASSUME records VALIDATOR AS utf8;

    So if we now run the previous command, the rows are looking like:

    RowKey: logikdev:2e600fa6-c6de-4c6a-8ea5-2662e356aac2
    => (column=data, value= - - [29/Dec/2012:09:18:43 -0700] "GET /wp-content/plugins/syntaxhighlighter/syntax
    highlighter2/scripts/clipboard.swf HTTP/1.1" 200 1659 "
    xy/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0", timestamp=1357169131235135)
    => (column=host, value=, timestamp=1357169131235134)
    => (column=src, value=logikdev, timestamp=1357169131235133)
    => (column=ts, value=2012-12-29T16:18:43.000Z, timestamp=1357169131235132)

    Much better! 😉

  2. CQL
    In order to retrieve columns in the records column family, we can use the following SELECT command:

    SELECT * FROM records LIMIT 1;

    However, the output looks like:

     key                                                                                        | column1 | value
     6c6f67696b6465763a31656438616166362d386438322d343330372d386133612d336438363664633133323636 |    data | 3131352e313131

    As you can see, we have a similar problem than in Cassandra CLI. And funny enough, the resolution is also similar except that the syntax is a bit different.
    Indeed, to treat the column values in the table records as being of the type UTF-8, we need to run the following commands:

    ASSUME records(key) VALUES ARE text;
    ASSUME records(value) VALUES ARE text

    First of all, note that we overrode the type as ‘text’ instead of UTF-8 as explained on this page.
    Secondly, the cqlsh help page says we can simply run the command ASSUME records VALUES are text; which is supposed to treat all column values in the table records as being of the type text. But, for some reason, this is not working and we have to run it for each column value.

  3. twitter/cassandra
    Finally, I tried this Ruby client (which can be found here) directly with the Interactive Ruby Shell (IRB).
    The first thing we have to do is to connect to Cassandra using the following commands:

    require 'cassandra'
    client ='logs', '')

    We then can run the command below to get one of the rows:

    client.get(:records, 'logikdev:69a2ef29-599f-48f0-99ef-d4c2ad1f4169')

    This would return something like the following:

    => #" - - [29/Dec/2012:09:48:22 -0700] \"GET /wp-content/plugins/syntaxhighlighter/s
    yntaxhighlighter2/styles/shThemeDefault.css?ver=2.1.364 HTTP/1.1\" 200 3936 \"
    -your-twitter-status-with-java/\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0\"", "host"
    =>"", "src"=>"logikdev", "ts"=>"2012-12-29T16:48:22.000Z"}
    {"data"=>1357169131284005, "host"=>1357169131284004, "src"=>1357169131284003, "ts"=>1357169131284002}>

    And miracle!
    As you can see, there is no need to specify any type this time, the data are already decoded. 🙂

After some more research, I found a better solution to our problem: you can define the validation class in the create statement of the column family. It means that, for our example, the Cassandra schema would be:

create keyspace logs with
   strategy_options = {datacenter1:1}

use logs;

create column family records with
   comparator = UTF8Type
   and gc_grace = 86400
   and key_validation_class = UTF8Type
   and default_validation_class = UTF8Type

With this solution, there is no need to use the ASSUME command, the data will be displayed properly.

, , , , , , , , ,