• This repository has been archived on 18/Sep/2021
  • Stars
    star
    244
  • Rank 165,885 (Top 4 %)
  • Language
    Scala
  • License
    Other
  • Created almost 13 years ago
  • Updated about 12 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A Scala client for Cassandra

Cassie

Cassie is a small, lightweight Cassandra client built on Finagle with with all that provides plus column name/value encoding and decoding.

It is heavily used in production at Twitter so such be considered stable, yet it is incomplete in that it doesn't support the full feature set of Cassandra and will continue to evolve.

Requirements

  • Java SE 6
  • Scala 2.8
  • Cassandra 0.8 or later
  • sbt 0.7

Note that Cassie is usable from Java. Its not super easy, but we're working to make it easier.

Let's Get This Party Started

In your simple-build-tool project file, add Cassie as a dependency:

val twttr = "Twitter's Repository" at "http://maven.twttr.com/"
val cassie = "com.twitter" % "cassie" % "0.19.0"

Finagle

Before going further, you should probably learn about Finagle and its paradigm for asynchronous computing– https://github.com/twitter/finagle.

Connecting To Your Cassandra Cluster

First create a cluster object, passing in a list of seed hosts. By default, when creating a connection to a Keyspace, the given hosts will be queried for a full list of nodes in the cluster. If you don't want to report stats use NullStatsReceiver.

val cluster = new Cluster("host1,host2", OstrichStatsReceiver)

Then create a Keyspace instance which will use Finagle to maintain per-node connection pools and do retries:

val keyspace = cluster.keyspace("MyCassieApp").connect()
// see KeyspaceBuilder for more options here. Try the defaults first.

(If you have some nodes with dramatically different latency—e.g., in another data center–or if you have a huge cluster, you can disable keyspace mapping via "mapHostsEvery(0.minutes)" in which case clients will connect directly to the seed hosts passed to "new Cluster".)

A Quick Note On Timestamps

Cassandra uses client-generated timestamps to determine the order in which writes and deletes should be processed. Cassie previously came with several different clock implementations. Now all Cassie users use the MicrosecondEpochClock and timestamps should be mostly hidden from users.

A Longer Note, This Time On Column Names And Values

Cassandra stores the name and value of a column as an array of bytes. To convert these bytes to and from useful Scala types, Cassie uses Codec parameters for the given type.

For example, take adding a column to a column family of UTF-8 strings:

val strings = keyspace.columnFamily[Utf8Codec, Utf8Codec, Utf8Codec]
strings.insert("newstring", Column("colname", "colvalue"))

The insert method here requires a String and Column[String, String] because the type parameters of the columnFamily call were all Codec[String]. The conversion between Strings and ByteArrays will be seamless. Cassie has codecs for a number of data types already:

  • Utf8Codec: character sequence encoded with UTF-8
  • IntCodec: 32-bit integer stored as a 4-byte sequence
  • LongCodec: 64-bit integer stored as an 8-byte sequence
  • LexicalUUIDCodec a UUID stored as a 16-byte sequence
  • ThriftCodec a Thrift struct stored as variable-length sequence of bytes

Accessing Column Families

Once you've got a Keyspace instance, you can load your column families:

val people  = keyspace.columnFamily[Utf8Codec, Utf8Codec, Utf8Codec]("People")
val numbers = keyspace.columnFamily[Utf8Codec, Utf8Codec, IntCodec]("People",
                defaultReadConsistency = ReadConsistency.One,
                defaultWriteConsistency = WriteConsistency.Any)

By default, ColumnFamily instances have a default ReadConsistency and WriteConsistency of Quorum, meaning reads and writes will only be considered successful if a quorum of the replicas for that key respond successfully. You can change this default or simply pass a different consistency level to specific read and write operations.

Reading Data From Cassandra

Now that you've got your ColumnFamily, you can read some data from Cassandra:

people.getColumn("codahale", "name")

getColumn returns an Future[Option[Column[Name, Value]]] where Name and Value are the type parameters of the ColumnFamily. If the row or column doesn't exist, None is returned. Explaining Futures is out of scope for this README, go the Finagle docs to learn more. But in essence you can do this:

people.getColumn("codahale", "name") map { _ match { case col: Some(Column[String, String]) => # we have data case None => # there was no column } } handle { case e => { # there was an exception, do something about it } }

This whole block returns a Future which will be satisfied when the thrift rpc is done and the callbacks have run.

Anyway, continuing– you can also get a set of columns:

people.getColumns("codahale", Set("name", "motto"))

This returns a Future[java.util.Map[Name, Column[Name, Value]]], where each column is mapped by its name.

If you want to get all columns of a row, that's cool too:

people.getRow("codahale")

Cassie also supports multiget for columns and sets of columns:

people.multigetColumn(Set("codahale", "darlingnikles"), "name")
people.multigetColumns(Set("codahale", "darlingnikles"), Set("name", "motto"))

multigetColumn returns a Future[Map[Key, Map[Name, Column[Name, Value]]]] whichmaps row keys to column names to columns.

Asynchronous Iteration Through Rows and Columns

NOTE: This is new/experimental and likely to change in the future.

Cassie provides functionality for iterating through the rows of a column family and columns in a row. This works with both the random partitioner and the order-preserving partitioner, though iterating through rows in the random partitioner had undefined order.

You can iterate over every column of every row:

val finished = cf.rowsIteratee(100).foreach { case(key, columns) => println(key) //this function is executed async for each row println(cols) } finished() //this is a Future[Unit]. wait on it to know when the iteration is done

This gets 100 rows at a time and calls the above partial function on each one.

Writing Data To Cassandra

Inserting columns is pretty easy:

people.insert("codahale", Column("name", "Coda"))
people.insert("codahale", Column("motto", "Moar lean."))

You can insert a value with a specific timestamp:

people.insert("darlingnikles", Column("name", "Niki").timestamp(200L))
people.insert("darlingnikles", Column("motto", "Told ya.").timestamp(201L))

Batch operations are also possible:

people.batch() { cf =>
  cf.insert("puddle", Column("name", "Puddle"))
  cf.insert("puddle", Column("motto", "Food!"))
}.execute()

(See BatchMutationBuilder for a better idea of which operations are available.)

Deleting Data From Cassandra

First, it's important to understand exactly how deletes work in a distributed system like Cassandra.

Once you've read that, then feel free to remove a column:

people.removeColumn("puddle", "name")

Or a set of columns:

people.removeColumns("puddle", Set("name", "motto"))

Or even a row:

people.removeRow("puddle")

Generating Unique IDs

If you're going to be storing data in Cassandra and don't have a naturally unique piece of data to use as a key, you've probably looked into UUIDs. The only problem with UUIDs is that they're mental, requiring access to MAC addresses or Gregorian calendars or POSIX ids. In general, people want UUIDs which are:

  • Unique across a large set of workers without requiring coordination.
  • Partially ordered by time.

Cassie's LexicalUUIDs meet these criteria. They're 128 bits long. The most significant 64 bits are a timestamp value (from Cassie's strictly-increasing Clock implementation). The least significant 64 bits are a worker ID, with the default value being a hash of the machine's hostname.

When sorted using Cassandra's LexicalUUIDType, LexicalUUIDs will be partially ordered by time -- that is, UUIDs generated in order on a single process will be totally ordered by time; UUIDs generated simultaneously (i.e., within the same clock tick, given clock skew) will not have a deterministic order; UUIDs generated in order between single processes (i.e., in different clock ticks, given clock skew) will be totally ordered by time.

See Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM (1978) vol. 21 (7) pp. 565 and Mattern. Virtual time and global states of distributed systems. Parallel and Distributed Algorithms (1989) pp. 215–226 for a more thorough discussion.

Things What Ain't Done Yet

  • Authentication
  • Meta data (e.g., describe_*)

Thanks

Many thanks to (pre twitter fork):

  • Cliff Moon
  • James Golick
  • Robert J. Macomber

License

Copyright (c) 2010 Coda Hale Copyright (c) 2011-2012 Twitter, Inc.

Published under The Apache 2.0 License, see LICENSE.

More Repositories

1

snowflake

Snowflake is a network service for generating unique ID numbers at high scale with some simple guarantees.
Scala
7,648
star
2

diffy

Find potential bugs in your services with Diffy
Scala
3,825
star
3

flockdb

A distributed, fault-tolerant graph database
Scala
3,337
star
4

kestrel

simple, distributed message queue system (inactive)
Scala
2,774
star
5

twui

A UI framework for Mac based on Core Animation
Objective-C
2,740
star
6

CocoaSPDY

SPDY for iOS and OS X
Objective-C
2,389
star
7

gizzard

[Archived] A flexible sharding framework for creating eventually-consistent distributed datastores
Scala
2,256
star
8

distributedlog

A high performance replicated log service. (The development is moved to Apache Incubator)
Java
2,224
star
9

recess

A simple and attractive code quality tool for CSS built on top of LESS
CSS
2,187
star
10

commons

Twitter common libraries for python and the JVM (deprecated)
Java
2,099
star
11

iago

A load generator, built for engineers
Scala
1,347
star
12

twitter-text-js

A JavaScript implementation of Twitter's text processing library
1,211
star
13

ambrose

A platform for visualization and real-time monitoring of data workflows
Java
1,181
star
14

twitter-kit-android

Twitter Kit for Android
Java
831
star
15

ostrich

A stats collector & reporter for Scala servers (deprecated)
Scala
773
star
16

twitter-kit-ios

Twitter Kit is a native SDK to include Twitter content inside mobile apps.
Objective-C
690
star
17

twitter-text-rb

A library that does auto linking and extraction of usernames, lists and hashtags in tweets
613
star
18

mysos

Cotton (formerly known as Mysos)
590
star
19

twitter-text-objc

An Objective-C implementation of Twitter's text processing library
587
star
20

torch-autograd

Autograd automatically differentiates native Torch code
Lua
559
star
21

ospriet

An example audience moderation app built on Twitter
JavaScript
408
star
22

cloudhopper-smpp

Efficient, scalable, and flexible Java implementation of the Short Messaging Peer to Peer Protocol (SMPP)
Java
382
star
23

twitter-text-java

A Java implementation of Twitter's text processing library
364
star
24

jvmgcprof

A simple utility for profile allocation and garbage collection activity in the JVM
C
342
star
25

css-flip

A CSS BiDi flipper
JavaScript
313
star
26

clockworkraven

Human-Powered Data Analysis with Mechanical Turk
Ruby
300
star
27

torch-twrl

Torch-twrl is a package that enables reinforcement learning in Torch.
Lua
251
star
28

twemperf

A tool for measuring memcached server performance
C
242
star
29

hdfs-du

Visualize your HDFS cluster usage
JavaScript
230
star
30

pycascading

A Python wrapper for Cascading
Python
222
star
31

RTLtextarea

Automatically detects RTL and configures a text input
JavaScript
169
star
32

haplocheirus

A Redis-backed storage engine for timelines
Scala
133
star
33

standard-project

A slightly more standard sbt project plugin library
Scala
132
star
34

torch-decisiontree

This project implements random forests and gradient boosted decision trees (GBDT). The latter uses gradient tree boosting. Both use ensemble learning to produce ensembles of decision trees (that is, forests).
Lua
129
star
35

elephant-twin

Elephant Twin is a framework for creating indexes in Hadoop
Java
96
star
36

torch-ipc

A set of primitives for parallel computation in Torch
C
95
star
37

torch-distlearn

A set of distributed learning algorithms for Torch
Lua
93
star
38

libcrunch

A lightweight mapping framework that maps data objects to a number of nodes, subject to constraints
Java
92
star
39

scribe

A Ruby client library for Scribe
Ruby
90
star
40

sbt-package-dist

sbt 11 plugin codifying best practices for building, packaging, and publishing
Scala
88
star
41

twisitor

A simple and spectacular photo-tweeting birdhouse
JavaScript
84
star
42

flockdb-client

A Ruby client library for FlockDB
Ruby
81
star
43

code-of-conduct

Open Source Code of Conduct at Twitter
80
star
44

twitter-text-conformance

Conformance testing data for the twitter-text-* repositories
77
star
45

torch-dataset

An extensible and high performance method of reading, sampling and processing data for Torch
Lua
76
star
46

cdk

CDK is a tool to quickly generate single-file html slide presentations from AsciiDoc
CSS
74
star
47

naggati2

Protocol builder for netty using scala (DEPRECATED)
Scala
74
star
48

twitter-kit-unity

Twitter Kit for Unity
C#
71
star
49

plumage.js

Batteries Included App Framework for Data Intensive UIs
JavaScript
66
star
50

gozer

Prototype mesos framework using new low-level API built in Go
Go
61
star
51

bookkeeper

Twitter's fork of Apache BookKeeper (will push changes upstream eventually)
Java
59
star
52

grabby-hands

A JVM Kestrel client that aggregates queues from multiple servers. Implemented in Scala with Java bindings. In use at Twitter for all JVM Search and Streaming Kestrel interactions.
Scala
56
star
53

gizzmo

A command-line client for Gizzard
Ruby
54
star
54

thrift

Twitter's out-of-date, forked thrift
C++
53
star
55

libkestrel

libkestrel
Scala
47
star
56

time_constants

Time constants, in seconds, so you don't have to use slow ActiveSupport helpers
Ruby
47
star
57

sbt-scrooge

An SBT plugin that adds a mixin for doing Thrift code auto-generation during your compile phase
Scala
44
star
58

cli-guide.js

CLI Guide JQuery Plugin
JavaScript
41
star
59

sbt-thrift

sbt rules for generating source stubs out of thrift IDLs, for java & scala
Ruby
38
star
60

jaqen

A type-safe heterogenous Map or a Named field Tuple
Scala
35
star
61

spitball

A very simple gem package generation tool built on bundler
Ruby
33
star
62

torch-thrift

A Thrift codec for Torch
C
29
star
63

jsr166e

JSR166e for Twitter
Java
27
star
64

unishark

Unishark: Another unittest extension for Python
Python
26
star
65

raggiana

A simple standalone Finagle stats viewer
JavaScript
21
star
66

sekhmet

foundational tools and building blocks for gaining insights and diagnosing system health in real-time
20
star
67

periscope-live-engagement-unity-sdk

Periscope Live Engagement Unity SDK
C#
20
star
68

twitterActors

Improved Scala actors library; used internally at Twitter
Scala
19
star
69

finatra-activator-http-seed

Typesafe activator template for constructing a Finatra HTTP server application:
Scala
18
star
70

killdeer

Killdeer is a simple server for replaying a sample of responses to sythentically recreate production response characteristics.
Scala
16
star
71

elephant-twin-lzo

Elephant Twin LZO uses Elephant Twin to create LZO block indexes
Java
15
star
72

bittern

Bittern Cache uses nvdimm to speed up block io operations
C
14
star
73

finatra-activator-thrift-seed

Typesafe activator template for constructing a Finatra Thrift server application: https://twitter.github.io/finatra/user-guide/ —
Scala
11
star
74

chainsaw

A thin Scala wrapper for SLF4J
Scala
10
star
75

PerfTracepoint

Perf tracepoint support for the JVM
Java
7
star
76

oscon-puzzles

OSCON 2014 Puzzle
JavaScript
7
star
77

scala-json

JSON in Scala (deprecated)
Scala
5
star
78

scala-csp-config

A Scala library for configuring Content Security Policy headers for HTTP responses.
Scala
4
star
79

.github

3
star
80

finatra-misc

Miscellaneous libraries and utils used by Finatra
Scala
3
star
81

autolog-clustering

USF Capstone Project for Auto-log Clustering
Python
1
star