DataStax Connector for Apache Spark to Apache Cassandra

Spark Cassandra Connector

Quick Links

| What | Where |
|------|-------|
| Community | Chat with us at Datastax and Cassandra Q&A |
| Scala Docs | Most Recent Release (3.3.0): Spark-Cassandra-Connector, Spark-Cassandra-Connector-Driver |
| Latest Production Release | 3.3.0 |

Features

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®.

This library lets you expose Cassandra tables as Spark RDDs and Datasets/DataFrames, write Spark RDDs and Datasets/DataFrames to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.
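The three capabilities named above can be sketched in one small program. This is a minimal illustration, not the project's canonical example: the keyspace `test`, the table `kv`, the contact point `127.0.0.1`, and the local Spark master are all assumptions made for the sketch, and it presumes a 3.x connector build and a reachable Cassandra node.

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.SparkSession

object ConnectorSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a Cassandra node is reachable at 127.0.0.1.
    val spark = SparkSession.builder()
      .appName("connector-sketch")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()
    val sc = spark.sparkContext

    // Execute arbitrary CQL through a connector-managed session.
    // The keyspace and table names are hypothetical.
    CassandraConnector(sc.getConf).withSessionDo { session =>
      session.execute(
        "CREATE KEYSPACE IF NOT EXISTS test WITH replication = " +
          "{'class': 'SimpleStrategy', 'replication_factor': 1}")
      session.execute(
        "CREATE TABLE IF NOT EXISTS test.kv (key text PRIMARY KEY, value int)")
    }

    // Write an RDD of tuples to the table.
    sc.parallelize(Seq(("key1", 1), ("key2", 2)))
      .saveToCassandra("test", "kv", SomeColumns("key", "value"))

    // Read the table back as an RDD of CassandraRow objects.
    sc.cassandraTable("test", "kv")
      .collect()
      .foreach(row => println(s"${row.getString("key")} -> ${row.getInt("value")}"))

    // Read the same table as a DataFrame.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test", "table" -> "kv"))
      .load()
    df.show()

    spark.stop()
  }
}
```

When the job is launched with spark-submit, the master and the connection host would normally come from the submit configuration rather than being hard-coded as they are here.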

  • Compatible with Apache Cassandra version 2.1 or higher (see table below)
  • Compatible with Apache Spark 1.0 through 3.3 (see table below)
  • Compatible with Scala 2.11 and 2.12
  • Exposes Cassandra tables as Spark RDDs and Datasets/DataFrames
  • Maps table rows to CassandraRow objects or tuples
  • Offers customizable object mapper for mapping rows to objects of user-defined classes
  • Saves RDDs back to Cassandra by implicit saveToCassandra call
  • Deletes rows and columns from Cassandra by implicit deleteFromCassandra call
  • Joins with a subset of Cassandra data using joinWithCassandraTable call for RDDs, and optimizes joins with data in Cassandra when using Datasets/DataFrames
  • Partitions RDDs according to Cassandra replication using repartitionByCassandraReplica call
  • Converts data types between Cassandra and Scala
  • Supports all Cassandra data types including collections
  • Filters rows on the server side via the CQL WHERE clause
  • Allows for execution of arbitrary CQL statements
  • Plays nice with Cassandra Virtual Nodes
  • Can be used in any language that supports the Datasets/DataFrames API: Python, R, etc.
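Several of the RDD-level features listed above can be sketched together. The example assumes an already configured SparkContext and a hypothetical table `test.kv (key text PRIMARY KEY, value int)`; the object and method names are illustrative only.

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext

object RddFeaturesSketch {
  // Assumption: `sc` is configured with spark.cassandra.connection.host
  // and the hypothetical table test.kv already exists.
  def run(sc: SparkContext): Unit = {
    // Map rows to tuples instead of CassandraRow, reading only two columns.
    val pairs = sc.cassandraTable[(String, Int)]("test", "kv").select("key", "value")
    pairs.collect().foreach { case (k, v) => println(s"$k -> $v") }

    // Filter on the server side: the predicate is added to the CQL WHERE clause.
    val filtered = sc.cassandraTable("test", "kv").where("key = ?", "key1")
    println(s"rows matching key1: ${filtered.count()}")

    // Join an RDD of partition keys with the table. Repartitioning first
    // places each key on a Spark partition local to a replica that owns it.
    val keys = sc.parallelize(Seq(Tuple1("key1"), Tuple1("key2")))
    val joined = keys
      .repartitionByCassandraReplica("test", "kv")
      .joinWithCassandraTable("test", "kv")
    joined.collect().foreach { case (Tuple1(k), row) =>
      println(s"$k joined with value ${row.getInt("value")}")
    }

    // Delete the rows whose primary keys are contained in the RDD.
    sc.parallelize(Seq(Tuple1("key2"))).deleteFromCassandra("test", "kv")
  }
}
```

Joining by key this way fetches only the requested partitions, which is usually cheaper than a full table scan followed by a Spark-side filter when the key set is small relative to the table.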

Version Compatibility

The connector project has several branches, each of which maps to a different set of supported Spark and Cassandra versions. For previous releases, the branch is named "bX.Y", where X.Y is the major and minor version; for example, the "b1.6" branch corresponds to the 1.6 release. The "master" branch normally contains development for the next connector release.

Currently, the following branches are actively supported: 3.3.x (master), 3.2.x (b3.2), 3.1.x (b3.1), 3.0.x (b3.0) and 2.5.x (b2.5).

| Connector | Spark | Cassandra | Cassandra Java Driver | Minimum Java Version | Supported Scala Versions |
|-----------|-------|-----------|-----------------------|----------------------|--------------------------|
| 3.3 | 3.3 | 2.1.5*, 2.2, 3.x, 4.x | 4.13 | 8 | 2.12 |
| 3.2 | 3.2 | 2.1.5*, 2.2, 3.x, 4.0 | 4.13 | 8 | 2.12 |
| 3.1 | 3.1 | 2.1.5*, 2.2, 3.x, 4.0 | 4.12 | 8 | 2.12 |
| 3.0 | 3.0 | 2.1.5*, 2.2, 3.x, 4.0 | 4.12 | 8 | 2.12 |
| 2.5 | 2.4 | 2.1.5*, 2.2, 3.x, 4.0 | 4.12 | 8 | 2.11, 2.12 |
| 2.4.2 | 2.4 | 2.1.5*, 2.2, 3.x | 3.0 | 8 | 2.11, 2.12 |
| 2.4 | 2.4 | 2.1.5*, 2.2, 3.x | 3.0 | 8 | 2.11 |
| 2.3 | 2.3 | 2.1.5*, 2.2, 3.x | 3.0 | 8 | 2.11 |
| 2.0 | 2.0, 2.1, 2.2 | 2.1.5*, 2.2, 3.x | 3.0 | 8 | 2.10, 2.11 |
| 1.6 | 1.6 | 2.1.5*, 2.2, 3.0 | 3.0 | 7 | 2.10, 2.11 |
| 1.5 | 1.5, 1.6 | 2.1.5*, 2.2, 3.0 | 3.0 | 7 | 2.10, 2.11 |
| 1.4 | 1.4 | 2.1.5* | 2.1 | 7 | 2.10, 2.11 |
| 1.3 | 1.3 | 2.1.5* | 2.1 | 7 | 2.10, 2.11 |
| 1.2 | 1.2 | 2.1, 2.0 | 2.1 | 7 | 2.10, 2.11 |
| 1.1 | 1.1, 1.0 | 2.1, 2.0 | 2.1 | 7 | 2.10, 2.11 |
| 1.0 | 1.0, 0.9 | 2.0 | 2.0 | 7 | 2.10, 2.11 |

*Compatible with 2.1.X where X >= 5
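
As a worked reading of the table above, the following sketch encodes only its Connector and Spark columns and looks up which connector lines list a given Spark version. The object `ConnectorCompat` and the method `connectorsFor` are hypothetical helpers written for this illustration; they are not part of the connector.

```scala
object ConnectorCompat {
  // Connector line -> Spark lines, copied from the compatibility table.
  val sparkByConnector: Seq[(String, Set[String])] = Seq(
    "3.3"   -> Set("3.3"),
    "3.2"   -> Set("3.2"),
    "3.1"   -> Set("3.1"),
    "3.0"   -> Set("3.0"),
    "2.5"   -> Set("2.4"),
    "2.4.2" -> Set("2.4"),
    "2.4"   -> Set("2.4"),
    "2.3"   -> Set("2.3"),
    "2.0"   -> Set("2.0", "2.1", "2.2"),
    "1.6"   -> Set("1.6"),
    "1.5"   -> Set("1.5", "1.6"),
    "1.4"   -> Set("1.4"),
    "1.3"   -> Set("1.3"),
    "1.2"   -> Set("1.2"),
    "1.1"   -> Set("1.1", "1.0"),
    "1.0"   -> Set("1.0", "0.9")
  )

  // Connector lines that list the given Spark major.minor version,
  // in table order (newest connector first).
  def connectorsFor(sparkVersion: String): Seq[String] =
    sparkByConnector.collect { case (connector, sparks) if sparks(sparkVersion) => connector }
}
```

For Spark 2.4 the lookup returns three connector lines (2.5, 2.4.2 and 2.4); the Cassandra, Java and Scala columns of the table then narrow the choice.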

Hosted API Docs

API documentation for the Scala and Java interfaces is available online:

  • 3.3.0
  • 3.2.0
  • 3.1.0
  • 3.0.1
  • 2.5.2
  • 2.4.2

Download

This project is available on the Maven Central Repository. For SBT to download the connector binaries, sources and javadoc, put this in your project SBT config:

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.3.0"
  • The default Scala version for Spark 3.0+ is 2.12; please choose the appropriate build. See the FAQ for more information.
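
A fuller build definition might look like the sketch below. The Scala patch version, the Spark version, and the choice to mark Spark as a provided dependency are assumptions for illustration; pick them to match your cluster and the compatibility table.

```scala
// build.sbt (sketch)

// Assumption: a Scala 2.12 build, the default for Spark 3.0+.
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  // Assumption: Spark is supplied by the cluster at runtime, hence "provided".
  "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided",
  // %% appends the Scala binary version, so with the setting above this
  // resolves to the artifact spark-cassandra-connector_2.12.
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.3.0"
)
```

Because `%%` derives the artifact suffix from `scalaVersion`, changing the Scala version in one place selects the matching connector build, provided that build was published for that Scala version.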

Building

See Building And Artifacts

Documentation

Online Training

DataStax Academy

DataStax Academy provides free online training for Apache Cassandra and DataStax Enterprise. In DS320: Analytics with Spark, you will learn how to effectively and efficiently solve analytical problems with Apache Spark, Apache Cassandra, and DataStax Enterprise. You will learn about the Spark API, the Spark Cassandra Connector, Spark SQL, Spark Streaming, and crucial performance optimization techniques.

Community

Reporting Bugs

New issues may be reported using JIRA. Please include all relevant details including versions of Spark, Spark Cassandra Connector, Cassandra and/or DSE. A minimal reproducible case with sample code is ideal.

Mailing List

Questions and requests for help may be submitted to the user mailing list.

Q/A Exchange

The DataStax Community provides a free question-and-answer website for questions about any DataStax-related technology, including the Spark Cassandra Connector. Both DataStax engineers and community members frequent this board and answer questions.

Contributing

To protect the community, all contributors are required to sign the DataStax Spark Cassandra Connector Contribution License Agreement. The process is completely electronic and should only take a few minutes.

To develop this project, we recommend using IntelliJ IDEA. Make sure you have installed and enabled the Scala Plugin. Open the project with IntelliJ IDEA and it will automatically create the project structure from the provided SBT configuration.

Tips for Developing the Spark Cassandra Connector

Checklist for contributing changes to the project:

  • Create a SPARKC JIRA
  • Make sure that all unit tests and integration tests pass
  • Add an appropriate entry at the top of CHANGES.txt
  • If the change has any end-user impacts, also include changes to the ./doc files as needed
  • Prefix the pull request description with the JIRA number, for example: "SPARKC-123: Fix the ..."
  • Open a pull-request on GitHub and await review

Testing

To run unit and integration tests:

./sbt/sbt test
./sbt/sbt it:test

Note that the integration tests require CCM to be installed on your machine. See Tips for Developing the Spark Cassandra Connector for details.

By default, integration tests start up a separate, single Cassandra instance and run Spark in local mode. It is possible to run integration tests with your own Cassandra and/or Spark cluster. First, prepare a jar with testing code:

./sbt/sbt test:package

Then copy the generated test jar to your Spark nodes and run:

export IT_TEST_CASSANDRA_HOST=<IP of one of the Cassandra nodes>
export IT_TEST_SPARK_MASTER=<Spark Master URL>
./sbt/sbt it:test

Generating Documents

To generate the Reference Document, run:

./sbt/sbt spark-cassandra-connector-unshaded/run (outputLocation)

The outputLocation argument is optional and defaults to doc/reference.md.

License

Copyright 2014-2022, DataStax, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
