• Stars: 283
• Rank: 146,066 (top 3%)
• Language: Clojure
• License: Apache License 2.0
• Created over 4 years ago
• Updated 12 months ago

Repository Details

A Clojure dataframe library that runs on Spark

Geni (/gɜni/ or "gurney" without the r) is a Clojure dataframe library that runs on Apache Spark. The name means "fire" in Javanese.

Overview

Geni provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's -> threading macro as the main way to compose Spark's Dataset and Column operations in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing args of mixed types such as columns, strings and keywords in a single function invocation. See the docs section on Geni semantics for more details.
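
As a rough illustration of the threading style, the sketch below expands a pipeline of the same shape as those in the Basic Examples section using macroexpand-1. It relies only on clojure.core: the form is quoted, so the g/ symbols are treated as plain data and neither Geni nor Spark is called. The names threaded-form and expanded-form are made up for this sketch.

```clojure
;; Expand a threaded pipeline without evaluating it. The form is quoted, so the
;; g/ symbols are plain data and Geni does not need to be on the classpath.
(def threaded-form
  '(-> dataframe (g/limit 5) g/show))

;; -> inserts each intermediate result as the first argument of the next form.
(def expanded-form
  (macroexpand-1 threaded-form))

(println expanded-form)
;; prints (g/show (g/limit dataframe 5))
```

Read from the inside out, the expanded form is the nested-call counterpart of a Scala method chain such as dataframe.limit(5).show().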

Resources

Docs Cookbook
  1. Getting Started with Clojure, Geni and Spark
  2. Reading and Writing Datasets
  3. Selecting Rows and Columns
  4. Grouping and Aggregating
  5. Combining Datasets with Joins and Unions
  6. String Operations
  7. Cleaning up Messy Data
  8. Timestamps and Dates
  9. Window Functions
  10. Reading from and Writing to SQL Databases
  11. Avoiding Repeated Computations with Caching
  12. Basic ML Pipelines
  13. Customer Segmentation with NMF

Basic Examples

All examples below use the Statlib California housing prices data available for free on Kaggle.

Spark SQL API for data wrangling:

(require '[zero-one.geni.core :as g])

(def dataframe (g/read-parquet! "test/resources/housing.parquet"))

(g/count dataframe)
=> 5000

(g/print-schema dataframe)
; root
;  |-- longitude: double (nullable = true)
;  |-- latitude: double (nullable = true)
;  |-- housing_median_age: double (nullable = true)
;  |-- total_rooms: double (nullable = true)
;  |-- total_bedrooms: double (nullable = true)
;  |-- population: double (nullable = true)
;  |-- households: double (nullable = true)
;  |-- median_income: double (nullable = true)
;  |-- median_house_value: double (nullable = true)
;  |-- ocean_proximity: string (nullable = true)

(-> dataframe (g/limit 5) g/show)
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |-122.23  |37.88   |41.0              |880.0      |129.0         |322.0     |126.0     |8.3252       |452600.0          |NEAR BAY       |
; |-122.22  |37.86   |21.0              |7099.0     |1106.0        |2401.0    |1138.0    |8.3014       |358500.0          |NEAR BAY       |
; |-122.24  |37.85   |52.0              |1467.0     |190.0         |496.0     |177.0     |7.2574       |352100.0          |NEAR BAY       |
; |-122.25  |37.85   |52.0              |1274.0     |235.0         |558.0     |219.0     |5.6431       |341300.0          |NEAR BAY       |
; |-122.25  |37.85   |52.0              |1627.0     |280.0         |565.0     |259.0     |3.8462       |342200.0          |NEAR BAY       |
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+

(-> dataframe (g/describe :housing_median_age :total_rooms :population) g/show)
; +-------+------------------+------------------+-----------------+
; |summary|housing_median_age|total_rooms       |population       |
; +-------+------------------+------------------+-----------------+
; |count  |5000              |5000              |5000             |
; |mean   |30.9842           |2393.2132         |1334.9684        |
; |stddev |12.969656616832669|1812.4457510408017|954.0206427949117|
; |min    |1.0               |1000.0            |100.0            |
; |max    |9.0               |999.0             |999.0            |
; +-------+------------------+------------------+-----------------+

(-> dataframe
    (g/group-by :ocean_proximity)
    (g/agg {:count        (g/count "*")
            :mean-rooms   (g/mean :total_rooms)
            :distinct-lat (g/count-distinct (g/int :latitude))})
    (g/order-by (g/desc :count))
    g/show)
; +---------------+-----+------------------+------------+
; |ocean_proximity|count|mean-rooms        |distinct-lat|
; +---------------+-----+------------------+------------+
; |INLAND         |1823 |2358.181020296215 |10          |
; |<1H OCEAN      |1783 |2467.5361749859785|7           |
; |NEAR BAY       |1287 |2368.72027972028  |2           |
; |NEAR OCEAN     |107  |2046.1869158878505|2           |
; +---------------+-----+------------------+------------+

(-> dataframe
    (g/select {:ocean :ocean_proximity
               :house (g/struct {:rooms (g/struct :total_rooms :total_bedrooms)
                                 :age   :housing_median_age})
               :coord (g/struct {:lat :latitude :long :longitude})})
    (g/limit 3)
    g/collect)
=> ({:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 880.0, :total_bedrooms 129.0}, 
             :age 41.0},
     :coord {:lat 37.88, :long -122.23}}
    {:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 7099.0, :total_bedrooms 1106.0}, 
             :age 21.0},
     :coord {:lat 37.86, :long -122.22}}
    {:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 1467.0, :total_bedrooms 190.0}, 
             :age 52.0},
     :coord {:lat 37.85, :long -122.24}})

Spark ML example translated from Spark's programming guide:

(require '[zero-one.geni.core :as g])
(require '[zero-one.geni.ml :as ml])

(def training-set
  (g/table->dataset
    [[0 "a b c d e spark"  1.0]
     [1 "b d"              0.0]
     [2 "spark f g h"      1.0]
     [3 "hadoop mapreduce" 0.0]]
    [:id :text :label]))

(def pipeline
  (ml/pipeline
    (ml/tokenizer {:input-col :text
                   :output-col :words})
    (ml/hashing-tf {:num-features 1000
                    :input-col :words
                    :output-col :features})
    (ml/logistic-regression {:max-iter 10
                             :reg-param 0.001})))

(def model (ml/fit training-set pipeline))

(def test-set
  (g/table->dataset
    [[4 "spark i j k"]
     [5 "l m n"]
     [6 "spark hadoop spark"]
     [7 "apache hadoop"]]
    [:id :text]))

(-> test-set
    (ml/transform model)
    (g/select :id :text :probability :prediction)
    g/show)
;; +---+------------------+----------------------------------------+----------+
;; |id |text              |probability                             |prediction|
;; +---+------------------+----------------------------------------+----------+
;; |4  |spark i j k       |[0.1596407738787411,0.8403592261212589] |1.0       |
;; |5  |l m n             |[0.8378325685476612,0.16216743145233883]|0.0       |
;; |6  |spark hadoop spark|[0.0692663313297627,0.9307336686702373] |1.0       |
;; |7  |apache hadoop     |[0.9821575333444208,0.01784246665557917]|0.0       |
;; +---+------------------+----------------------------------------+----------+

More detailed examples can be found here.

Quick Start

Install Geni

Install the geni script to /usr/local/bin with:

wget https://raw.githubusercontent.com/zero-one-group/geni/develop/scripts/geni
chmod a+x geni
sudo mv geni /usr/local/bin/

The geni command downloads the latest Geni uberjar, places it in ~/.geni/geni-repl-uberjar.jar and runs it with java -jar.

Uberjar

Download the latest Geni REPL uberjar from the release page. Run the uberjar as follows:

java -jar <uberjar-name>

The uberjar app prints the default SparkSession instance, starts an nREPL server with an .nrepl-port file for easy text-editor connection, and then steps into a Clojure REPL (REPL-y).
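
For scripts or editor tooling that need the port, the .nrepl-port file can be read back directly. The sketch below assumes the usual nREPL convention that the file holds the port number as plain text; read-nrepl-port is a hypothetical helper and not part of Geni.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn read-nrepl-port
  "Returns the port stored in the .nrepl-port file under dir as a long,
  or nil when no such file exists."
  [dir]
  (let [port-file (io/file dir ".nrepl-port")]
    (when (.exists port-file)
      ;; The file may end with a newline, so trim before parsing.
      (Long/parseLong (str/trim (slurp port-file))))))
```

Calling (read-nrepl-port ".") from the directory where the uberjar was started should then return the port an editor can connect to.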

Leiningen Template

Use Leiningen to create a template of a Geni project:

lein new geni <project-name>

Then cd into the project directory and run lein run. The templated app runs a Spark ML example and then steps into a Clojure REPL (REPL-y) with an .nrepl-port file.

Installation

Add Geni to the :dependencies vector of your project.clj; the current coordinates and version are listed on the Clojars project page.

You also need to add Spark as a provided dependency. For instance, add the following key-value pair to the :profiles map:

:provided
{:dependencies [;; Spark
                [org.apache.spark/spark-avro_2.12 "3.1.1"]
                [org.apache.spark/spark-core_2.12 "3.1.1"]
                [org.apache.spark/spark-hive_2.12 "3.1.1"]
                [org.apache.spark/spark-mllib_2.12 "3.1.1"]
                [org.apache.spark/spark-sql_2.12 "3.1.1"]
                [org.apache.spark/spark-streaming_2.12 "3.1.1"]
                [com.github.fommil.netlib/all "1.1.2" :extension "pom"]
                ;; Arrow
                [org.apache.arrow/arrow-memory-netty "2.0.0"]
                [org.apache.arrow/arrow-memory-core "2.0.0"]
                [org.apache.arrow/arrow-vector "2.0.0"
                 :exclusions [commons-codec com.fasterxml.jackson.core/jackson-databind]]
                ;; Databases
                [mysql/mysql-connector-java "8.0.23"]
                [org.postgresql/postgresql "42.2.19"]
                [org.xerial/sqlite-jdbc "3.34.0"]
                ;; Optional: Spark XGBoost
                [ml.dmlc/xgboost4j-spark_2.12 "1.2.0"]
                [ml.dmlc/xgboost4j_2.12 "1.2.0"]]}

You may also need to install libatlas3-base and libopenblas-base to use a native BLAS, and libgomp1 to train XGBoost4J models. When the optional dependencies are not present, the vars for the corresponding functions (such as ml/xgboost-classifier) are left unbound.
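
One way to guard against an unbound optional var is to check it with bound? before calling it. The sketch below is plain Clojure and does not load Geni: optional-classifier, available-classifier and call-if-bound are hypothetical names, with declare standing in for a var whose optional dependency is missing.

```clojure
;; Hypothetical stand-in for an optional function whose dependency is missing:
;; declare interns the var but leaves it unbound.
(declare optional-classifier)

;; Hypothetical stand-in for a function whose dependency is present.
(defn available-classifier [params]
  (assoc params :kind :classifier))

(defn call-if-bound
  "Calls the function held by var v with args, or returns :unavailable
  when v is unbound."
  [v & args]
  (if (bound? v)
    (apply @v args)
    :unavailable))

(call-if-bound #'optional-classifier {:max-depth 2})
;; => :unavailable
(call-if-bound #'available-classifier {:max-depth 2})
;; => {:max-depth 2, :kind :classifier}
```

With Geni loaded, the same check would presumably read (bound? #'ml/xgboost-classifier), assuming the var is interned but unbound as described above.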

License

Copyright 2020 Zero One Group.

Geni is licensed under Apache License v2.0, see LICENSE.

Mentions

Some parts of the project have been taken from or inspired by:
