  • Stars: 129
  • Rank: 269,201 (Top 6 %)
  • Language: Scala
  • License: Apache License 2.0
  • Created over 2 years ago
  • Updated 12 days ago


Repository Details

Typesafe wrapper for Apache Spark DataFrame API

Iskra

Iskra is a Scala 3 wrapper library around the Apache Spark API that allows writing typesafe and boilerplate-free, yet still efficient, Spark code.

How is it possible to write Spark applications in Scala 3?

Starting with release 3.2.0, Spark is also cross-compiled for Scala 2.13, which opens the way to using Spark from Scala 3 code, since Scala 3 projects can depend on Scala 2.13 artifacts.

However, you may run into problems when calling a method that requires an implicit instance of Spark's Encoder type. Derivation of Encoder instances relies on the presence of a TypeTag for a given type, but the Scala 3 compiler no longer generates TypeTags (and there are no plans to support this), so Encoder instances cannot be synthesized automatically in most cases.

Iskra tries to work around this problem by using its own encoders (unrelated to Spark's Encoder type) generated using Scala 3's new metaprogramming API.
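To give a feel for what TypeTag-free derivation can look like, here is a minimal sketch in plain Scala 3 (no Spark or Iskra dependency). It is not Iskra's actual encoder implementation: SqlType, labelsOf, sqlTypeNames and schemaOf are hypothetical names invented for this illustration, and the type names are chosen only for readability. The sketch reads the field names and field types of a case class from the compiler-generated Mirror at compile time.

```scala
import scala.compiletime.{constValue, erasedValue, summonInline}
import scala.deriving.Mirror

// Hypothetical type class (not part of Iskra or Spark):
// associates a Scala type with a column type name.
final case class SqlType[A](name: String)

object SqlType:
  given SqlType[Int]     = SqlType("integer")
  given SqlType[String]  = SqlType("string")
  given SqlType[Boolean] = SqlType("boolean")

// Turns the tuple of singleton label types into a list of field names.
inline def labelsOf[T <: Tuple]: List[String] =
  inline erasedValue[T] match
    case _: EmptyTuple => Nil
    case _: (h *: t)   => constValue[h].toString :: labelsOf[t]

// Summons a SqlType for every field type; an unsupported field type
// is reported as a compilation error, not a runtime failure.
inline def sqlTypeNames[T <: Tuple]: List[String] =
  inline erasedValue[T] match
    case _: EmptyTuple => Nil
    case _: (h *: t)   => summonInline[SqlType[h]].name :: sqlTypeNames[t]

// Hypothetical helper: (column name, column type) pairs for a case class,
// obtained from its Mirror rather than from a TypeTag.
inline def schemaOf[A](using m: Mirror.ProductOf[A]): List[(String, String)] =
  labelsOf[m.MirroredElemLabels].zip(sqlTypeNames[m.MirroredElemTypes])

case class Foo(id: Int, name: String)

@main def schemaDemo(): Unit =
  println(schemaOf[Foo]) // List((id,integer), (name,string))
```

Because everything is resolved during inlining, a case class with a field type that has no SqlType instance fails to compile, which is the kind of guarantee a TypeTag-based runtime derivation cannot give.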

How does Iskra make things typesafe and efficient at the same time?

Iskra provides thin (but strongly typed) wrappers around DataFrames, which track types and names of columns at compile time but let Catalyst perform all of its optimizations at runtime.

Iskra uses structural types rather than case classes as data models, which gives us a lot of flexibility (no need to explicitly define a new case class when a column is added/removed/renamed!) but we still get compilation errors when we try to refer to a column which doesn't exist or can't be used in a given context.
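The language feature behind this is Scala 3's structural types built on Selectable. The following sketch uses plain Scala 3 only and does not show Iskra's real types: Record, Person and PersonWithCity are hypothetical names made up for the example. It illustrates how a refinement type can list the available "columns" so that the compiler checks member access, while the values themselves are looked up by name at runtime.

```scala
// Hypothetical stand-in for a row: values are looked up by name at runtime.
class Record(values: Map[String, Any]) extends Selectable:
  def selectDynamic(name: String): Any = values(name)

// A structural (refinement) type states which members exist and their types.
type Person = Record { val name: String; val age: Int }

// "Adding a column" only needs a wider refinement, not a new case class.
type PersonWithCity = Person { val city: String }

val ada: PersonWithCity =
  Record(Map("name" -> "Ada", "age" -> 36, "city" -> "London"))
    .asInstanceOf[PersonWithCity]

@main def structuralDemo(): Unit =
  println(ada.name)    // statically typed as String
  println(ada.age + 1) // statically typed as Int
  println(ada.city)
  // ada.email         // does not compile: email is not a member of PersonWithCity
```

The cast in this sketch is unchecked, so it is only safe if the map really contains the listed fields; a library can hide that step behind an API that constructs the refinement type from known column information.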

Usage

⚠️ This library is in its early stage of development - the syntax and type hierarchy might still change, the coverage of Spark's API is far from being complete and more tests are needed.

  1. Add Iskra as a dependency to your project, e.g.
  • in a file compiled with Scala CLI:
//> using lib "org.virtuslab::iskra:0.0.3"
  • when starting Scala CLI REPL:
scala-cli repl --dep org.virtuslab::iskra:0.0.3
  • in build.sbt in an sbt project:
libraryDependencies += "org.virtuslab" %% "iskra" % "0.0.3"

Iskra is built with Scala 3.1.3 so it's compatible with Scala 3.1.x and newer minor releases (starting from 3.2.0 you'll get code completions for names of columns in REPL and Metals!). Iskra transitively depends on Spark 3.2.0.

  2. Import the basic definitions from the API
import org.virtuslab.iskra.api.*
  3. Get a Spark session, e.g.
given spark: SparkSession =
  SparkSession
    .builder()
    .master("local")
    .appName("my-spark-app")
    .getOrCreate()
  4. Create a typed data frame in either of two ways:
  • by using the toTypedDF extension method on a Seq of case classes, e.g.
Seq(Foo(1, "abc"), Foo(2, "xyz")).toTypedDF
  • by taking a good old (untyped) data frame and calling the typed extension method on it with a type parameter representing a case class, e.g.
df.typed[Foo]

If you need to get back to the unsafe world of untyped data frames for some reason, just call .untyped on a typed data frame.

  5. Follow your intuition of a Spark developer 😉

This library aims to resemble the original Spark API as closely as possible (e.g. by using the same method names) while making the code feel more like regular Scala, without unnecessary boilerplate and with some syntactic improvements.

Most important differences:

  • Refer to columns (also with prefixes specifying the alias for a dataframe in case of ambiguities) simply with $.foo.bar instead of $"foo.bar" or col("foo.bar"). Use backticks when necessary, e.g. $.`column with spaces` .
  • From inside .select(...) or .select{...} you should return a named column or a tuple of named columns. Because of how Scala syntax works, you can simply write .select($.x, $.y) instead of .select(($.x, $.y)). With braces you can compute intermediate values, like
.select {
  val sum = ($.x + $.y).as("sum")
  ($.x, $.y, sum)
}
  • Syntax for joins looks slightly more like SQL, but with dots and parentheses as for usual method calls, e.g.
foos.innerJoin(bars).on($.foos.barId === $.bars.id).select(...)
  • As you might have noticed above, the aliases for foos and bars were automatically inferred
  6. For reference, look at the examples and the API docs

Local development

This project is built using scala-cli, so just use the usual commands with . as the root, for example scala-cli compile . or scala-cli test . (the trailing dot is the project root).

For a more recent version of Usage section look here
