  • Stars: 136
  • Rank: 267,670 (Top 6 %)
  • Language: Scala
  • License: Apache License 2.0
  • Created about 3 years ago
  • Updated 5 months ago

Repository Details

Typesafe wrapper for Apache Spark DataFrame API

Iskra

Iskra is a Scala 3 wrapper library around the Apache Spark API which allows writing typesafe and boilerplate-free, but still efficient, Spark code.

How is it possible to write Spark applications in Scala 3?

Starting with release 3.2.0, Spark is also cross-compiled for Scala 2.13, which makes it possible to use Spark from Scala 3 code, since Scala 3 projects can depend on Scala 2.13 artifacts.
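
For illustration, a plain sbt project could depend on the Scala 2.13 build of Spark from Scala 3 roughly as sketched below. This is only a sketch: the Scala version is an assumption, and the explicit Spark dependency is not needed when you depend on Iskra, which pulls Spark in transitively.

```scala
// build.sbt (sketch; the Scala version below is an assumption)
scalaVersion := "3.3.3"

// CrossVersion.for3Use2_13 makes sbt resolve the _2.13 artifact
// even though the project itself is compiled with Scala 3.
libraryDependencies +=
  ("org.apache.spark" %% "spark-sql" % "3.2.0").cross(CrossVersion.for3Use2_13)
```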

However, one might run into problems when calling a method that requires an implicit instance of Spark's Encoder type. Derivation of Encoder instances relies on the presence of a TypeTag for a given type. TypeTags are no longer generated by the Scala 3 compiler (and there are no plans to support them), so instances of Encoder cannot be automatically synthesized in most cases.

Iskra tries to work around this problem by using its own encoders (unrelated to Spark's Encoder type) generated using Scala 3's new metaprogramming API.
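
To give a feeling for how type class instances can be generated in Scala 3 without TypeTags, here is a small self-contained sketch. It is not Iskra's actual encoder implementation (which is not shown in this README): ToyEncoder, its sqlType method and the Foo case class are hypothetical names, and the sketch uses Mirror-based derivation from the standard library as a stand-in technique.

```scala
import scala.compiletime.{constValue, erasedValue, summonInline}
import scala.deriving.Mirror

// Hypothetical toy encoder: describes a Scala type as a SQL-like type name.
trait ToyEncoder[A]:
  def sqlType: String

object ToyEncoder:
  given ToyEncoder[Int] with
    def sqlType: String = "INT"
  given ToyEncoder[String] with
    def sqlType: String = "STRING"

  // Builds the encoder for a struct from its field names and field types.
  def struct[A](names: List[String], types: List[String]): ToyEncoder[A] =
    new ToyEncoder[A]:
      def sqlType: String =
        names.zip(types).map((n, t) => s"$n: $t").mkString("STRUCT<", ", ", ">")

  // Reads the field names from the tuple of literal label types at compile time.
  inline def labels[T <: Tuple]: List[String] =
    inline erasedValue[T] match
      case _: EmptyTuple => Nil
      case _: (h *: t)   => constValue[h].toString :: labels[t]

  // Summons an encoder for every field type and collects their SQL type names.
  inline def fieldTypes[T <: Tuple]: List[String] =
    inline erasedValue[T] match
      case _: EmptyTuple => Nil
      case _: (h *: t)   => summonInline[ToyEncoder[h]].sqlType :: fieldTypes[t]

  // Derives an encoder for any case class from its compiler-provided Mirror.
  inline given derived[A <: Product](using m: Mirror.ProductOf[A]): ToyEncoder[A] =
    struct[A](labels[m.MirroredElemLabels], fieldTypes[m.MirroredElemTypes])

case class Foo(id: Int, name: String)

// No TypeTag involved: the instance is assembled at compile time.
val fooSchema: String = summon[ToyEncoder[Foo]].sqlType
// fooSchema == "STRUCT<id: INT, name: STRING>"
```

The point of the sketch is only that the compiler itself provides enough information about Foo to build the instance, so no runtime reflection is required.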

How does Iskra make things typesafe and efficient at the same time?

Iskra provides thin (but strongly typed) wrappers around DataFrames, which track types and names of columns at compile time but let Catalyst perform all of its optimizations at runtime.

Iskra uses structural types rather than case classes as data models, which gives us a lot of flexibility (no need to explicitly define a new case class when a column is added/removed/renamed!) but we still get compilation errors when we try to refer to a column which doesn't exist or can't be used in a given context.
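
The idea of structural types as data models can be sketched in plain Scala 3, without Spark or Iskra. The names below (Record, Person, PersonWithCity, describe) are made up for illustration and are not part of Iskra's API; the sketch only shows how fields can be checked at compile time without defining a case class per shape.

```scala
// A toy record backed by a Map; Selectable lets the compiler turn
// `r.someField` into `r.selectDynamic("someField")` plus a cast.
class Record(private val fields: Map[String, Any]) extends Selectable:
  def selectDynamic(name: String): Any = fields(name)
  def updated(name: String, value: Any): Record = Record(fields.updated(name, value))

// The "schema" lives only in the type refinement, not in a case class.
type Person = Record { val name: String; val age: Int }
type PersonWithCity = Record { val name: String; val age: Int; val city: String }

val ada: Person =
  Record(Map("name" -> "Ada", "age" -> 36)).asInstanceOf[Person]

// Adding a "column" only needs a different refinement type.
val adaInLondon: PersonWithCity =
  ada.updated("city", "London").asInstanceOf[PersonWithCity]

// Accepts any record that has at least `name` and `age`.
def describe(p: Person): String = s"${p.name} is ${p.age}"

// ada.city  // would not compile: `city` is not a member of Person
```

Here describe(adaInLondon) compiles because PersonWithCity is a subtype of Person, while accessing a missing field is rejected by the compiler.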

Usage

⚠️ This library is at an early stage of development - the syntax and type hierarchy might still change, the coverage of Spark's API is far from complete and more tests are needed.

  1. Add Iskra as a dependency to your project, e.g.
  • in a file compiled with Scala CLI:
//> using lib "org.virtuslab::iskra:0.0.3"
  • when starting Scala CLI REPL:
scala-cli repl --dep org.virtuslab::iskra:0.0.3
  • in build.sbt in an sbt project:
libraryDependencies += "org.virtuslab" %% "iskra" % "0.0.3"

Iskra is built with Scala 3.1.3 so it's compatible with Scala 3.1.x and newer minor releases (starting from 3.2.0 you'll get code completions for names of columns in REPL and Metals!). Iskra transitively depends on Spark 3.2.0.

  2. Import the basic definitions from the API
import org.virtuslab.iskra.api.*
  3. Get a Spark session, e.g.
given spark: SparkSession =
  SparkSession
    .builder()
    .master("local")
    .appName("my-spark-app")
    .getOrCreate()
  4. Create a typed data frame in either of two ways:
  • by using the toTypedDF extension method on a Seq of case classes, e.g.
Seq(Foo(1, "abc"), Foo(2, "xyz")).toTypedDF
  • by taking a good old (untyped) data frame and calling the typed extension method on it with a type parameter representing a case class, e.g.
df.typed[Foo]

If you need to get back to the unsafe world of untyped data frames for some reason, just call .untyped on a typed data frame.

  5. Follow your intuition of a Spark developer 😉

This library aims to resemble Spark's original API as closely as possible (e.g. by using the same method names), while making the code feel more like regular Scala, without unnecessary boilerplate, and adding a few syntactic improvements.

Most important differences:

  • Refer to columns (also with prefixes specifying the alias for a dataframe in case of ambiguities) simply with $.foo.bar instead of $"foo.bar" or col("foo.bar"). Use backticks when necessary, e.g. $.`column with spaces` .
  • From inside .select(...) or .select{...} you should return a named column or a tuple of named columns. Because of how Scala syntax works, you can simply write .select($.x, $.y) instead of .select(($.x, $.y)). With braces you can compute intermediate values, like:
.select {
  val sum = ($.x + $.y).as("sum")
  ($.x, $.y, sum)
}
  • Syntax for joins looks slightly more like SQL, but with dots and parentheses as for usual method calls, e.g.
foos.innerJoin(bars).on($.foos.barId === $.bars.id).select(...)
  • As you might have noticed above, the aliases for foos and bars were automatically inferred.
  6. For reference, look at the examples and the API docs
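
Putting the steps above together, a minimal end-to-end sketch for Scala CLI might look as follows. It only relies on the Iskra calls shown in this README (toTypedDF, innerJoin, on, select, untyped) plus plain Spark's collect on the untyped data frame. The case classes Foo and Bar, their fields, the helper notesWithTitles and the pinned Scala and JVM versions are assumptions made for illustration, and the sketch has not been checked against other Iskra releases.

```scala
//> using scala 3.3.3
//> using dep "org.virtuslab::iskra:0.0.3"
//> using jvm 11
// The Scala and JVM versions above are assumptions; Spark 3.2 targets Java 8 and 11.

import org.virtuslab.iskra.api.*
import org.apache.spark.sql.SparkSession

// Hypothetical data models used only in this sketch.
case class Foo(id: Int, barId: Int, note: String)
case class Bar(id: Int, title: String)

// Joins two typed data frames and returns (note, title) pairs.
def notesWithTitles()(using spark: SparkSession): Seq[(String, String)] =
  val foos = Seq(Foo(1, 10, "abc"), Foo(2, 20, "xyz")).toTypedDF
  val bars = Seq(Bar(10, "first"), Bar(20, "second")).toTypedDF

  // The aliases `foos` and `bars` are inferred from the names of the vals.
  val joined = foos
    .innerJoin(bars).on($.foos.barId === $.bars.id)
    .select($.foos.note, $.bars.title)

  // Drop back to the untyped Spark API to collect the rows.
  joined.untyped
    .collect()
    .toSeq
    .map(row => (row.getString(0), row.getString(1)))
    .sorted

@main def run(): Unit =
  given spark: SparkSession =
    SparkSession
      .builder()
      .master("local")
      .appName("iskra-sketch")
      .getOrCreate()
  try notesWithTitles().foreach(println)
  finally spark.stop()
```

Referring to a column that does not exist, for example $.foos.missing, is expected to be rejected at compile time rather than failing when the job runs.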

Local development

This project is built using scala-cli, so just use the usual commands with . as the root, i.e. "scala-cli compile ." or "scala-cli test .".

For a more recent version of Usage section look here
