Overview
This project is a sandbox and playground of mine for experimenting with ideas and potential improvements to Spark SQL. It consists of:
- A parser that parses a small SQL dialect into unresolved logical plans
- A semantic analyzer that resolves unresolved logical plans into resolved ones
- A query optimizer that optimizes resolved query plans into equivalent but more performant ones
- A query planner that turns (optimized) logical plans into executable physical plans
Currently Spear only works with local Scala collections.
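To make the stage list above more concrete, here is a minimal, self-contained sketch of how four such stages compose into a query compilation pipeline. The trait and method names are illustrative only and are not Spear's actual API.
// A conceptual sketch (not Spear's actual API) of how the four stages compose.
object PipelineSketch {
  sealed trait LogicalPlan   // placeholder for unresolved/resolved/optimized logical plans
  sealed trait PhysicalPlan  // placeholder for executable physical plans

  trait Parser    { def parse(sql: String): LogicalPlan }          // SQL text -> unresolved logical plan
  trait Analyzer  { def resolve(plan: LogicalPlan): LogicalPlan }  // unresolved -> resolved logical plan
  trait Optimizer { def optimize(plan: LogicalPlan): LogicalPlan } // resolved -> optimized logical plan
  trait Planner   { def plan(logical: LogicalPlan): PhysicalPlan } // optimized logical -> physical plan

  // A query flows through the stages in order:
  def compile(
    sql: String, parser: Parser, analyzer: Analyzer, optimizer: Optimizer, planner: Planner
  ): PhysicalPlan =
    planner.plan(optimizer.optimize(analyzer.resolve(parser.parse(sql))))
}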
Build
Building Spear is as easy as:
$ ./build/sbt package
Run the REPL
Spear has an Ammonite-based REPL for interactive experiments. To start it:
$ ./build/sbt spear-repl/run
Let's create a simple DataFrame of numbers:
@ context range 10 show ()
╒══╕
│id│
╞══╡
│ 0│
│ 1│
│ 2│
│ 3│
│ 4│
│ 5│
│ 6│
│ 7│
│ 8│
│ 9│
╘══╛
A sample query using the DataFrame API:
@ context.
range(10).
select('id as 'key, (rand(42) * 100) cast IntType as 'value).
where('value % 2 === 0).
orderBy('value.desc).
show()
╒═══╤═════╕
│key│value│
╞═══╪═════╡
│  5│   90│
│  9│   78│
│  0│   72│
│  1│   68│
│  4│   66│
│  8│   46│
│  6│   36│
│  2│   30│
╘═══╧═════╛
An equivalent sample query using SQL:
@ context range 10 asTable 't // Registers a temporary table first
@ context.sql(
"""SELECT * FROM (
| SELECT id AS key, CAST(RAND(42) * 100 AS INT) AS value FROM t
|) s
|WHERE value % 2 = 0
|ORDER BY value DESC
|""".stripMargin
).show()
╒═══╤═════╕
│key│value│
╞═══╪═════╡
│  5│   90│
│  9│   78│
│  0│   72│
│  1│   68│
│  4│   66│
│  8│   46│
│  6│   36│
│  2│   30│
╘═══╧═════╛
We can also check the query plan using explain():
@ context.
range(10).
select('id as 'key, (rand(42) * 100) cast IntType as 'value).
where('value % 2 === 0).
orderBy('value.desc).
explain(true)
# Logical plan
Sort: order=[$0] ⇒ [?output?]
│ ╰╴$0: `value` DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [?output?]
  │ ╰╴$0: ((`value` % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [?output?]
    │ ├╴$0: (`id` AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:INT) * 100:INT) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]
# Analyzed plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(CAST(42:INT AS BIGINT)) * CAST(100:INT AS DOUBLE)) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]
# Optimized plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]
# Physical plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]
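Comparing the analyzed and optimized plans above, the optimizer folded the literal casts the analyzer introduced: CAST(42:INT AS BIGINT) became 42:BIGINT and CAST(100:INT AS DOUBLE) became 100.0:DOUBLE. Below is a toy, self-contained sketch of that kind of constant-folding rewrite over a tiny expression tree; it is illustrative only and does not use Spear's actual expression or rule classes.
// A toy constant-folding pass (illustrative only, not Spear's actual optimizer rule).
object ConstantFoldingSketch {
  sealed trait Expr
  case class IntLit(value: Int)           extends Expr
  case class DoubleLit(value: Double)     extends Expr
  case class CastToDouble(child: Expr)    extends Expr
  case class Mul(left: Expr, right: Expr) extends Expr

  // Bottom-up rewrite: evaluate any subtree whose inputs are already literals.
  def fold(e: Expr): Expr = e match {
    case CastToDouble(child) => fold(child) match {
      case IntLit(i) => DoubleLit(i.toDouble)   // e.g. CAST(100 AS DOUBLE) -> 100.0
      case other     => CastToDouble(other)
    }
    case Mul(left, right) => (fold(left), fold(right)) match {
      case (DoubleLit(a), DoubleLit(b)) => DoubleLit(a * b)
      case (l, r)                       => Mul(l, r)
    }
    case literal => literal
  }
}
Since the physical plan is derived from the optimized logical plan, it inherits the folded literals, as shown in the output above.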