Overview
This project is a sandbox and playground of mine for experimenting ideas and potential improvements to Spark SQL. It consists of:
- A parser that parses a small SQL dialect into unresolved logical plans
- A semantic analyzer that resolves unresolved logical plans into resolved ones
- A query optimizer that optimizes resolved query plans into equivalent but more performant ones
- A query planner that turns (optimized) logical plans into executable physical plans
Currently Spear only works with local Scala collections.
Build
Building Spear is as easy as:
$ ./build/sbt package
Run the REPL
Spear has an Ammonite-based REPL for interactive experiments. To start it:
$ ./build/sbt spear-repl/run
Let's create a simple DataFrame of numbers:
@ context range 10 show ()
โโโโ
โidโ
โโโโค
โ 0โ
โ 1โ
โ 2โ
โ 3โ
โ 4โ
โ 5โ
โ 6โ
โ 7โ
โ 8โ
โ 9โ
โโโโ
A sample query using the DataFrame API:
@ context.
range(10).
select('id as 'key, (rand(42) * 100) cast IntType as 'value).
where('value % 2 === 0).
orderBy('value.desc).
show()
โโโโโคโโโโโโ
โkeyโvalueโ
โโโโโผโโโโโโค
โ 5โ 90โ
โ 9โ 78โ
โ 0โ 72โ
โ 1โ 68โ
โ 4โ 66โ
โ 8โ 46โ
โ 6โ 36โ
โ 2โ 30โ
โโโโโงโโโโโโ
Equivalent sample query using SQL:
@ context range 10 asTable 't // Registers a temporary table first
@ context.sql(
"""SELECT * FROM (
| SELECT id AS key, CAST(RAND(42) * 100 AS INT) AS value FROM t
|) s
|WHERE value % 2 = 0
|ORDER BY value DESC
|""".stripMargin
).show()
โโโโโคโโโโโโ
โkeyโvalueโ
โโโโโผโโโโโโค
โ 5โ 90โ
โ 9โ 78โ
โ 0โ 72โ
โ 1โ 68โ
โ 4โ 66โ
โ 8โ 46โ
โ 6โ 36โ
โ 2โ 30โ
โโโโโงโโโโโโ
We can also check the query plan using explain()
:
@ context.
range(10).
select('id as 'key, (rand(42) * 100) cast IntType as 'value).
where('value % 2 === 0).
orderBy('value.desc).
explain(true)
# Logical plan
Sort: order=[$0] โ [?output?]
โ โฐโด$0: `value` DESC NULLS FIRST
โฐโดFilter: condition=$0 โ [?output?]
โ โฐโด$0: ((`value` % 2:INT) = 0:INT)
โฐโดProject: projectList=[$0, $1] โ [?output?]
โ โโด$0: (`id` AS `key`#11)
โ โฐโด$1: (CAST((RAND(42:INT) * 100:INT) AS INT) AS `value`#12)
โฐโดLocalRelation: data=<local-data> โ [`id`#10:BIGINT!]
# Analyzed plan
Sort: order=[$0] โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โฐโด$0: `value`#12:INT! DESC NULLS FIRST
โฐโดFilter: condition=$0 โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โฐโด$0: ((`value`#12:INT! % 2:INT) = 0:INT)
โฐโดProject: projectList=[$0, $1] โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โโด$0: (`id`#10:BIGINT! AS `key`#11)
โ โฐโด$1: (CAST((RAND(CAST(42:INT AS BIGINT)) * CAST(100:INT AS DOUBLE)) AS INT) AS `value`#12)
โฐโดLocalRelation: data=<local-data> โ [`id`#10:BIGINT!]
# Optimized plan
Sort: order=[$0] โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โฐโด$0: `value`#12:INT! DESC NULLS FIRST
โฐโดFilter: condition=$0 โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โฐโด$0: ((`value`#12:INT! % 2:INT) = 0:INT)
โฐโดProject: projectList=[$0, $1] โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โโด$0: (`id`#10:BIGINT! AS `key`#11)
โ โฐโด$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
โฐโดLocalRelation: data=<local-data> โ [`id`#10:BIGINT!]
# Physical plan
Sort: order=[$0] โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โฐโด$0: `value`#12:INT! DESC NULLS FIRST
โฐโดFilter: condition=$0 โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โฐโด$0: ((`value`#12:INT! % 2:INT) = 0:INT)
โฐโดProject: projectList=[$0, $1] โ [`key`#11:BIGINT!, `value`#12:INT!]
โ โโด$0: (`id`#10:BIGINT! AS `key`#11)
โ โฐโด$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
โฐโดLocalRelation: data=<local-data> โ [`id`#10:BIGINT!]