• Stars
    star
    138
  • Rank 255,072 (Top 6 %)
  • Language
    Scala
  • Created over 8 years ago
  • Updated almost 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A playground for experimenting ideas that may apply to Spark SQL/Catalyst

Overview

Build Status codecov.io

Codecov.io

This project is a sandbox and playground of mine for experimenting ideas and potential improvements to Spark SQL. It consists of:

  • A parser that parses a small SQL dialect into unresolved logical plans
  • A semantic analyzer that resolves unresolved logical plans into resolved ones
  • A query optimizer that optimizes resolved query plans into equivalent but more performant ones
  • A query planner that turns (optimized) logical plans into executable physical plans

Currently Spear only works with local Scala collections.

Build

Building Spear is as easy as:

$ ./build/sbt package

Run the REPL

Spear has an Ammonite-based REPL for interactive experiments. To start it:

$ ./build/sbt spear-repl/run

Let's create a simple DataFrame of numbers:

@ context range 10 show ()
โ•’โ•โ•โ••
โ”‚idโ”‚
โ”œโ”€โ”€โ”ค
โ”‚ 0โ”‚
โ”‚ 1โ”‚
โ”‚ 2โ”‚
โ”‚ 3โ”‚
โ”‚ 4โ”‚
โ”‚ 5โ”‚
โ”‚ 6โ”‚
โ”‚ 7โ”‚
โ”‚ 8โ”‚
โ”‚ 9โ”‚
โ•˜โ•โ•โ•›

A sample query using the DataFrame API:

@ context.
    range(10).
    select('id as 'key, (rand(42) * 100) cast IntType as 'value).
    where('value % 2 === 0).
    orderBy('value.desc).
    show()
โ•’โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ••
โ”‚keyโ”‚valueโ”‚
โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚  5โ”‚   90โ”‚
โ”‚  9โ”‚   78โ”‚
โ”‚  0โ”‚   72โ”‚
โ”‚  1โ”‚   68โ”‚
โ”‚  4โ”‚   66โ”‚
โ”‚  8โ”‚   46โ”‚
โ”‚  6โ”‚   36โ”‚
โ”‚  2โ”‚   30โ”‚
โ•˜โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•›

Equivalent sample query using SQL:

@ context range 10 asTable 't // Registers a temporary table first

@ context.sql(
    """SELECT * FROM (
      |  SELECT id AS key, CAST(RAND(42) * 100 AS INT) AS value FROM t
      |) s
      |WHERE value % 2 = 0
      |ORDER BY value DESC
      |""".stripMargin
  ).show()
โ•’โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ••
โ”‚keyโ”‚valueโ”‚
โ”œโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚  5โ”‚   90โ”‚
โ”‚  9โ”‚   78โ”‚
โ”‚  0โ”‚   72โ”‚
โ”‚  1โ”‚   68โ”‚
โ”‚  4โ”‚   66โ”‚
โ”‚  8โ”‚   46โ”‚
โ”‚  6โ”‚   36โ”‚
โ”‚  2โ”‚   30โ”‚
โ•˜โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•›

We can also check the query plan using explain():

@ context.
    range(10).
    select('id as 'key, (rand(42) * 100) cast IntType as 'value).
    where('value % 2 === 0).
    orderBy('value.desc).
    explain(true)
# Logical plan
Sort: order=[$0] โ‡’ [?output?]
โ”‚ โ•ฐโ•ด$0: `value` DESC NULLS FIRST
โ•ฐโ•ดFilter: condition=$0 โ‡’ [?output?]
  โ”‚ โ•ฐโ•ด$0: ((`value` % 2:INT) = 0:INT)
  โ•ฐโ•ดProject: projectList=[$0, $1] โ‡’ [?output?]
    โ”‚ โ”œโ•ด$0: (`id` AS `key`#11)
    โ”‚ โ•ฐโ•ด$1: (CAST((RAND(42:INT) * 100:INT) AS INT) AS `value`#12)
    โ•ฐโ•ดLocalRelation: data=<local-data> โ‡’ [`id`#10:BIGINT!]

# Analyzed plan
Sort: order=[$0] โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
โ”‚ โ•ฐโ•ด$0: `value`#12:INT! DESC NULLS FIRST
โ•ฐโ•ดFilter: condition=$0 โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
  โ”‚ โ•ฐโ•ด$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  โ•ฐโ•ดProject: projectList=[$0, $1] โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
    โ”‚ โ”œโ•ด$0: (`id`#10:BIGINT! AS `key`#11)
    โ”‚ โ•ฐโ•ด$1: (CAST((RAND(CAST(42:INT AS BIGINT)) * CAST(100:INT AS DOUBLE)) AS INT) AS `value`#12)
    โ•ฐโ•ดLocalRelation: data=<local-data> โ‡’ [`id`#10:BIGINT!]

# Optimized plan
Sort: order=[$0] โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
โ”‚ โ•ฐโ•ด$0: `value`#12:INT! DESC NULLS FIRST
โ•ฐโ•ดFilter: condition=$0 โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
  โ”‚ โ•ฐโ•ด$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  โ•ฐโ•ดProject: projectList=[$0, $1] โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
    โ”‚ โ”œโ•ด$0: (`id`#10:BIGINT! AS `key`#11)
    โ”‚ โ•ฐโ•ด$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    โ•ฐโ•ดLocalRelation: data=<local-data> โ‡’ [`id`#10:BIGINT!]

# Physical plan
Sort: order=[$0] โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
โ”‚ โ•ฐโ•ด$0: `value`#12:INT! DESC NULLS FIRST
โ•ฐโ•ดFilter: condition=$0 โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
  โ”‚ โ•ฐโ•ด$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  โ•ฐโ•ดProject: projectList=[$0, $1] โ‡’ [`key`#11:BIGINT!, `value`#12:INT!]
    โ”‚ โ”œโ•ด$0: (`id`#10:BIGINT! AS `key`#11)
    โ”‚ โ•ฐโ•ด$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    โ•ฐโ•ดLocalRelation: data=<local-data> โ‡’ [`id`#10:BIGINT!]