• Stars
    star
    180
  • Rank 213,097 (Top 5 %)
  • Language
    CoffeeScript
  • Created over 8 years ago
  • Updated over 5 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Ultra Tiny Compiler

Ultra Tiny Compiler

This is Ultra Tiny Compiler for any C-like expression into lisp.
If remove all comments, source code will be less then <90 lines of actual code. By the way, you are viewing source code itself (yes, this readme file also is source code). It's written in literate coffescript and fully tested Build Status and published on npm. You can install it via npm install ultra-tiny-compiler.

What this compiler can do?

Examples
Input Output
1 + 2 * 3 (+ 1 (* 2 3))
1 + exp(i * pi) (+ 1 (exp (* i pi)))
pow(1 + 1 / n, n) (pow (+ 1 (/ 1 n)) n)

Theory

We begin by describing grammar of C-like expressions, more precisely context-free grammar. A context-free grammar has four components:

  1. A set of terminal symbols (tokens).
  2. A set of nonterminals (syntactic variables).
  3. A set of productions, where each production consists of a nonterminal, called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals, called the body or right side of the production.
  4. A designation of one of the nonterminals as the start symbol.

For notational convenience, productions with the same nonterminal as the head can have their bodies grouped by symbol |.

Here is our grammar for C-like expressions:

expr โ†’ expr + term | expr - term | term
term โ†’ term * factor | term / factor | factor
factor โ†’ ( expr ) | atom | call
call โ†’ atom ( list )
list โ†’ list , expr | expr
atom โ†’ [a-z0-9]+

Here expr, term, factor, ... a nonterminals and +, -, *, /, (, ), ... a terminals, expr is a start symbol.

Usually compiler consist of several parts: lexer, parser, code generator. Lexer reads input stream, split it into tokens and pass them into parser. Parser uses grammar to build parse tree and generate abstract syntax tree (AST for short). AST resemble parse tree to an extent. However, in AST interior nodes represent programming constructs while in the parse tree, the interior nodes represent nonterminals. Code generator traverses AST and generates code.

But we a going to simplify our compiler and combine parse phase with code generation phase. To accomplish this we will use syntax-directed definitions (semantic actions) for translating expressions into postfix notation.

expr โ†’ expr + term {puts("+")}
expr โ†’ expr - term {puts("-")}
atom โ†’ [a-z0-9]+ {puts(atom)}
...

Expression 9-5+2 will be translating into 95-2+ with this semantic actions.

For parsing we will use simple form of recursive-descent parsing, called predictive parsing, in which the lookahead symbol unambiguously determines the flow of control through the procedure body for each nonterminal. It is possible for a recursive-descent parser to loop forever. A problem arises with left-recursive productions like expr โ†’ expr + term.

A left-recursive production can be eliminated by rewriting the offending production:

A โ†’ A โบ | ฮฒ

Into right-recursive production:

A โ†’ ฮฒ R
R โ†’ โบ R | โˆ…

Using this technique we can transform our grammer into new grammar:

expr โ†’ term exprR
exprR โ†’ + term exprR | - term exprR | โˆ…
term โ†’ factor termR
termR โ†’ * factor termR | / factor termR | โˆ…
factor โ†’ ( expr ) | atom | call
call โ†’ atom ( list )
list โ†’ expr listR
listR โ†’ , expr listR | โˆ…
atom โ†’ [a-z0-9]+

Semantic actions embedded in the production are simply carried along in the transformation, as if they were terminals.

Okay, let's write some code. We need only one function compile(input: string) which will do all the work.

compile = (input) -> 

First start with lexer what will be splitting input sequence of characters into tokens.
For example next input string pow(1, 2 + 3) will be transformed into array ['pow', '(', '1', ',', '2', '+', '3', ')'].

  tokens = []

Every of our tokens will be one character, only atoms can be any length of letters or numbers.

  is_atom = /[a-z0-9]/i

Iterate throw characters of input. Note what inside this loop may be another loop which also increments i value.

  i = 0
  while i < input.length 
    switch char = input[i]

When meet one of next character, put it as token and continue to next character.

      when "+", "-", "*", "/", "(", ")", ","
        tokens.push char
        i++

Skip whitespaces.

      when " "
        i++

If character is unknown,

      else

Loop through each character in sequence until we encounter a character that is not an atom.

        if is_atom.test char
          tokens.push do ->
            value = ''
            while char and is_atom.test char
              value += char
              char = input[++i]
            value

Throw error on an unknown character.

        else
          throw "Unknown input char: #{char}"

We need a function which will be giving us a token from a tokens stream. Let's call it next.

  next = -> tokens.shift()

Lookahead is very important part of our compiler. Allows to determine what kind of production we are hitting next.

  lookahead = next()

Another important function which can match given terminal with lookahead terminal and move lookahead to next token.

  match = (terminal) ->
    if lookahead == terminal then lookahead = next()
    else throw "Syntax error: Expected token #{terminal}, got #{lookahead}"

Sometimes we a going to hit into wrong production, and we need a function which allows us to return to previous state.

  recover = (token) ->
    tokens.unshift lookahead
    lookahead = token

Next, we a going to write our production rules. Each nonterminal represents corresponding function call, each terminal represents match function call. Also, we omitted โˆ… production.
expr function represents next production rule:
expr โ†’ term exprR

  expr = ->
    term(); exprR()

Will be using lookahead for determine which production to use. Here also our first semantic actions which puts + or - onto stack. Ensure preserve ordering of semantic actions.
exprR โ†’ + term {puts("+")} exprR | - term {puts("-")} exprR | โˆ…

  exprR = ->
    if lookahead == "+"
      match("+"); term(); puts("+"); exprR()
    else if lookahead == "-"
      match("-"); term(); puts("-"); exprR()

term โ†’ factor termR

  term = ->
    factor(); termR()

termR โ†’ * factor {puts("*")} termR | / factor {puts("/")} termR | โˆ…

  termR = -> 
    if lookahead == "*"
      match("*"); factor(); puts("*"); termR()
    else if lookahead == "/"
      match("/"); factor(); puts("/"); termR()

Next goes tricky production rule. First, we lookahead if there (, which will mean what current we at expression in brackets. Second, try use production rule for function call. Third, if function call production fails, consider current token as an atom.
factor โ†’ ( expr ) | atom | call

  factor = ->
    if lookahead == "("
      match("("); expr(); match(")")
    else 
      atom() unless call() 

Call production should allow recovering if we chose it incorrectly.
Remember current token, and move to next token with match. If after goes an (, when we choose call production correctly, otherwise recover to previous state. Then we hitting list of arguments, puts special mark onto stack, this allows us to know how many arguments contains in this list.
call โ†’ atom ( list )

  call = ->
    token = lookahead
    match(lookahead)
    if lookahead == "("
      puts(false, "mark"); match("("); list(); match(")"); puts(token, "call")
      true
    else
      recover(token)
      false

list โ†’ expr listR

  list = ->
    expr(); listR();

listR โ†’ , expr listR | โˆ…

  listR = ->
    if lookahead == ","
      match(","); expr(); listR();

At the bottom lies atom production. If we went down to this production, we expecting to get an atom.
Otherwise, it's some syntax error.
atom โ†’ [a-z0-9]+

  atom = ->
    if is_atom.test lookahead
      match(puts(lookahead, "atom"))
    else
      throw "Syntax error: Unexpected token #{lookahead}"

In semantic rules, we use puts function, which records operators and atoms into stack in reverse polish notation (RPN).
But instead recording entire program in RPN, we are going to do code generation on the fly. For that, we must understand how to postfix algorithm works.

For example, we have a stream of tokens: 1, 2, 3, +, -

  1. Put 1 onto stack.
  2. Put 2 onto stack.
  3. Put 3 onto stack.
  4. When + pop two values (3,2) from stack and put (+ 2 3) onto stack.
  5. When - pop two values ((+ 2 3),1) from stack and put (- 1 (+ 2 3)) onto stack back.

Generated code will be on top of stack and stack size will be one, if stream was complete.

  stack = []
  puts = (token, type="op") ->
    switch type

Then operators comes in, pop two values from stack, generate code for that operator and push generated code back into stack.

      when "op"
        op = token
        y = stack.pop()
        x = stack.pop()
        stack.push "(#{op} #{x} #{y})"

Do same thing for call, but instead of gathering two values from stack, take all values, until false shows up from stack. false represents special mark to know their arguments ends.

      when "call"
        func = token
        args = []
        while arg = stack.pop()
          break unless arg
          args.unshift arg
        stack.push "(#{func} #{args.join(' ')})"

Any other atoms or marks push to stack as it appears.

      else 
        stack.push token

Return token from puts function for reuse.

    token

At this point, we described all function what we need, and now we can start parsing and compilation at same time.

  expr()

If parsing pass well, we end up with stack with only one item in it. It's our compiled code. Let's return it from the function.

  stack[0]

Expose the world.

module.exports = compile

What's it. We just wrote out compiler.

More Repositories

1

fx

Terminal JSON viewer
Go
16,770
star
2

expr

Expression language and expression evaluation for Go
Go
3,910
star
3

monkberry

Monkberry is a JavaScript library for building web user interfaces
JavaScript
1,496
star
4

codejar

An embeddable code editor for the browser ๐Ÿฏ
TypeScript
1,489
star
5

red

Terminal log analysis tools
Go
1,436
star
6

llama

Terminal file manager
Go
1,421
star
7

finder

CSS Selector Generator ๐Ÿ—บ
HTML
1,014
star
8

countdown

Terminal countdown timer
Go
948
star
9

numbr

Notepad + calculator
TypeScript
319
star
10

eat

Eats anything, spits out JSON ๐Ÿง€
JavaScript
287
star
11

console

Web PHP Console
PHP
268
star
12

gofx

๐Ÿพ fx-like command-line JSON processing tool
Go
233
star
13

jsize

Find out minified and gzipped npm package size
JavaScript
177
star
14

fx-completion

Bash completion for fx
JavaScript
168
star
15

watch

watch tool rewritten in go
Go
153
star
16

spark

GitHub Stars Sparklines โšก๏ธ
JavaScript
129
star
17

cherimola

A very useful things.
PHP
108
star
18

silicone-skeleton

Silicone Skeleton is Silex Framework Edition Skeleton.
PHP
106
star
19

purephp

PurePHP Key-Value Storage
PHP
88
star
20

tinysh

A tiny spawn wrapper for Node.js
JavaScript
57
star
21

chat

PHP Chat Example
JavaScript
54
star
22

fast-json

Fast extraction of part of JSON
JavaScript
52
star
23

ll

Opinionated ls rewrite in Go ๐Ÿงฆ
Go
41
star
24

golang-expression-evaluation-comparison

Go expression evaluation comparison
Go
39
star
25

asciitree

Draw vertical ASCII tree
JavaScript
34
star
26

homer

Internet search engine on React PHP
JavaScript
28
star
27

damka

Russian checkers game
Go
17
star
28

svg-embed

Embed SVG code into DOM. 600 Bytes (gzip)
HTML
16
star
29

year

All unix epoch dates
JavaScript
13
star
30

silicone

Silicone - Organic Silex Framework Edition
PHP
10
star
31

gatsby-source-google-analytics-reporting-api

Gatsby source for Google Anatytics Reporting API
JavaScript
10
star
32

list

Immutable lists in JavaScript without [] and {}
JavaScript
8
star
33

is-it-cloudy

Command line tool to printing weather info ๐ŸŒฆ
JavaScript
8
star
34

prettyjson

๐Ÿงข Pretty print JSON
JavaScript
8
star
35

fx-theme-monokai

Monokai theme for fx
JavaScript
8
star
36

granula

Granula ORM
PHP
7
star
37

kot

๐ŸฑIt's a kot!
JavaScript
7
star
38

fx-theme-night

Night theme for fx
JavaScript
6
star
39

mustcheck

Must & Check
Go
5
star
40

find-npm-name

Find available npm name
JavaScript
5
star
41

lazy-chain

lazy-chain is a JavaScript utility library for ES6
JavaScript
5
star
42

morrow

A text-based role-paying game
TypeScript
3
star
43

sshlogger

SSH Logger
Go
3
star
44

numbr.dev

Numbr Private Code
1
star
45

tto

Tic-Tac-Toe game buid with @medv/list
JavaScript
1
star