• Stars
    star
    132
  • Rank 272,626 (Top 6 %)
  • Language
    C
  • License
    GNU General Publi...
  • Created over 10 years ago
  • Updated almost 7 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

CC500: a tiny self-hosting C compiler

cc500

CC500: a tiny self-hosting C compiler

http://homepage.ntlworld.com/edmund.grimley-evans/cc500/

CC500: a tiny self-hosting C compiler

Introduction

I wrote this tiny compiler, which translates a subset of C into x86 machine code, for fun. It has no use, unless it counts as educational. I called it CC500 because I initally guessed it would take about 500 lines. It turned out to be about 600 lines even without the comments and blank lines. With the comments and blank lines it has about 750 lines. It could be made shorter, but I wanted the code to be clear and simple rather than obfuscated.

Download

cc500.c - just the code Compilation:

gcc cc500.c -o cc500_1 Self-compilation:

./cc500_1 < cc500.c > cc500_2 chmod +x cc500_2 How it works

The compiler reads the whole program in C from stdin and writes an ELF binary for Linux/x86 to stdout. The ELF binary is somewhat non-standard as it consists of a single section that is both writable and executable. It seems to work for me on a real Linux box, but it might cause problems on an emulator.

No libraries are used. Tiny, machine-code implementations of exit(), getchar(), malloc() and putchar() are included in the binary that is generated. There is no free() as malloc() is just sbrk(), really, implemented with two calls to brk().

There is almost no error-checking. With some invalid C programs the compiler calls exit(1) instead of generating a binary, but in most cases it generates a binary that crashes when you run it.

There is no preprocessor. The lexer looks for tokens that match one of the following Perl regular expressions: /[a-z0-9_]+/, /[<=>|&!]+/, /'[^']'/, /"[^"]"/, /./. Traditional C-style comments are handled, but not C++-style comments.

The symbol table is implemented with a single array of char. The type of each symbol is recorded as 'D' (defined global), 'U' (undefined global), 'L' (local variable) or 'A' (argument of a function) - see sym_get_value(). There is also an address recorded for each symbol, but no type, as there is no type-checking and all arrays are assumed to be arrays of char. The address of an 'L' or 'A' is its position on the stack. The address of a 'D' is the position of that function or global variable in the binary. The address of a 'U' is also a pointer into the binary, but a pointer to where the symbol was last used and the head of a linked list of all the forward references to that symbol.

Scoping rules for local variables are implemented correctly as it is easy to do so: at the end of a compound statement the pointer to the end of the symbol table is restored to what it was at the start of the compound statement - see the assignment to table_pos in statement().

I started off writing CC500 without local variables, but it felt very unnatural. Nested scopes are an essential part of C and it is not easy to write a recursive-descent parser without automatic variables.

It is assumed that all variables, arguments and return values are char * or int, though the compiler does not even parse the types, let alone record them in the symbol table. The functions that implement the recursive-descent parser return 1, 2 or 3, corresponding to a "char lval", "int lval" or "other" - see promote().

Each parsing function is preceded by a comment that documents the syntactic structures recognised, in an obvious notation. In some cases the code accepts additional, incorrect structures in addition to what is documented in the comment. Only a tiny subset of C's operators are accepted: just the ones I needed while writing the compiler. There are no unary operators: instead of the indirection operator * you must use x[y], where it is assumed that x is char * and y is int. The only statements recognised are compound statements, expressions, if, while and return. Local variable declarations, with initialisers, are treated as a kind of statement. (If you write something like if (x) int y; the compiler will not complain, but the binary will crash.)

There are no arbitary limits on the length of symbols, number of symbols are the like: all three buffers (token, table and code) are supposed to be automatically extended as required by calls to my_realloc().

Extending it

Some simple and obvious enhancements would be to print an appropriate message message when an error is detected, check for undefined globals, and allow main() to be anywhere in the file.

It would be easy to extend the compiler to handle the remaining operators and statement types of C. To implement break and continue you would probably add additional arguments to the parsing functions that record the address and stack height of the current loop. The easiest way to implement switch statements might be to add an argument that records the location of the previous case label in the current switch statement; a case label would then be translated with a comparison and a branch forwards to the next case label. In general, C is quite amenable to direct translation into machine code without going through an IR (intermediate representation).

Extending the type system would be a bit more challenging. You could correctly implement scalars and pointers, with type checking, by adding a single integer field to the symbol table to record the type. However, if you wanted to implement structs, or function pointers with proper type checking, you would need a recursive data structure to represent the types. It is interesting, but not really surprising, that you start to start to need structs in the implementation language at roughly the same point that you start to implement them. However, C's types get interesting even without structs, unions and typedefs. For example, void (*(f())(void ()(void)))(void); declares a function that returns a pointer to a function that takes a pointer to a function as an argument and returns another pointer to a function. I must admit to finding that part of C's syntax rather difficult to parse mentally. I'm glad we have computers to do it for us.

Links

bcompiler - Bootstrapping a simple compiler from nothing OTCC - The smallest self compiling pseudo C compiler Tiny ELF programs by Brian Raiter Edmund Grimley Evans

More Repositories

1

qbe

A Quick Backend
C
176
star
2

fbui

Framebuffer UI (fbui) in-kernel Linux windowing system.
C
128
star
3

scc

scc - simple C Compiler
C
38
star
4

pijul

Pijul is a free and open source version control system, intended to be simple to use, yet based on a sound theory of collaborative work, and using fast algorithms. Pijul gathers most of the flame-war-features of version control systems: it has branches, first-class patches and snapshots, is distributed (yet can be used in a centralized way), and even has alternative implementations in different programming languages!
OCaml
32
star
5

ff

A Simple Forth System for Linux on i386 and ARM CPUs
Forth
30
star
6

bcpl

BCPL is a simple typeless language that was designed in 1966 by Martin Richards
Limbo
27
star
7

asif

A Self-Interpreter for F-omega
Haskell
16
star
8

otcc

Obfuscated Tiny C Compiler
C
15
star
9

ged

simple gtk text editor written in c
C
10
star
10

miniPicoLisp

A kind of "pure" PicoLisp (not "pure Lisp"!).
C
10
star
11

pilos

PilOS - A Stand-Alone Operating System
PicoLisp
9
star
12

hsq

Higher Subleq is a simplified typeless C language.
C++
9
star
13

Bcachefs

Bcache is a Linux kernel block layer cache. It allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.
C
9
star
14

vlink

A portable linker for multiple file formats.
C
8
star
15

myrddin

C
8
star
16

HeavyThing

https://2ton.com.au/HeavyThing/
Assembly
7
star
17

tccboot

TCCBOOT: TinyCC Boot Loader
C
6
star
18

wind

Wind is a window manager for the X Window System
C
6
star
19

linoleum

HTML
5
star
20

spin

Spin is a popular open-source software verification tool
C
5
star
21

nhc98

nhc98 is a small, easy to install, standards-compliant compiler for Haskell 98, the lazy functional programming language.
C
4
star
22

lfscript

Linux From Script (or 'LFScript') is an unofficial alternative for 'Automated Linux From Scratch
Shell
4
star
23

kalmar

Kalmar : An open source C++ compiler for heterogeneous devices
C++
4
star
24

codisasm

Assembly
4
star
25

ted

Ted is a small and powerful text editor for X window
C
4
star
26

openvmtil

https://code.google.com/p/openvmtil/
C
3
star
27

AutoCorres

AutoCorres : Automatic Specification Abstraction
Isabelle
3
star
28

qscm

A tiny bootstrapped Scheme
C
3
star
29

aoeui

Automatically exported from code.google.com/p/aoeui
C
3
star
30

KaarPux

Linux build from source
Shell
3
star
31

cmm

You would be much happier with one portable assembly language that could be generated by a front end and implemented by any of several code generators. C minus minus is that language.
Assembly
3
star
32

xedit

Simple text editor for X
C
3
star
33

b0

http://b0language.sourceforge.net/
C
3
star
34

ravm

RAVM: compile once and run anywhere https://zsmith.co/ravm.html
C
2
star
35

subc

SubC is a fast and simple public domain compiler for a clean subset of the C programming language. It can compile itself and passes gcc -Wall -pedantic. Current version: subc-20141207.tgz
C
2
star
36

inferno

inferno on android/firefox-os
C
2
star
37

os64

C
2
star
38

cucu

CUCU is a toy compiler for a toy language.
C
2
star
39

dmenu

suckless dmenu
C
2
star
40

agg

Anti-Grain Geometry Library
C++
2
star
41

sam

Sam is a powerful X-based text editor
C
1
star
42

assembly

Assembly
1
star
43

b0-0.0.19

b0 version 0.0.19
C
1
star
44

one

assembly virtual run
1
star
45

crw

a small compiler written in D
D
1
star
46

kwin

KWin Wayland compositor
C++
1
star
47

ark-c

Really old Ark compiler written in C
C
1
star
48

pluto

C
1
star
49

qcc

A tiny C compiler
OCaml
1
star
50

wl

windowlab http://nickgravgaard.com/windowlab/
C
1
star
51

gdsl

GDSL: A Generic Decoder Specification Language for Interpreting Machine Language
Standard ML
1
star
52

crom

chromium dissect
C++
1
star
53

nf

C
1
star
54

gala

Meet Gala: The Window Manager http://elementary.io/journal/meet-gala-window-manager
Groff
1
star
55

YESS

A CPU simulator written in C that processes a subset of the X86 instruction set.
C
1
star
56

AntTweakBar

C++
1
star
57

fi

simple editor
C++
1
star
58

edit

A Relaxing Mix of Vi & Acme
C
1
star
59

dcc

dcc decompiler by Cristina Cifuentes
C
1
star
60

x64lab.lastalfa

Automatically exported from code.google.com/p/x64lab.lastalfa
Assembly
1
star
61

clue

C
1
star
62

pemagine

1
star
63

Petitboot

A kexec based bootloader
C
1
star
64

go-learn

initial version go lang code study
Go
1
star
65

HrwCC

hrwcc: a self-compiling C compiler
Prolog
1
star
66

minimalforth

Assembly
1
star
67

sparse

C
1
star
68

barrelfish

Barrelfish is a new research operating system being built from scratch and released by ETH Zurich in Switzerland
C
1
star