  • Stars: 690
  • Rank 63,207 (Top 2 %)
  • Language: C
  • License: MIT License
  • Created about 5 years ago
  • Updated about 2 months ago

Repository Details

Python bindings to the Tree-sitter parsing library

py-tree-sitter

This module provides Python bindings to the tree-sitter parsing library.

Installation

This package currently only works with Python 3. There are no library dependencies, but you do need to have a C compiler installed.

pip3 install tree_sitter

Usage

Setup

First you'll need a Tree-sitter language implementation for each language that you want to parse. You can clone some of the existing language repos or create your own:

git clone https://github.com/tree-sitter/tree-sitter-go vendor/tree-sitter-go
git clone https://github.com/tree-sitter/tree-sitter-javascript vendor/tree-sitter-javascript
git clone https://github.com/tree-sitter/tree-sitter-python vendor/tree-sitter-python

Use the Language.build_library method to compile these into a library that's usable from Python. This function will return immediately if the library has already been compiled since the last time its source code was modified:

from tree_sitter import Language, Parser

Language.build_library(
  # Store the library in the `build` directory
  'build/my-languages.so',

  # Include one or more languages
  [
    'vendor/tree-sitter-go',
    'vendor/tree-sitter-javascript',
    'vendor/tree-sitter-python'
  ]
)

Load the languages into your app as Language objects:

GO_LANGUAGE = Language('build/my-languages.so', 'go')
JS_LANGUAGE = Language('build/my-languages.so', 'javascript')
PY_LANGUAGE = Language('build/my-languages.so', 'python')

Basic Parsing

Create a Parser and configure it to use one of the languages:

parser = Parser()
parser.set_language(PY_LANGUAGE)

Parse some source code:

tree = parser.parse(bytes("""
def foo():
    if bar:
        baz()
""", "utf8"))

If you have your source code in some data structure other than a bytes object, you can pass a "read" callable to the parse function.

The read callable can use either the byte offset or the point tuple to read from the buffer and return source code as a bytes object. An empty bytes object or None terminates parsing for that line. The bytes must encode the source as UTF-8.

For example, to use the byte offset:

src = bytes("""
def foo():
    if bar:
        baz()
""", "utf8")

def read_callable(byte_offset, point):
    return src[byte_offset:byte_offset+1]

tree = parser.parse(read_callable)

And to use the point:

src_lines = ["def foo():\n", "    if bar:\n", "        baz()"]

def read_callable(byte_offset, point):
    row, column = point
    if row >= len(src_lines) or column >= len(src_lines[row]):
        return None
    return src_lines[row][column:].encode('utf8')

tree = parser.parse(read_callable)
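
When the source is held as a sequence of chunks rather than one bytes object, the byte-offset form of the read callable can be sketched as follows. This is only an illustration: `make_chunk_reader` and the list-of-chunks layout are hypothetical and not part of the library. It relies solely on the contract described above, namely that the callable receives a byte offset and a point and returns bytes, with an empty bytes object signalling that nothing is left.

```python
from bisect import bisect_right
from itertools import accumulate


def make_chunk_reader(chunks):
    """Build a read callable over a list of UTF-8 encoded bytes chunks.

    The chunk list is a hypothetical buffer layout used for illustration.
    """
    chunks = [c for c in chunks if c]  # drop empty chunks
    # starts[i] is the byte offset at which chunks[i] begins
    starts = [0] + list(accumulate(len(c) for c in chunks))[:-1]
    total = sum(len(c) for c in chunks)

    def read_callable(byte_offset, point):
        if byte_offset >= total:
            return b""  # empty bytes: nothing left to read
        # Locate the chunk that contains byte_offset
        i = bisect_right(starts, byte_offset) - 1
        # Return the rest of that chunk, starting at the requested offset
        return chunks[i][byte_offset - starts[i]:]

    return read_callable
```

The result can then be passed in place of the bytes object, for example `parser.parse(make_chunk_reader(chunks))`.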

Inspect the resulting Tree:

root_node = tree.root_node
assert root_node.type == 'module'
assert root_node.start_point == (1, 0)
assert root_node.end_point == (3, 13)

function_node = root_node.children[0]
assert function_node.type == 'function_definition'
assert function_node.child_by_field_name('name').type == 'identifier'

function_name_node = function_node.children[1]
assert function_name_node.type == 'identifier'
assert function_name_node.start_point == (1, 4)
assert function_name_node.end_point == (1, 7)

# Parentheses let the adjacent string literals continue across lines
assert root_node.sexp() == (
    "(module "
    "(function_definition "
        "name: (identifier) "
        "parameters: (parameters) "
        "body: (block "
            "(if_statement "
                "condition: (identifier) "
                "consequence: (block "
                    "(expression_statement (call "
                        "function: (identifier) "
                        "arguments: (argument_list))))))))"
)

Walking Syntax Trees

If you need to traverse a large number of nodes efficiently, you can use a TreeCursor:

cursor = tree.walk()

assert cursor.node.type == 'module'

assert cursor.goto_first_child()
assert cursor.node.type == 'function_definition'

assert cursor.goto_first_child()
assert cursor.node.type == 'def'

# Returns `False` because the `def` node has no children
assert not cursor.goto_first_child()

assert cursor.goto_next_sibling()
assert cursor.node.type == 'identifier'

assert cursor.goto_next_sibling()
assert cursor.node.type == 'parameters'

assert cursor.goto_parent()
assert cursor.node.type == 'function_definition'
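
The same three cursor moves are enough to visit every node. Below is a minimal sketch of a pre-order traversal; `walk_types` is a hypothetical helper, not a library function, and it assumes only the `goto_first_child`, `goto_next_sibling`, `goto_parent` and `node.type` members used above.

```python
def walk_types(cursor):
    """Yield (depth, node_type) pairs in pre-order using only cursor moves."""
    depth = 0
    while True:
        yield depth, cursor.node.type
        # Descend first, so children are visited before siblings
        if cursor.goto_first_child():
            depth += 1
            continue
        # No children: climb until a next sibling exists, never above the start
        while depth > 0 and not cursor.goto_next_sibling():
            cursor.goto_parent()
            depth -= 1
        if depth == 0:
            return  # back at the starting node: traversal is complete
```

For example, `for depth, node_type in walk_types(tree.walk()): print('  ' * depth + node_type)` would print an indented outline of the tree.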

Editing

When a source file is edited, you can edit the syntax tree to keep it in sync with the source:

tree.edit(
    start_byte=5,
    old_end_byte=5,
    new_end_byte=5 + 2,
    start_point=(0, 5),
    old_end_point=(0, 5),
    new_end_point=(0, 5 + 2),
)
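
The six arguments can be derived from the old source and the change itself. The sketch below does this for a plain insertion; `byte_to_point` and `insertion_edit` are hypothetical helpers, and the code assumes that point columns are counted in bytes from the start of the row.

```python
def byte_to_point(source, byte_offset):
    """Convert a byte offset in `source` (bytes) to a (row, column) point.

    Assumption: columns are byte counts from the start of the row.
    """
    prefix = source[:byte_offset]
    row = prefix.count(b"\n")
    # rfind returns -1 when there is no newline, so the row starts at 0
    column = len(prefix) - (prefix.rfind(b"\n") + 1)
    return (row, column)


def insertion_edit(old_source, start_byte, inserted):
    """Return (new_source, kwargs) for inserting `inserted` at `start_byte`."""
    new_source = old_source[:start_byte] + inserted + old_source[start_byte:]
    new_end_byte = start_byte + len(inserted)
    start_point = byte_to_point(old_source, start_byte)
    kwargs = dict(
        start_byte=start_byte,
        old_end_byte=start_byte,  # an insertion removes nothing
        new_end_byte=new_end_byte,
        start_point=start_point,
        old_end_point=start_point,
        new_end_point=byte_to_point(new_source, new_end_byte),
    )
    return new_source, kwargs
```

With these helpers, `new_source, kwargs = insertion_edit(old_source, 5, b"xy")` followed by `tree.edit(**kwargs)` reproduces the values shown above for a two-byte insertion at byte 5 of the first line.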

Then, when you're ready to incorporate the changes into a new syntax tree, you can call Parser.parse again, but pass in the old tree:

new_tree = parser.parse(new_source, tree)

This will run much faster than if you were parsing from scratch.

The Tree.get_changed_ranges method can be called on the old tree to return the list of ranges whose syntactic structure has been changed:

for changed_range in tree.get_changed_ranges(new_tree):
    print('Changed range:')
    print(f'  Start point {changed_range.start_point}')
    print(f'  Start byte {changed_range.start_byte}')
    print(f'  End point {changed_range.end_point}')
    print(f'  End byte {changed_range.end_byte}')

Pattern-matching

You can search for patterns in a syntax tree using a tree query:

query = PY_LANGUAGE.query("""
(function_definition
  name: (identifier) @function.def)

(call
  function: (identifier) @function.call)
""")

captures = query.captures(tree.root_node)
assert len(captures) == 2
assert captures[0][0] == function_name_node
assert captures[0][1] == "function.def"

The Query.captures() method takes optional start_point, end_point, start_byte and end_byte keyword arguments, which can be used to restrict the query's range. Only one of the ..._byte or ..._point pairs needs to be given to restrict the range. If all are omitted, the entire range of the passed node is used.
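
Since each capture is a (node, capture name) pair, as the assertions above show, it is often convenient to group the nodes by capture name. A minimal sketch follows; `group_captures` is a hypothetical helper, and it assumes the list-of-pairs shape used in this section.

```python
from collections import defaultdict


def group_captures(captures):
    """Group (node, capture_name) pairs by capture name, preserving order."""
    grouped = defaultdict(list)
    for node, name in captures:
        grouped[name].append(node)
    return dict(grouped)
```

For the query above, `group_captures(captures)["function.def"]` would hold the nodes captured as function definitions, and `"function.call"` the called identifiers.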
