-ast-dump=json


This library provides deserialization logic for efficiently processing Clang's -ast-dump=json format from Rust.

[dependencies]
clang-ast = "0.1"

Format overview

An AST dump is generated by a compiler command like:

$  clang++ -Xclang -ast-dump=json -fsyntax-only path/to/source.cc

The high-level structure is a tree of nodes, each of which has an "id" and a "kind", zero or more further fields depending on what the node kind is, and finally an optional "inner" array of child nodes.

As an example, for an input file containing just the declaration class S;, the AST would be as follows:

{
  "id": "0x1fcea38",                 //<-- root node
  "kind": "TranslationUnitDecl",
  "inner": [
    {
      "id": "0xadf3a8",              //<-- first child node
      "kind": "CXXRecordDecl",
      "loc": {
        "offset": 6,
        "file": "source.cc",
        "line": 1,
        "col": 7,
        "tokLen": 1
      },
      "range": {
        "begin": {
          "offset": 0,
          "col": 1,
          "tokLen": 5
        },
        "end": {
          "offset": 6,
          "col": 7,
          "tokLen": 1
        }
      },
      "name": "S",
      "tagUsed": "class"
    }
  ]
}

Library design

By design, the clang-ast crate does not provide a single great big data structure that exhaustively covers every possible field of every possible Clang node type. There are three major reasons:

  • Performance: these ASTs get quite large. For a reasonable mid-sized translation unit that includes several platform headers, you can easily get an AST that is tens to hundreds of megabytes of JSON. To maintain performance of downstream tooling built on the AST, it's critical that you deserialize only the few fields which are directly required by your use case, and allow Serde's deserializer to efficiently ignore all the rest.

  • Stability: as Clang is developed, the specific fields associated with each node kind are expected to change over time in non-additive ways. This is nonproblematic because the churn on the scale of individual nodes is minimal (maybe one change every several years). However, if there were a data structure that promised to be able to deserialize every possible piece of information in every node, practically every change to Clang would be a breaking change to some node somewhere despite your tooling not caring anything at all about that node kind. By deserializing only those fields which are directly relevant to your use case, you become insulated from the vast majority of syntax tree changes.

  • Compile time: a typical use case involves inspecting only a tiny fraction of the possible nodes or fields, on the order of 1%. Consequently your code will compile 100× faster than if you tried to include everything in the data structure.


Data structures

The core data structure of the clang-ast crate is Node<T>.

pub struct Node<T> {
    pub id: Id,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

The caller must provide their own kind type T, which is an enum or struct as described below. T determines exactly what information the clang-ast crate will deserialize out of the AST dump.

By convention you should name your T type Clang.


T = enum

Most often, you'll want Clang to be an enum. In this case your enum must have one variant per node kind that you care about. The name of each variant matches the "kind" entry seen in the AST.

Additionally there must be a fallback variant, which must be named either Unknown or Other, into which clang-ast will put all tree nodes not matching one of the expected kinds.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

fn main() {
    let json = std::fs::read_to_string("ast.json").unwrap();
    let node: Node = serde_json::from_str(&json).unwrap();

}

The above is a simple example with variants for processing "kind": "NamespaceDecl", "kind": "EnumDecl", and "kind": "EnumConstantDecl" nodes. This is sufficient to extract the set of variants of every enum in the translation unit, and the enums' namespace (possibly anonymous) and enum name (possibly anonymous).
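As a rough illustration of how such a tree might be consumed once deserialized, the sketch below walks the nodes and collects each enum variant together with the namespace and enum it belongs to. To keep it free of dependencies it uses a local stand-in `Node` with the same `kind` / `inner` shape instead of `clang_ast::Node<Clang>`; the function name `collect_variants` and the "(anonymous)" placeholder are assumptions made for this example, not part of the crate.

```rust
/// Local stand-in with the same `kind` / `inner` shape as the crate's
/// `Node<T>` (the `id` field is left out); an assumption for this sketch.
pub struct Node {
    pub kind: Clang,
    pub inner: Vec<Node>,
}

pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

/// Recursively collect `(qualified enum name, variant name)` pairs.
/// `path` holds the names of the enclosing namespaces and enum.
pub fn collect_variants(
    node: &Node,
    path: &mut Vec<String>,
    out: &mut Vec<(String, String)>,
) {
    match &node.kind {
        Clang::NamespaceDecl { name } | Clang::EnumDecl { name } => {
            // Anonymous namespaces and enums have no name in the dump.
            path.push(name.clone().unwrap_or_else(|| "(anonymous)".to_owned()));
            for child in &node.inner {
                collect_variants(child, path, out);
            }
            path.pop();
        }
        Clang::EnumConstantDecl { name } => {
            out.push((path.join("::"), name.clone()));
        }
        Clang::Other => {
            // Unrecognized nodes may still contain namespaces or enums.
            for child in &node.inner {
                collect_variants(child, path, out);
            }
        }
    }
}
```

With the real crate, the same match would presumably be written against `node.kind` and `node.inner` of the deserialized `Node`.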

Newtype variants are fine too, particularly if you'll be deserializing more than one field for some nodes.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    EnumConstantDecl(EnumConstantDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumConstantDecl {
    pub name: String,
}

T = struct

Rarely, it can make sense to instantiate Node with Clang being a struct type, instead of an enum. This allows for deserializing a uniform group of data out of every node in the syntax tree.

The following example struct collects the "loc" and "range" of every node if present; these fields provide the file name / line / column position of nodes. Not every node kind contains this information, so we use Option to collect it for just the nodes that have it.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String,  // or clang_ast::Kind
    pub loc: Option<clang_ast::SourceLocation>,
    pub range: Option<clang_ast::SourceRange>,
}

If you really need, it's also possible to store every other piece of key/value information about every node via a weakly typed Map<String, Value> and the Serde flatten attribute.

use serde::Deserialize;
use serde_json::{Map, Value};

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String,  // or clang_ast::Kind
    #[serde(flatten)]
    pub data: Map<String, Value>,
}

Hybrid approach

To deserialize kind-specific information about a fixed set of node kinds you care about, as well as some uniform information about every other kind of node, you can use a hybrid of the two approaches by giving your Other / Unknown fallback variant some fields.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    Other {
        kind: clang_ast::Kind,
    },
}
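One possible use of a fallback variant that carries the kind is to find out which node kinds your tooling is currently ignoring. The sketch below tallies them. It is dependency-free: `Node` is a local stand-in for the crate's type, the known variants are reduced to unit variants for brevity, and a plain `String` takes the place of `clang_ast::Kind`. The function name `tally_unhandled` is hypothetical.

```rust
use std::collections::BTreeMap;

pub enum Clang {
    NamespaceDecl,
    EnumDecl,
    // Stand-in for `Other { kind: clang_ast::Kind }`.
    Other { kind: String },
}

/// Local stand-in for the crate's `Node<T>`, without the `id` field.
pub struct Node {
    pub kind: Clang,
    pub inner: Vec<Node>,
}

/// Count how many nodes of each unhandled kind appear in the tree.
pub fn tally_unhandled(node: &Node, counts: &mut BTreeMap<String, usize>) {
    if let Clang::Other { kind } = &node.kind {
        *counts.entry(kind.clone()).or_insert(0) += 1;
    }
    for child in &node.inner {
        tally_unhandled(child, counts);
    }
}
```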

Source locations

Many node kinds expose the source location of the corresponding source code tokens, which includes:

  • the filepath at which they're located;
  • the chain of #includes by which that file was brought into the translation unit;
  • line/column positions within the source file;
  • macro expansion trace for tokens constructed by expansion of a C preprocessor macro.

You'll find this information in fields called "loc" and/or "range" in the JSON representation.

{
  "id": "0x1251428",
  "kind": "NamespaceDecl",
  "loc": {                           //<--
    "offset": 7004,
    "file": "/usr/include/x86_64-linux-gnu/c++/10/bits/c++config.h",
    "line": 258,
    "col": 11,
    "tokLen": 3,
    "includedFrom": {
      "file": "/usr/include/c++/10/utility"
    }
  },
  "range": {                         //<--
    "begin": {
      "offset": 6994,
      "col": 1,
      "tokLen": 9
    },
    "end": {
      "offset": 7155,
      "line": 266,
      "col": 1,
      "tokLen": 1
    }
  },
  ...
}

The naive deserialization of these structures is challenging to work with because Clang uses field omission to mean "same as previous". So if a "loc" is printed without a "file" inside, it means the loc is in the same file as the immediately previous loc in serialization order.
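To make the "same as previous" rule concrete, here is a minimal sketch of the bookkeeping a naive consumer would have to do by hand: walk the locations in serialization order and carry the last seen values forward. The types `RawLoc` and `ResolvedLoc` and the function `resolve` are hypothetical and not part of the clang-ast API. The sketch also assumes that an omitted "line" follows the same rule as an omitted "file", as the "begin" entry in the example above suggests.

```rust
use std::sync::Arc;

/// A location as it appears in the raw JSON: `file` and `line` may be absent.
/// Hypothetical type for illustration only.
pub struct RawLoc {
    pub file: Option<&'static str>,
    pub line: Option<u32>,
    pub col: u32,
}

/// A location with every omitted field filled in.
#[derive(Debug, PartialEq)]
pub struct ResolvedLoc {
    pub file: Arc<str>,
    pub line: u32,
    pub col: u32,
}

/// Walk locations in serialization order, carrying forward the most recent
/// file and line whenever a location omits them.
pub fn resolve(raw: &[RawLoc]) -> Vec<ResolvedLoc> {
    let mut file: Arc<str> = Arc::from("");
    let mut line = 0;
    let mut out = Vec::with_capacity(raw.len());
    for loc in raw {
        if let Some(name) = loc.file {
            // A new file name replaces the carried one.
            file = Arc::from(name);
        }
        if let Some(n) = loc.line {
            line = n;
        }
        out.push(ResolvedLoc {
            // Cloning an Arc only bumps a reference count, so locations in
            // the same file share a single allocation for the path.
            file: Arc::clone(&file),
            line,
            col: loc.col,
        });
    }
    out
}
```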

The clang-ast crate provides types for deserializing this source location information painlessly, producing Arc<str> as the type of filepaths which may be shared across multiple source locations.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
    pub loc: clang_ast::SourceLocation,    //<--
    pub range: clang_ast::SourceRange,     //<--
}

Node identifiers

Every syntax tree node has an "id". In JSON it's the memory address of Clang's internal memory allocation for that node, serialized to a hex string.

The AST dump uses ids as backreferences in nodes of directed acyclic graph nature. For example the following MemberExpr node is part of the invocation of an operator bool conversion, and thus its syntax tree refers to the resolved operator bool conversion function declaration:

{
  "id": "0x9918b88",
  "kind": "MemberExpr",
  "valueCategory": "rvalue",
  "referencedMemberDecl": "0x12d8330",     //<--
  ...
}

The node it references, with memory address 0x12d8330, is found somewhere earlier in the syntax tree:

{
  "id": "0x12d8330",                       //<--
  "kind": "CXXConversionDecl",
  "name": "operator bool",
  "mangledName": "_ZNKSt17integral_constantIbLb1EEcvbEv",
  "type": {
    "qualType": "std::integral_constant<bool, true>::value_type () const noexcept"
  },
  "constexpr": true,
  ...
}

Due to the ubiquitous use of ids for backreferencing, it is valuable to deserialize them not as strings but as a 64-bit integer. The clang-ast crate provides an Id type for this purpose, which is cheaply copyable, hashable, and comparable more cheaply than a string. You may find yourself with lots of hashtables keyed on Id.
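The idea can be sketched without the crate: parse the hex string once into an integer, then key lookup tables on that integer. `NodeId`, `parse_id` and `index_by_id` below are hypothetical names for this illustration; the crate's own `Id` type may be implemented differently.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an integer node id; not the crate's `Id` type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct NodeId(pub u64);

/// Parse a hex string such as "0x12d8330" into a 64-bit id.
/// Returns `None` if the "0x" prefix is missing or the digits are invalid.
pub fn parse_id(text: &str) -> Option<NodeId> {
    let digits = text.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok().map(NodeId)
}

/// Build a table from node id to node kind, so that a backreference such as
/// "referencedMemberDecl" can be resolved with a single hash lookup.
/// Entries whose id fails to parse are skipped.
pub fn index_by_id(nodes: &[(&str, &str)]) -> HashMap<NodeId, String> {
    nodes
        .iter()
        .filter_map(|&(id, kind)| Some((parse_id(id)?, kind.to_owned())))
        .collect()
}
```

Looking up the id parsed from "0x12d8330" in a table built from the two nodes shown above would then yield "CXXConversionDecl".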


License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
