Natural Language Concrete Syntax Tree format.
nlcst is a specification for representing natural language in a syntax tree. It implements the unist spec.
This document may not be released.
See releases for released documents.
The latest released version is 1.0.2.
Contents
- Introduction
- Types
- Nodes (abstract)
- Nodes
- Glossary
- List of utilities
- Related
- References
- Contribute
- Acknowledgments
- License
Introduction
This document defines a format for representing natural language as a concrete syntax tree. Development of nlcst started in May 2014, in the now deprecated textom project for retext, before unist existed. This specification is written in a Web IDL-like grammar.
Where this specification fits
nlcst extends unist, a format for syntax trees, to benefit from its ecosystem of utilities.
nlcst relates to JavaScript in that it has an ecosystem of utilities for working with compliant syntax trees in JavaScript. However, nlcst is not limited to JavaScript and can be used in other programming languages.
nlcst relates to the unified and retext projects in that nlcst syntax trees are used throughout their ecosystems.
Types
If you are using TypeScript, you can use the nlcst types by installing them with npm:
npm install @types/nlcst
Nodes (abstract)
Literal
interface Literal <: UnistLiteral {
  value: string
}
Literal (UnistLiteral) represents a node in nlcst containing a value.
Its value field is a string.
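As a minimal sketch of what this means in practice, the check below tells literal nodes apart by looking for a string `value` field. The type names (`NlcstNode`, `NlcstLiteral`, `NlcstParent`) and the helper `hasStringValue` are hypothetical and local to this example; they are not part of nlcst or `@types/nlcst`.

```typescript
// Minimal local types mirroring the interfaces in this document.
// Assumption: these are illustrative, not the types shipped in `@types/nlcst`.
interface NlcstNode {
  type: string
}

interface NlcstLiteral extends NlcstNode {
  value: string
}

interface NlcstParent extends NlcstNode {
  children: NlcstNode[]
}

// Hypothetical helper: true when `node` carries a string `value` field.
function hasStringValue(node: NlcstNode): node is NlcstLiteral {
  return typeof (node as {value?: unknown}).value === 'string'
}

// A literal node: it has a `value`.
const textNode: NlcstLiteral = {type: 'TextNode', value: 'Hello'}

// A parent node: it has `children` instead of a `value`.
const wordNode: NlcstParent = {type: 'WordNode', children: [textNode]}
```

Here `hasStringValue(textNode)` holds and `hasStringValue(wordNode)` does not, because a word keeps its content in child nodes.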
Parent
interface Parent <: UnistParent {
  children: [Paragraph | Punctuation | Sentence | Source | Symbol | Text | WhiteSpace | Word]
}
Parent (UnistParent) represents a node in nlcst containing other nodes (said to be children).
Its content is limited to only other nlcst content.
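Because parents hold their content in `children`, a tree can be walked recursively. The following is a rough sketch of a depth-first, pre-order walk that records node types. The local types and the functions `isParentNode` and `collectTypes` are assumptions made for this example, and the way the word "don't" is split into nodes is only one plausible tokenisation.

```typescript
// Local, illustrative types; assumptions for this sketch only.
interface NlcstNode {
  type: string
}

interface NlcstLiteral extends NlcstNode {
  value: string
}

interface NlcstParent extends NlcstNode {
  children: NlcstNode[]
}

// Hypothetical helper: true when `node` has a `children` array.
function isParentNode(node: NlcstNode): node is NlcstParent {
  return Array.isArray((node as {children?: unknown}).children)
}

// Depth-first, pre-order walk that records each node's `type`.
function collectTypes(node: NlcstNode, out: string[] = []): string[] {
  out.push(node.type)
  if (isParentNode(node)) {
    for (const child of node.children) {
      collectTypes(child, out)
    }
  }
  return out
}

// One possible shape for the word "don't".
const don: NlcstLiteral = {type: 'TextNode', value: 'don'}
const apostrophe: NlcstLiteral = {type: 'PunctuationNode', value: "'"}
const t: NlcstLiteral = {type: 'TextNode', value: 't'}
const sampleWord: NlcstParent = {type: 'WordNode', children: [don, apostrophe, t]}
```

Calling `collectTypes(sampleWord)` visits the word first and then its three children in order.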
Nodes
Paragraph
interface Paragraph <: Parent {
  type: 'ParagraphNode'
  children: [Sentence | Source | WhiteSpace]
}
Paragraph (Parent) represents a unit of discourse dealing with a particular point or idea.
Paragraph can be used in a root node. It can contain sentence, whitespace, and source nodes.
Punctuation
interface Punctuation <: Literal {
  type: 'PunctuationNode'
}
Punctuation (Literal) represents typographical devices which aid understanding and correct reading of other grammatical units.
Punctuation can be used in sentence or word nodes.
Root
interface Root <: Parent {
  type: 'RootNode'
}
Root (Parent) represents a document.
Root can be used as the root of a tree, never as a child. Its content model is not limited: it can contain any nlcst content, with the restriction that all content must be of the same category.
Sentence
interface Sentence <: Parent {
  type: 'SentenceNode'
  children: [Punctuation | Source | Symbol | WhiteSpace | Word]
}
Sentence (Parent) represents a grouping of grammatically linked words that, in principle, expresses a complete thought, although it may make little sense taken in isolation, out of context.
Sentence can be used in a paragraph node. It can contain word, symbol, punctuation, whitespace, and source nodes.
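To make the content model concrete, here is a sketch of how the sentence "Hello, world." could be represented. The type names (`Lit`, `WordChild`, `WordNode`, `SentenceChild`, `SentenceNode`) and the function `countWords` are hypothetical and defined locally; the exact tokenisation depends on the parser, so treat the tree as one possible result rather than a normative one.

```typescript
// Local, illustrative types; assumptions for this sketch, not `@types/nlcst`.
interface Lit<T extends string> {
  type: T
  value: string
}

type WordChild =
  | Lit<'PunctuationNode'>
  | Lit<'SourceNode'>
  | Lit<'SymbolNode'>
  | Lit<'TextNode'>

interface WordNode {
  type: 'WordNode'
  children: WordChild[]
}

type SentenceChild =
  | Lit<'PunctuationNode'>
  | Lit<'SourceNode'>
  | Lit<'SymbolNode'>
  | Lit<'WhiteSpaceNode'>
  | WordNode

interface SentenceNode {
  type: 'SentenceNode'
  children: SentenceChild[]
}

// One possible tree for "Hello, world."
const helloWorld: SentenceNode = {
  type: 'SentenceNode',
  children: [
    {type: 'WordNode', children: [{type: 'TextNode', value: 'Hello'}]},
    {type: 'PunctuationNode', value: ','},
    {type: 'WhiteSpaceNode', value: ' '},
    {type: 'WordNode', children: [{type: 'TextNode', value: 'world'}]},
    {type: 'PunctuationNode', value: '.'}
  ]
}

// Count the direct word children of a sentence.
function countWords(sentence: SentenceNode): number {
  return sentence.children.filter((child) => child.type === 'WordNode').length
}
```

With this tree, `countWords(helloWorld)` counts the two word nodes and ignores punctuation and white space.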
Source
interface Source <: Literal {
  type: 'SourceNode'
}
Source (Literal) represents an external (ungrammatical) value embedded into a grammatical unit: a hyperlink, code, and such.
Source can be used in root, paragraph, sentence, or word nodes.
Symbol
interface Symbol <: Literal {
  type: 'SymbolNode'
}
Symbol (Literal) represents typographical devices different from characters which represent sounds (like letters and numerals), white space, or punctuation.
Symbol can be used in sentence or word nodes.
Text
interface Text <: Literal {
  type: 'TextNode'
}
Text (Literal) represents actual content in nlcst documents: one or more characters.
Text can be used in word nodes.
WhiteSpace
interface WhiteSpace <: Literal {
  type: 'WhiteSpaceNode'
}
WhiteSpace (Literal) represents typographical devices devoid of content, separating other units.
WhiteSpace can be used in root, paragraph, or sentence nodes.
Word
interface Word <: Parent {
  type: 'WordNode'
  children: [Punctuation | Source | Symbol | Text]
}
Word (Parent) represents the smallest element that may be uttered in isolation with semantic or pragmatic content.
Word can be used in a sentence node. It can contain text, symbol, punctuation, and source nodes.
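The `children` fields of the paragraph, sentence, and word interfaces can be turned into a small lookup table for checking trees. The sketch below is an assumption-laden illustration, not an official validator: `TreeNode`, `allowedChildren`, and `findContentViolations` are hypothetical names, and root is left unchecked because its content model is not limited to a fixed list.

```typescript
// Loose, illustrative node shape; an assumption for this sketch only.
interface TreeNode {
  type: string
  value?: string
  children?: TreeNode[]
}

// Allowed child types, copied from the `children` fields of the interfaces.
// Root is omitted on purpose: its content is not restricted to a fixed list.
const allowedChildren: Record<string, string[]> = {
  ParagraphNode: ['SentenceNode', 'SourceNode', 'WhiteSpaceNode'],
  SentenceNode: ['PunctuationNode', 'SourceNode', 'SymbolNode', 'WhiteSpaceNode', 'WordNode'],
  WordNode: ['PunctuationNode', 'SourceNode', 'SymbolNode', 'TextNode']
}

// Hypothetical checker: returns the path of every child found in a parent
// that does not list its type.
function findContentViolations(node: TreeNode, path: string = node.type): string[] {
  const problems: string[] = []
  const allowed = allowedChildren[node.type]
  for (const child of node.children || []) {
    const childPath = path + ' > ' + child.type
    if (allowed !== undefined && allowed.indexOf(child.type) === -1) {
      problems.push(childPath)
    }
    problems.push(...findContentViolations(child, childPath))
  }
  return problems
}

const validSentence: TreeNode = {
  type: 'SentenceNode',
  children: [
    {type: 'WordNode', children: [{type: 'TextNode', value: 'Hi'}]},
    {type: 'PunctuationNode', value: '.'}
  ]
}

// White space is not allowed inside a word.
const invalidSentence: TreeNode = {
  type: 'SentenceNode',
  children: [
    {
      type: 'WordNode',
      children: [
        {type: 'TextNode', value: 'Hi'},
        {type: 'WhiteSpaceNode', value: ' '}
      ]
    }
  ]
}
```

The valid sentence yields no violations, while the second tree is flagged because a word may not contain white space.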
Glossary
See the unist glossary.
List of utilities
See the unist list of utilities for more utilities.
- nlcst-affix-emoticon-modifier: merge affix emoticons into the previous sentence
- nlcst-emoji-modifier: support emoji
- nlcst-emoticon-modifier: support emoticons
- nlcst-is-literal: check whether a node is meant literally
- nlcst-normalize: normalize a word for easier comparison
- nlcst-search: search for patterns
- nlcst-to-string: serialize a node
- nlcst-test: validate a node
- mdast-util-to-nlcst: transform mdast to nlcst
- hast-util-to-nlcst: transform hast to nlcst
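To give a feel for what a utility over nlcst trees looks like, here is a minimal sketch that serializes a node by concatenating the values of its literal descendants. It is written from scratch for illustration: `TreeNode` and `serialize` are hypothetical names, and this is not the implementation or API of nlcst-to-string.

```typescript
// Loose, illustrative node shape; an assumption for this sketch only.
interface TreeNode {
  type: string
  value?: string
  children?: TreeNode[]
}

// Hypothetical serializer: literals yield their value, parents yield the
// concatenation of their serialized children.
function serialize(node: TreeNode): string {
  if (typeof node.value === 'string') {
    return node.value
  }
  return (node.children || []).map((child) => serialize(child)).join('')
}

// One possible tree for "Hello, world."
const greeting: TreeNode = {
  type: 'SentenceNode',
  children: [
    {type: 'WordNode', children: [{type: 'TextNode', value: 'Hello'}]},
    {type: 'PunctuationNode', value: ','},
    {type: 'WhiteSpaceNode', value: ' '},
    {type: 'WordNode', children: [{type: 'TextNode', value: 'world'}]},
    {type: 'PunctuationNode', value: '.'}
  ]
}

console.log(serialize(greeting)) // prints "Hello, world."
```

Because white space and punctuation are kept as nodes, the original text can be recovered from the tree without extra bookkeeping.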
Related
- mdast: Markdown Abstract Syntax Tree format
- hast: Hypertext Abstract Syntax Tree format
- xast: Extensible Abstract Syntax Tree
References
- unist: Universal Syntax Tree. T. Wormer; et al.
- JavaScript: ECMAScript Language Specification. Ecma International.
- Web IDL: Web IDL, C. McCormack. W3C.
Contribute
See contributing.md in syntax-tree/.github for ways to get started.
See support.md for ways to get help.
Ideas for new utilities and tools can be posted in syntax-tree/ideas.
A curated list of awesome syntax-tree, unist, mdast, hast, xast, and nlcst resources can be found in awesome syntax-tree.
This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.
Acknowledgments
The initial release of this project was authored by @wooorm.
Thanks to @nwtn, @tmcw, @muraken720, and @dozoisch for contributing to nlcst and related projects!