org-ml
A functional API for org-mode inspired by @magnars's dash.el and s.el libraries.
Upcoming Breaking Changes
- org-ml-get/set/map-affiliated-keyword and org-ml-set-caption! have been merged with org-ml-get/set/map-property and will be removed in a later revision
- org-ml-do-(some-)headlines, org-ml-do-(some-)subtrees, org-ml-get-(some-)headlines, and org-ml-get-(some-)subtrees are now deprecated. Use org-ml-parse-headlines, org-ml-parse-subtrees, org-ml-update-headlines, and org-ml-update-subtrees instead.
Installation
Install from MELPA:
M-x package-install RET org-ml RET
Alternatively, clone this repository to somewhere in your load path:
git clone https://github.com/ndwarshuis/org-ml ~/somewhere/in/load/path
Then require in your emacs config:
(require 'org-ml)
Dependencies
- emacs (27.2, 27.1, 28.1)
- org-mode (9.5, 9.4, 9.3)
- dash
- s
Explicit versions noted above have been tested. Other versions may work but are not currently supported.
Motivation
Org-mode comes with a powerful, built-in parse-tree generator specified in
org-element.el. The generated parse-tree is simply a heavily-nested list which
can be easily manipulated using (mostly pure) functional code. This contrasts
with the majority of functions normally used to interface with org-mode files,
which are imperative in nature (org-insert-heading, outline-next-heading, etc)
as they depend on the mutable state of Emacs buffers. In general, functional
code is (arguably) more robust, readable, and testable, especially in use-cases
such as this where a stateless abstract data structure is being transformed and
queried.
org-element.el provides a minimal API for handling this parse-tree in a
functional manner, but does not provide the higher-level functions necessary
for intuitive, large-scale use. The org-ml package is designed to provide this
higher-level API. Furthermore, it is highly compatible with the dash.el
package, which is a generalized functional library for emacs-lisp.
Org-Element Overview
Parsing a buffer with the function org-element-parse-buffer will yield a parse
tree composed of nodes. Nodes have types and properties associated with them.
See the org-element API documentation for a list of all node types and their
properties (also see the terminology conventions and property omissions used in
this package).
Each node is represented by a list where the first member is the type and the second member is a plist describing the node's properties:
(type (:prop1 value1 :prop2 value2 ...))
Node types may be either leaves or branches, where branches may have zero or more child nodes and leaves may not have child nodes at all. Leaves will always have lists of the form shown above. Branches, on the other hand, have their children appended to the end:
(type (:prop1 value1 :prop2 value2) child1 child2 ...)
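As a rough illustration of these shapes, the sketch below parses a small, made-up snippet using only org-element.el (no org-ml functions) and inspects one branch (a headline) and one leaf (a code object). The helper name my/demo-parse is hypothetical. The accessor functions are used instead of car/cadr because the exact in-memory layout of the property list may differ between org-mode versions.

```emacs-lisp
(require 'org)
(require 'org-element)

(defun my/demo-parse (text)
  "Parse TEXT as org-mode content and return the root of the parse tree.
Hypothetical helper for illustration; not part of org-ml."
  (with-temp-buffer
    (org-mode)
    (insert text)
    (org-element-parse-buffer)))

(let* ((tree (my/demo-parse "* TODO buy milk\nsome ~code~ text\n"))
       ;; first headline in the tree: a branch, so it has children
       (headline (org-element-map tree 'headline #'identity nil t))
       ;; first code object in the tree: a leaf, so it has no children
       (code (org-element-map tree 'code #'identity nil t)))
  (list (org-element-type headline)
        (org-element-property :todo-keyword headline)
        (org-element-property :raw-value headline)
        ;; type of the headline's first child
        (org-element-type (car (org-element-contents headline)))
        (org-element-type code)
        (org-element-property :value code)))
;; with the default TODO keywords this should evaluate to
;; (headline "TODO" "buy milk" section code "code")
```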
In addition to leaves and branches, node types can belong to one of two classes:
- Objects: roughly correspond to raw, possibly-formatted text
- Elements: more complex structures which may be built from objects
Within the branch node types, there are restrictions on which class is allowed to be a child depending on the type. There are three allowed combinations:
- Branch elements with child elements (aka 'greater elements'): these are element types that are generally nestable inside one another (eg headlines, plain-lists, items)
- Branch elements with child objects (aka 'object containers'): these are element types that hold textual information (eg paragraph)
- Branch objects with child objects (aka 'recursive objects'): these are object types used primarily for text formatting (bold, italic, underline, etc)
Note: it is never allowed for an element type to be a child of a branch object type.
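To make the two classes concrete, the following sketch walks a small parse tree and pairs each node type with its class as reported by org-element-class from org-element.el. The helper my/demo-classes and the sample text are made up for illustration.

```emacs-lisp
(require 'org)
(require 'org-element)

(defun my/demo-classes (text)
  "Return an alist of (TYPE . CLASS) for selected node types found in TEXT.
Hypothetical helper for illustration; not part of org-ml."
  (with-temp-buffer
    (org-mode)
    (insert text)
    (org-element-map (org-element-parse-buffer)
        '(headline section plain-list item paragraph bold code)
      (lambda (node)
        ;; CLASS is either the symbol `element' or the symbol `object'
        (cons (org-element-type node) (org-element-class node))))))

(my/demo-classes "* heading\n- item with *bold* and ~code~\n")
```

In the result, headline, section, plain-list, item, and paragraph should be paired with element, while bold and code should be paired with object.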
Conventions
Terminology
This package deviates in several ways from the original terminology found in
org-element.el.
- 'node' is used here to describe a vertex in the parse tree, where 'element' and 'object' are two classes used to describe said vertex (org-element.el seems to use 'element' to generally mean 'node' and uses 'object' to further specify)
- 'child' and 'children' are used here instead of 'content' and 'contents'
- 'branch' is used here instead of 'container'. Furthermore, 'leaf' is used to describe the converse of 'branch' (there does not seem to be an equivalent term in org-element.el)
- org-element.el uses 'attribute(s)' and 'property(ies)' interchangeably to describe nodes; here only 'property(ies)' is used
Properties
All properties specified by org-element.el are readable by this API (eg one
can query them with functions like org-ml-get-property).
The properties :begin, :end, :contents-begin, :contents-end, :parent, and
:post-affiliated are not settable by this API as they are not necessary for
manipulating the textual representation of the parse tree. In addition to these,
some properties unique to certain types are not settable for the same reason.
Each type's build function describes the properties that are settable.
See org-ml-remove-parent and org-ml-remove-parents for specific information
and functions regarding the :parent property, why it can be annoying, and when
you would want to remove it.
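One reason :parent can get in the way is that it is a back-reference: each child points to the node that contains it, so a freshly parsed tree refers to itself and becomes verbose to print or compare. The sketch below demonstrates the back-reference using only org-element.el; the function name is hypothetical.

```emacs-lisp
(require 'org)
(require 'org-element)

(defun my/demo-parent-is-backreference-p ()
  "Return non-nil if a child's :parent is the very node that contains it.
Hypothetical helper for illustration; not part of org-ml."
  (let* ((tree (with-temp-buffer
                 (org-mode)
                 (insert "* heading\nsome text\n")
                 (org-element-parse-buffer)))
         (headline (org-element-map tree 'headline #'identity nil t))
         ;; the section is the first child of the headline
         (section (car (org-element-contents headline))))
    ;; `eq' (not `equal'): the property holds the same Lisp object
    (eq (org-element-property :parent section) headline)))

(my/demo-parent-is-backreference-p)
```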
Threading
Each function that operates on an element/object will take the element/object as
its right-most argument. This allows convenient function chaining using
dash.el's right-threading operators (->> and -some->>). The examples in
the API reference almost exclusively demonstrate this pattern. Additionally, the
right-argument convention also allows convenient partial application using
-partial from dash.el.
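A minimal sketch of both patterns follows. It assumes org-ml-build-headline! accepts a :title-text keyword and that :todo-keyword and :tags are settable headline properties; check the API reference for the exact signatures. The name my/mark-done is hypothetical.

```emacs-lisp
(require 'dash)
(require 'org-ml)

;; Right-threading: each step receives the node produced by the previous
;; step as its last argument.
(->> (org-ml-build-headline! :title-text "write report")
     (org-ml-set-property :todo-keyword "TODO")
     (org-ml-set-property :tags '("work"))
     (org-ml-to-trimmed-string))

;; Partial application: fix every argument except the node, yielding a
;; reusable unary function.
(defalias 'my/mark-done
  (-partial #'org-ml-set-property :todo-keyword "DONE"))

(->> (org-ml-build-headline! :title-text "write report")
     (my/mark-done)
     (org-ml-get-property :todo-keyword))
```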
Higher-order functions
Higher-order functions (functions that take other functions as arguments) have two forms. The first takes a (usually unary) function and applies it:
(org-ml-map-property :value (lambda (s) (concat "foo" s)) node)
(org-ml-map-property :value (-partial concat "foo") node)
This can equivalently be written using an anaphoric form where the original
function name is suffixed with *. The symbol it carries the value of the
unary argument (unless otherwise specified):
(org-ml-map-property* :value (concat "foo" it) node)
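For a version that can be evaluated directly, the sketch below applies both forms to a freshly built node. It assumes org-ml-build-code takes the code text as a positional argument and stores it under :value; the two function names are hypothetical.

```emacs-lisp
(require 'dash)
(require 'org-ml)

(defun my/prefix-value (node)
  "Return the :value of NODE after prefixing it with \"foo\" (function form)."
  (->> node
       (org-ml-map-property :value (lambda (s) (concat "foo" s)))
       (org-ml-get-property :value)))

(defun my/prefix-value* (node)
  "Same as `my/prefix-value' but using the anaphoric form.
Inside the form, `it' is bound to the current :value of NODE."
  (->> node
       (org-ml-map-property* :value (concat "foo" it))
       (org-ml-get-property :value)))

;; both calls are expected to return the same string
(list (my/prefix-value (org-ml-build-code "bar"))
      (my/prefix-value* (org-ml-build-code "bar")))
```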
Side effect functions
All functions that read and write from buffers are named like
org-ml-OPERATION-THING-at, where OPERATION is some operation to be performed on
THING in the current buffer. All these functions take point as one of their
arguments to denote where in the buffer to perform OPERATION.
All of these functions have current-point convenience analogues that are named
as org-ml-OPERATION-this-THING, where OPERATION and THING carry the same
meaning, but OPERATION is done at the current point and point is not an
argument to the function.
For the sake of brevity, only the former form of these functions is given in the API reference.
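The sketch below contrasts the two naming forms on a temporary buffer. It assumes org-ml-parse-headline-at takes a buffer position and that org-ml-update-this-headline* is the anaphoric, current-point update function; the helper name and sample text are made up.

```emacs-lisp
(require 'org-ml)

(defun my/demo-side-effects ()
  "Read a headline with the -at form, then edit it with the this- form.
Hypothetical helper; returns the resulting buffer text."
  (with-temp-buffer
    (org-mode)
    (insert "* TODO win grammy\n")
    (goto-char (point-min))
    ;; explicit-point form: the position is passed as an argument
    (let ((headline (org-ml-parse-headline-at (point))))
      (message "state before: %s"
               (org-ml-get-property :todo-keyword headline)))
    ;; current-point form: operates on the headline under point
    (org-ml-update-this-headline*
      (org-ml-set-property :todo-keyword "DONE" it))
    (buffer-string)))

(my/demo-side-effects)
```

If the assumptions hold, the returned text should start with "* DONE win grammy".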
Usage
For comprehensive documentation of all available functions see the API reference.
Habits
By default, the org-element API does not parse timestamp habits. This means that
if you parse an org-mode buffer with timestamp habits and try to convert it back
to a string, the habits will be lost. org-ml has a wrapper function to add
this functionality; enable it by setting org-ml-parse-habits to t. Since
habits are an extension of timestamp repeaters, this option will also impact the
behavior of org-ml-timestamp-get-repeater, org-ml-timestamp-set-repeater, and
org-ml-timestamp-map-repeater (see their docstrings for details).
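As a configuration sketch, the option can be set in your init file before any buffers are parsed. A habit in this sense is the second interval of a repeater, such as the /3d in a timestamp like <2019-01-01 Tue .+1d/3d> (an example date chosen for illustration).

```emacs-lisp
(require 'org-ml)

;; keep the habit part of repeaters (eg the "/3d" in ".+1d/3d") when
;; parsing timestamps and converting them back to strings
(setq org-ml-parse-habits t)
```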
Performance
Benchmarking this library is still in the early stages.
Intuitively, the most costly operations are going to be those that go
back-and-forth between raw buffer text (here called "buffer space") and its node
representations (here called "node space") since those involve complicated
string formatting, regular expressions, buffer searching, etc (examples:
org-ml-parse-this-THING, org-ml-update-this-THING, and friends). Once the
data is in node space, execution should be very fast since nodes are just lists.
Thus if you have performance-intensive code that requires many small edits to
org-mode files, it might be better to use org-mode's built-in functions. On the
other hand, if most of the complicated processing can be done in node space
while limiting the number of conversions to/from buffer space, org-ml will be
much faster.
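The "convert once, edit many times, convert back once" pattern can be sketched with org-element.el alone, which keeps the example independent of any particular org-ml signature. The helper name is hypothetical, and note that org-element-put-property modifies nodes in place, unlike the mostly pure style described earlier.

```emacs-lisp
(require 'org)
(require 'org-element)

(defun my/demo-mark-all-done (text)
  "Return TEXT with every TODO headline switched to DONE.
Hypothetical helper; a sketch of the parse-once, write-once pattern."
  (let ((tree (with-temp-buffer
                (org-mode)
                (insert text)
                ;; buffer space -> node space (expensive, done once)
                (org-element-parse-buffer))))
    ;; node space: cheap in-memory edits, as many as needed
    (org-element-map tree 'headline
      (lambda (hl)
        (when (equal (org-element-property :todo-keyword hl) "TODO")
          (org-element-put-property hl :todo-keyword "DONE"))))
    ;; node space -> text (expensive, done once)
    (org-element-interpret-data tree)))

(my/demo-mark-all-done "* TODO one\n* TODO two\n* three\n")
```

The returned text should contain the headlines "* DONE one", "* DONE two", and an unchanged "* three".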
To be more scientific, the current tests in the suite (see
here) seem to support the following conclusions
when comparing org-ml
to equivalent code written using built-in org-mode
functions (in line with the intuitions above):
- reading data (a one way conversion from buffer to node space) is up to an order of magnitude slower, specifically when the data to be obtained isn't very large (eg, reading the TODO state from a headline)
- manipulating text (going from buffer to node space, then modifying the node, then going back to buffer space) is several times slower for single modifications (eg setting the TODO state of a headline)
- larger numbers of manipulations on one node at once are faster (eg changing the TODO state, setting a property, and setting a SCHEDULED timestamp on a headline)
To run the benchmark suite:
make benchmark
Memoization
For all pattern-matching functions (eg org-ml-match and org-ml-match-X), the
PATTERN parameter is processed into a lambda function which computationally
carries out the pattern matching. If there are many calls using the same or a
few unique patterns, the overhead of generating these lambdas can be avoided
through memoization by setting org-ml-memoize-match-patterns. See this
variable's documentation for details.
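A minimal sketch of a repeated match with memoization enabled is shown below. It assumes that t is an accepted value for the variable (its docstring lists the actual options) and that the pattern '(headline) matches headlines directly under the given node. The helper name is hypothetical.

```emacs-lisp
(require 'org-ml)

;; assumption: any non-nil value turns memoization on; see the variable's
;; docstring for the accepted values and their trade-offs
(setq org-ml-memoize-match-patterns t)

(defun my/child-headlines (subtree)
  "Return the headline nodes directly under SUBTREE.
Hypothetical helper; it reuses one pattern on every call, which is the
situation where memoization is expected to help."
  (org-ml-match '(headline) subtree))
```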
Version History
See changelog.