cluegen - Data Classes From Type Clues
Cluegen is a library that allows you to define data classes using Python type clues. Here's an example of how you use it:
from cluegen import Datum
class Coordinates(Datum):
x: int
y: int
The resulting class works in a well civilised way, providing the
usual __init__()
and __repr__()
methods that you'd normally have
to type out by hand:
>>> a = Coordinates(2, 3)
>>> a
Coordinates(x=2, y=3)
>>> a.x
2
>>> a.y
3
>>>
Inheritance works as well--if you add new attributes in a subclass they get added to the already existing attributes. For example:
class Coordinates3(Coordinates):
z : int
>>> c = Coordinates3(1,2,3)
>>> c
Coordinates3(x=1, y=2, z=3)
>>>
If you're using Python-3.10, you can also use the new match
statement.
def magnitude(c):
match c:
case Coordinates3(x, y, z):
return math.sqrt(x*x + y*y + z*z)
case Coordinates(x, y):
return math.sqrt(x*x + y*y)
It's easy!
Wait, hasn't this already been invented?
At this point, naysayers will be quick to point out that "well,
actually you could just use @dataclass
from the standard library."
Othe.rs migh.t help.fully sugg.est usag.e of the attr.s libr.ary.
And they might have a point. I mean, sure, you could write your class
like this:
from dataclasses import dataclass
@dataclass
class Coordinates:
x: int
y: int
Yes. Yes, you could do that if you wanted your class to be slow to
import, wrapped up by more than 1000 lines of tangled decorator magic,
and inflexible. Or you could use cluegen! Cluegen is tiny, extensible,
provides the same notational convenience, and results in classes that
import about 20x faster (see the file perf.py
for a benchmark).
Under the hood, cluegen
works by dynamically creating code for
methods such as __init__()
and __repr__()
. This code looks
exactly the same as code you would normally write by hand. It's the
same kind of code that the @dataclass
decorator creates. A notable
feature of cluegen
however, is that all of its code generation is
"lazy." That is, no methods are generated until they're actually
needed during the execution of your program. This substantially
reduces import and startup time for situations where a program might
only be using a subset of the defined data classes. You also don't pay
a penalty for features you aren't using. And even if do use all the
features, it's still faster than dataclasses. Phfft!
Extending Cluegen
cluegen
is customizable in interesting ways. For example, suppose
you wanted to add your own custom code generation method to the
Datum
class. Here's an example of how you could do that:
from cluegen import Datum, cluegen, all_clues
class Mytum(Datum):
@cluegen
def as_dict(cls):
clues = all_clues(cls)
return ('def as_dict(self):\n' +
' return {\n' +
'\n'.join(f' {key!r}: self.{key},\n' for key in clues) +
'}\n')
class Point(Mytum):
x: int
y: int
Now, a test:
>>> p = Point(2,3)
>>> p.as_dict()
{ 'x': 2, 'y': 3 }
>>>
In the above example, the decorated as_dict()
method is presented
the class. In this case, cls
would be Point
. The all_clues()
function is a utility function that collects all type-clues from a
class including those from base classes. For this example, it returns
a dictionary {'x': int, 'y': int}
. The value returned by
as_dict()
is a text-string containing the implementation of the
actual as_dict()
method as it would be if you had written it by
hand. This text string is executed once to create a method that
replaces the decorated version. From that point forward, the class
uses the generated code instead.
cluegen
doesn't have too many other bells and whistles--the entire
implementation is about 100 lines of code. It's something that you
can understand, modify, and play around with.
A True Story
Once, there was this developer. For the sake of this story, let's call him "Dave." As Dave was wont to do, he liked to write compilers. A compiler is a natural place to use something fancy like a dataclass--especially for all of the tree structures. So, Dave did just that:
from dataclasses import dataclass
@dataclass
class Node:
pass
@dataclass
class Expression(Node):
pass
@dataclass
class Statement(Node):
pass
@dataclass
class Integer(Expression):
value: int
@dataclass
class BinOp(Expression):
op: str
left: Expression
right: Expression
@dataclass
class UnaryOp(Expression):
op: str
operand: Expression
@dataclass
class PrintStatement(Statement):
value: Expression
# Example
node = PrintStatement(BinOp('+', Integer(3), BinOp('*', Integer(4), Integer(5))))
This all worked great--better than expected in fact. However, one day, Dave thought it would be useful to add an optional line number attribute to all of the nodes. Naturally, this seemed like something that could be easily done on the base class:
@dataclass
class Node:
lineno:int = None
Dave thought wrong! Dataclasses explode in a fireball if you do this.
No, not optional attributes. Not, base classes. Alas, the only
solution seemed to involve copying a lineno
attribute to end of
every class. If Dave had had a clue about cluegen, he could have easily
solved this problem by just adding a minor tweak to the code generation for __init__()
:
from cluegen import Datum, all_clues, cluegen
class Nodum(Datum):
lineno = None
@cluegen
def __init__(cls):
clues = all_clues(cls)
args = ', '.join(f'{name}={getattr(cls,name)!r}'
if hasattr(cls, name) and not isinstance(getattr(cls, name), types.MemberDescriptorType) else name
for name in clues)
if args:
args += ','
body = '\n'.join(f' self.{name} = {name}' for name in clues)
body += '\n self.lineno = lineno'
return f'def __init__(self, {args} *, lineno=None):\n{body}\n'
class Expression(Nodum):
pass
class Statement(Nodum):
pass
class Integer(Expression):
value: int
class BinOp(Expression):
op: str
left: Expression
right: Expression
class UnaryOp(Expression):
op: str
operand: Expression
class PrintStatement(Statement):
value: Expression
Now, it works exactly as desired:
>>> a = Integer(23)
>>> b = Integer(23, lineno=123)
>>> b.value
23
>>> b.lineno
123
>>>
The moral of this story is that cluegen represents a different kind a power--the power to do what YOU want as opposed what THEY allow. It's all about YOU!
On a Serious Note
On the subject of you, the problem of generating "boilerplate code"
such as __init__()
and __repr__()
methods is something that
commonly arises in practice. In solving that problem you can either
learn to use a tool and accept its limitations and complexities. Or
you can learn to use an approach. cluegen
is very much about the
latter. The whole basic "idea" embodied by cluegen
is implemented in
about 20 lines of code (the @cluegen
decorator). The rest of it is
really just an example of that one idea put into practice.
In the big picture, learning an "approach" is a much more powerful concept. It's much more flexible and you can custom tailor it to exactly what you need. Tools, on the other hand, add dependencies to your project. They also sprout complexity as more and more features get added to accommodate every possible use-case that anyone would ever want to do with the tool. You don't need that.
Makin Your Own Datum Class
The provided Datum
class generates code for a common set of default
methods. You really don't need to use this if you want to go in a
completely different direction. For example, suppose that you
wanted to abandon type hints and generate code based on __slots__
instead. Here's an example of how you could do it:
from cluegen import DatumBase, cluegen
def all_slots(cls):
slots = []
for cls in cls.__mro__:
slots[0:0] = getattr(cls, '__slots__', [])
return slots
class Slotum(DatumBase):
__slots__ = ()
@classmethod
def __init_subclass__(cls):
super().__init_subclass__()
cls.__match_args__ = tuple(all_slots(cls))
@cluegen
def __init__(cls):
slots = all_slots(cls)
return ('def __init__(self, ' + ','.join(slots) + '):\n' +
'\n'.join(f' self.{name} = {name}' for name in slots)
)
@cluegen
def __repr__(cls):
slots = all_slots(cls)
return ('def __repr__(self):\n' +
f' return f"{cls.__name__}(' +
','.join('%s={self.%s!r}' % (name, name) for name in slots) + ')"'
)
Some of the string formatting might take a bit of pondering. However, here is an
example of how you'd use Slotum
:
>>> class Point(Slotum):
... __slots__ = ('x', 'y')
...
>>> p = Point(2,3)
>>> p
Point(x=2,y=3)
>>> class Point3(Point):
... __slots__ = ('z',)
...
>>> p3 = Point3(2,3,4)
>>> p3
Point3(x=2,y=3,z=4)
>>>
Theory of Operation
Cluegen is based on Python's descriptor protocol. In a nutshell, whenever
you access an attribute of a class, Python looks for an object that
implements a magic __get__()
method. If found, it invokes __get__()
with the associated instance and class. Cluegen uses this to generate
code on first-access to special methods such as __init__()
. Here
is an example of the machinery at work.
First, define a class:
>>> class Point(Datum):
... x: int
... y: int
...
>>>
Now, look at the __init__()
method in the class dictionary. You'll
see that's some kind of strange "ClueGen" instance:
>>> Point.__dict__['__init__']
<__main__.ClueGen___init__ object at 0x102ec1240>
>>>
This object represents the "ungenerated" method. If you touch the
__init__
attribute on the class in any way, you'll see the Cluegen
object disappear and be replaced by a proper function:
>>> Point.__init__
<function __init__ at 0x102e208c8>
>>> Point.__dict__['__init__']
<function __init__ at 0x102e208c8>
>>>
This is the basic idea--code generation on first access to an attribute. Inheritance adds an extra wrinkle into the equation. Suppose you define a subclass:
>>> class Point3(Point):
... z: int
...
>>> Point3.__dict__['__init__']
<__main__.ClueGen___init__ object at 0x102ec1240>
>>>
Here, you'll see the "ClueGen" object make a return to the class dictionary. Again, it gets replaced when it's first accessed. Here's what happens at a low level when you make an instance:
>>> i = Point3.__dict__['__init__']
>>> i.__get__(None, Point3)
<function __init__ at 0x102e20950>
>>> Point3.__init__
<function __init__ at 0x102e20950>
>>> p = Point3(1,2,3)
>>> p
Point3(x=1, y=2, z=3)
>>>
For more reading, look for information on Python's "Descriptor Protocol." This is the same machinery that makes properties, classmethods, and other features of the object system work.
Questions and Answers
Q: What methods does cluegen
generate?
A: By default it generates __init__()
, __repr__()
, __iter__()
,
and __eq__()
methods. __match_args__
is also defined to assist with pattern matching.
Q: Does cluegen
enforce the specified types?
A: No. The types are merely clues about what the value might be and
the Python language does not provide any enforcement on its own. The
types might be useful in an IDE or third-party tools that perform
type-checking or linting. You could probably extend cluegen
to
enforce types if you wanted though.
Q: Does cluegen
use any advanced magic such as metaclasses?
A: No. The Datum
base class is a plain Python class. It defines an
__init_subclass__()
method to assist with the management of
subclasses, but nothing other than the standard special methods
such as __init__()
, __repr__()
, __iter__()
, and __eq__()
are
defined. Python's descriptor protocol is used to drive code generation.
Q: How do I install cluegen
?
A: There is no setup.py
file, installer, or an official release. You
install it by copying the code into your own project. cluegen.py
is
small. You are encouraged to copy and modify it to your own purposes.
Q: But what if new features get added?
A: What new features? The best new features are no new features.
Q: How do you pronounce and use cluegen
in a sentence?
A: You should pronounce it as "kludg-in" as in "runnin" or "trippin". So, if someone asks "what are you doing?", you don't say "I'm using cluegen." No, you'd say "I'm kludgin up some classes." The latter is more accurate as it describes both the tool and the thing that you're actually doing. Accuracy matters.
Q: How do you pronounce and use cluegen
while live-streaming?
A: "Oh. Oh. I'm totally kludgin it! Yes! YES! YES!!! OH! MY! GOD!!!!!"
Q: Who maintains cluegen?
A: If you're using it, you do. You maintain cluegen.
Q: Who wrote this?
A: cluegen
is the work of David Beazley. http://www.dabeaz.com.