pyuca: Python Unicode Collation Algorithm implementation

This is a Python implementation of the Unicode Collation Algorithm (UCA). It passes 100% of the UCA conformance tests for Unicode 5.2.0 (Python 2.7), Unicode 6.3.0 (Python 3.3+), Unicode 8.0.0 (Python 3.5+), Unicode 9.0.0 (Python 3.6+), and Unicode 10.0.0 (Python 3.7+) with a variable-weighting setting of Non-ignorable.

What do you use it for?

In short, sorting non-English strings properly.

The core of the algorithm involves multi-level comparison. For example, café comes before caff because, at the primary level, the accent is ignored and the first word is treated as if it were cafe. The secondary level (which considers accents) then applies only to words that are equivalent at the primary level.
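The multi-level idea can be illustrated with a toy sketch that uses only the standard library. This is not how pyuca works internally (the real algorithm looks up weights in a collation element table); the `toy_sort_key` function is a hypothetical helper that approximates the primary level with base letters and the secondary level with combining accents.

```python
import unicodedata


def toy_sort_key(word):
    """Return a (primary, secondary) key: base letters first, accents second."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", word)
    # Primary level: the letters with all combining marks removed.
    primary = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: the combining marks, consulted only on a primary tie.
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    return (primary, secondary)


words = ["cafe", "caff", "caf\u00e9"]
result = sorted(words, key=toy_sort_key)
print(result)
```

Because tuples compare element by element, the accent only matters when two words have the same base letters, which is why café lands between cafe and caff.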

The Unicode Collation Algorithm and pyuca also support contraction and expansion. Contraction is where multiple letters are treated as a single unit. In Spanish, ch is treated as a letter coming between c and d so that, for example, words beginning with ch should sort after all other words beginning with c. Expansion is where a single letter is treated as though it were multiple letters. In German, ä is sorted as if it were ae, i.e. after ad but before af.
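Contraction and expansion can be sketched with a small hand-made weight table. Everything here is an assumption made for illustration: the weights, the `SPANISH_CONTRACTIONS` and `GERMAN_EXPANSIONS` tables, and the `toy_key` function are hypothetical and are not taken from allkeys.txt or from pyuca's code.

```python
# Hypothetical primary weights: a=10, b=20, c=30, d=40, ...
WEIGHTS = {ch: (i + 1) * 10 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

# Contraction: "ch" gets one weight of its own, between c (30) and d (40).
SPANISH_CONTRACTIONS = {"ch": WEIGHTS["c"] + 5}

# Expansion: "ä" (U+00E4) is weighted as the two letters "ae".
GERMAN_EXPANSIONS = {"\u00e4": "ae"}


def toy_key(word, contractions=None, expansions=None):
    """Build a tuple of primary weights, applying the given tailorings."""
    contractions = contractions or {}
    expansions = expansions or {}
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in contractions:
            # Two characters are consumed but only one weight is emitted.
            key.append(contractions[pair])
            i += 2
            continue
        # One character may expand to several weights.
        for letter in expansions.get(word[i], word[i]):
            key.append(WEIGHTS[letter])
        i += 1
    return tuple(key)


spanish = sorted(
    ["chico", "cuna", "dama", "casa"],
    key=lambda w: toy_key(w, contractions=SPANISH_CONTRACTIONS),
)
german = sorted(
    ["af", "\u00e4", "ad"],
    key=lambda w: toy_key(w, expansions=GERMAN_EXPANSIONS),
)
print(spanish)
print(german)
```

With these toy tables, chico sorts after cuna but before dama, and ä sorts after ad but before af, matching the behaviour described above.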

How to use it

Here is how to use the pyuca module.

pip install pyuca

Usage example:

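from pyuca import Collator
c = Collator()

# Plain code-point order puts "café" last, because "é" sorts after every ASCII letter.
assert sorted(["cafe", "caff", "café"]) == ["cafe", "caff", "café"]
# With UCA sort keys, "café" falls between "cafe" and "caff".
assert sorted(["cafe", "caff", "café"], key=c.sort_key) == ["cafe", "café", "caff"]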

Collator can also take an optional filename for specifying a custom collation element table.

You can also import collators for specific Unicode versions, e.g. from pyuca.collator import Collator_8_0_0. Using plain from pyuca import Collator, however, ensures that the collator version matches the version of unicodedata provided by the standard library for your version of Python.
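To see which Unicode version your interpreter's unicodedata module ships with, and which UCA version the conformance statement earlier in this README pairs with your Python version, a sketch along these lines may help. The `SUPPORTED` table and the `expected_uca_version` function are hypothetical helpers that simply restate that list; they are not pyuca's own selection code.

```python
import sys
import unicodedata

# (minimum Python version, UCA version) pairs, newest first,
# copied from the conformance statement in this README.
SUPPORTED = [
    ((3, 7), "10.0.0"),
    ((3, 6), "9.0.0"),
    ((3, 5), "8.0.0"),
    ((3, 3), "6.3.0"),
    ((2, 7), "5.2.0"),
]


def expected_uca_version(python_version):
    """Return the UCA version listed for a (major, minor) Python version."""
    for minimum, uca in SUPPORTED:
        if tuple(python_version[:2]) >= minimum:
            return uca
    return None  # older than anything listed


print("Python:", sys.version_info[:2])
print("unicodedata Unicode version:", unicodedata.unidata_version)
print("Listed UCA version:", expected_uca_version(sys.version_info))
```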

How to cite it

Tauber, J. K. (2016). pyuca: a Python implementation of the Unicode Collation Algorithm. The Journal of Open Source Software. DOI: 10.21105/joss.00021

License

Python code is made available under an MIT license (see LICENSE). allkeys.txt is made available under the similar license defined in LICENSE-allkeys.

Contacting the Developer

If you have any problems, questions or suggestions, it's best to file an issue on GitHub, although you can also contact me at [email protected].

For more of my work on linguistics and Ancient Greek, see http://jktauber.com/.
