Tajweed annotations for the Qur'an (riwayat Hafs). The data is available as a JSON file with exact character indices for each rule, and as individual decision trees for each rule.
You can use this data to display the Qur'an with tajweed highlighting, refine models for Qur'anic speech recognition, or - if you enjoy decision trees - improve your own recitation.
The start and end indices of each annotation are Unicode codepoint (not byte!) offsets within the Tanzil.net Uthmani Qur'an text. NOTE: the encoding of the files available from Tanzil.net has changed slightly since the annotations were generated, so please use this copy of the Qur'an text file: quran-uthmani.txt (downloaded ca. Apr 6, 2017). If you use a different Qur'an text file, you must rebuild the data file from scratch (at your own risk) - refer to the next section.
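To make the codepoint-versus-byte distinction concrete, here is a minimal Python sketch. Python `str` objects are indexed by codepoint, so the indices can be used directly as slice bounds once the text has been decoded. The helper names are hypothetical, the sample string is just an illustrative word rather than a line from the data file, and the sketch assumes that `end` is exclusive; check that against the JSON before relying on it.

```python
def slice_by_codepoint(text: str, start: int, end: int) -> str:
    """Return the annotated span, treating start/end as codepoint offsets.

    Assumption: `end` is exclusive, as in a normal Python slice.
    """
    return text[start:end]


def codepoint_to_byte_offset(text: str, index: int, encoding: str = "utf-8") -> int:
    """Convert a codepoint offset to a byte offset in the encoded text.

    Only needed when working with tools that index raw bytes.
    """
    return len(text[:index].encode(encoding))


# Illustrative sample: ba+kasra, sin+sukun, mim+kasra (six codepoints).
sample = "\u0628\u0650\u0633\u0652\u0645\u0650"

# Codepoints 2..4 cover the sin and its sukun.
span = slice_by_codepoint(sample, 2, 4)

# The same position is a different number when counted in UTF-8 bytes,
# because each of these Arabic codepoints encodes to two bytes.
byte_start = codepoint_to_byte_offset(sample, 2)
```

In languages whose strings are indexed by bytes or UTF-16 code units, convert the indices first (or iterate by codepoint) rather than slicing directly.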
tajweed_classifier.py is a script that takes Tanzil.net "Text (with aya numbers)"-style input via STDIN, and produces the tajweed JSON file (as described above) via STDOUT. It reads the decision trees from rule_trees/*.json. Note that the trees have been built to function best with the Madani text; they rely on the presence of pronunciation markers (e.g. maddah) that may not be present in other texts.
Ruleset reference
The following are renderings of the decision trees used to determine where each tajweed annotation starts and stops. Attributes are grouped by the letters they belong to, a letter being defined as a base character (e.g. ل) plus any diacritics that follow (codepoints in the Mn category). Superscript/dagger alif is counted as a base character. The numbers prefixing each attribute indicate which letter the attribute belongs to: negative referring to previous letters, positive to future letters. Attributes starting with 0_... refer to the exact character being considered. Annotations do not always start or stop on letter boundaries. Refer to tajweed_classifier.py for the definition of each attribute.