AI for a corpus of Buddhist literature

Patrick McAllister http://www.ikga.oeaw.ac.at

Created: 2018-11-22 Thu 10:32

Materials to consider

Approx. 150 texts, by approx. 40 authors
Original language of composition: Sanskrit
A substantial number of texts survive only in Tibetan translation
Some early texts survive also, or only, in Chinese translation

NOT systematically marked up in any way

Problems for AI solutions?

Decomposition of Sanskrit euphonic combinations

tathā + bhāva ==> tathābhāva
“tathābhāva” can be analyzed as:
- tathā + abhāva (possible, since ā + a also yields ā)
- OR tathā + bhāva (correct)
Example solution: https://github.com/OliverHellwig/sanskrit/ (ca. 15% of lines contain an error)
Possibly related class of problems: scriptio continua
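
The ambiguity above can be made concrete with a minimal sketch. It assumes a toy three-word lexicon and a single vowel rule (a/ā + a/ā yields ā); the names LEXICON, VOWEL_SANDHI and candidate_splits are hypothetical and not taken from the linked repository. It only enumerates candidate splits and cannot decide which one is correct, which is the hard part.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that 'ā' is always one code point."""
    return unicodedata.normalize("NFC", text)


# Hypothetical toy lexicon, just large enough to show the ambiguity.
LEXICON = {nfc(w) for w in ("tathā", "bhāva", "abhāva")}

LONG_A = "\u0101"  # ā

# One vowel sandhi rule only: a/ā + a/ā -> ā.
# Maps the surface vowel to possible (end of left word, start of right word).
VOWEL_SANDHI = {
    LONG_A: [("a", "a"), ("a", LONG_A), (LONG_A, "a"), (LONG_A, LONG_A)],
}


def candidate_splits(surface, lexicon=LEXICON):
    """Return all two-word analyses of `surface` licensed by the lexicon."""
    surface = nfc(surface)
    results = []

    # Case 1: plain concatenation, no sound change at the boundary.
    for i in range(1, len(surface)):
        left, right = surface[:i], surface[i:]
        if left in lexicon and right in lexicon:
            results.append((left, right))

    # Case 2: one surface vowel is the merged product of two vowels.
    for i, ch in enumerate(surface):
        for left_end, right_start in VOWEL_SANDHI.get(ch, []):
            left = surface[:i] + left_end
            right = right_start + surface[i + 1:]
            if left in lexicon and right in lexicon:
                results.append((left, right))

    return results


if __name__ == "__main__":
    for left, right in candidate_splits("tathābhāva"):
        print(left, "+", right)
```

Both analyses from the slide come back as candidates. Choosing between them needs context (grammar, meaning, frequency), which is where a learned model would have to come in.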

Text alignment (editing assistance)

Discovery of (approximately) similar text passages
- parallels/quotations/silent reuse
- possibly across languages (mainly Sanskrit <-> Tibetan)
Alignment of main text ==> commentary ==> sub-commentary …
Alignment of translations
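
For the monolingual case, discovery of approximately similar passages can be sketched with character n-gram overlap. This is a baseline under stated assumptions, not an established method for this corpus: the function names, the window size, the step and the threshold are all hypothetical, and the comparison is quadratic in the number of windows. It does not address the cross-language case, which would need a dictionary or a trained model.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams after collapsing whitespace."""
    s = " ".join(text.split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def windows(tokens, size, step):
    """Yield (start index, window text) for overlapping token windows."""
    last = max(len(tokens) - size, 0)
    for start in range(0, last + 1, step):
        yield start, " ".join(tokens[start:start + size])


def find_parallels(text_a, text_b, size=8, step=4, threshold=0.5):
    """Return (score, start in A, start in B) for similar window pairs.

    Every window of text A is compared with every window of text B,
    so this is only practical for small inputs.
    """
    tokens_a, tokens_b = text_a.split(), text_b.split()
    grams_b = [(j, char_ngrams(w)) for j, w in windows(tokens_b, size, step)]
    hits = []
    for i, window_a in windows(tokens_a, size, step):
        grams_a = char_ngrams(window_a)
        for j, gb in grams_b:
            score = jaccard(grams_a, gb)
            if score >= threshold:
                hits.append((score, i, j))
    return sorted(hits, reverse=True)
```

Character n-grams are used instead of whole words because unresolved sandhi and spelling variants change word boundaries; that choice is a guess at what would suit this material, and the threshold would have to be tuned on known parallels.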

Realistic expectations?

Many texts available digitally (not systematically marked up)
Possible to create a training corpus for Sanskrit <-> Tibetan alignment (mid-term)
Fairly good Sanskrit <-> Tibetan dictionaries

What we’d like:

Splitter for scriptio continua and compounds
Tools to align text passages of various types:
- “Lemmas”: Main text ==> Commentary
- “Quotations/Paraphrases”: Text A ==> Text B
- Translations