AI for a corpus of Buddhist literature
Created: 2018-11-22 Thu 10:32
Materials to consider
- Approx. 150 texts, by approx. 40 authors
- Original language of composition: Sanskrit
- Substantial number of texts survive only in Tibetan
- Some early texts survive also, or only, in Chinese translation
- Not systematically marked up in any way
Problems for AI solutions?
Decomposition of Sanskrit euphonic combinations
- tathā + bhāva ==> tathābhāva
- “tathābhāva” can be analyzed in more than one way, e.g. tathā + bhāva or tathā + abhāva (ā + a also yields ā)
- Example solution: https://github.com/OliverHellwig/sanskrit/ (ca. 15% of lines contain an error)
- Possibly related class of problems: scriptio continua
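
To make the ambiguity concrete, here is a minimal sketch of a sandhi "un-doer" that enumerates two-word splits of a surface form. It is not the method used by the project linked above. The three-word lexicon, the handful of vowel rules, and the names `LEXICON`, `REVERSE_RULES` and `split_candidates` are all assumptions made for illustration; a usable tool would need a full dictionary, the complete rule set, and some way to rank the candidates.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that 'ā' is always a single code point."""
    return unicodedata.normalize("NFC", text)


# Toy lexicon (assumption): a real system needs a full dictionary.
LEXICON = {nfc(w) for w in ("tathā", "bhāva", "abhāva")}


def _pairs(lefts, rights):
    return [(left, right) for left in lefts for right in rights]


# A few vowel sandhi rules read backwards:
# surface vowel -> possible (final of left word, initial of right word).
REVERSE_RULES = {
    nfc("ā"): _pairs(("a", nfc("ā")), ("a", nfc("ā"))),
    "e": _pairs(("a", nfc("ā")), ("i", nfc("ī"))),
    "o": _pairs(("a", nfc("ā")), ("u", nfc("ū"))),
}


def split_candidates(surface, lexicon=LEXICON):
    """Return every two-word split whose parts are both in the lexicon."""
    surface = nfc(surface)
    found = set()
    # Case 1: plain juxtaposition, no sound change at the boundary.
    for i in range(1, len(surface)):
        left, right = surface[:i], surface[i:]
        if left in lexicon and right in lexicon:
            found.add((left, right))
    # Case 2: one surface vowel stands for two merged vowels.
    for i, ch in enumerate(surface):
        for left_end, right_start in REVERSE_RULES.get(ch, []):
            left = surface[:i] + left_end
            right = right_start + surface[i + 1:]
            if left in lexicon and right in lexicon:
                found.add((left, right))
    return sorted(found)


for left, right in split_candidates("tathābhāva"):
    print(left, "+", right)
```

With this toy lexicon the sketch finds both tathā + abhāva and tathā + bhāva, which is exactly why the rules alone cannot decide the split and context has to.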
Text alignment (editing assistance)
- Discovery of (approximate) similar text passages
  - parallels/quotations/silent reuse
  - possibly across languages (mainly Sanskrit <--> Tibetan)
- Alignment of main text ==> commentary ==> sub-commentary …
- Alignment of translations
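
As a rough illustration of what discovering approximately similar passages within one language could look like, the sketch below compares passages by the overlap of their character trigrams (Jaccard similarity). This is a swapped-in baseline technique, not something the outline prescribes; the function names, the trigram size and the 0.5 threshold are assumptions. Whitespace is removed before comparison so that differing word division does not hide a match.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, ignoring case and all whitespace."""
    compact = "".join(text.lower().split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_parallels(passages_a, passages_b, threshold=0.5, n=3):
    """Return (index_a, index_b, score) for pairs at or above the threshold."""
    grams_b = [char_ngrams(p, n) for p in passages_b]
    hits = []
    for i, passage in enumerate(passages_a):
        grams_a = char_ngrams(passage, n)
        for j, other in enumerate(grams_b):
            score = jaccard(grams_a, other)
            if score >= threshold:
                hits.append((i, j, round(score, 3)))
    # Best matches first.
    return sorted(hits, key=lambda hit: -hit[2])


# Placeholder strings, not real text from the corpus.
print(find_parallels(["abc def ghi"], ["abcdef ghi", "zzzzzz"]))
```

Comparing every passage with every other is quadratic, so a corpus of about 150 texts would probably need an index (for example over n-grams) before this becomes practical. Cross-language matching needs a different signal altogether, since shared characters say nothing about Sanskrit and Tibetan.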
Realistic expectations?
- Many texts available digitally (not systematically marked up)
- Possible to create training corpus for Sanskrit <--> Tibetan alignment (mid-term)
- Fairly good Sanskrit <--> Tibetan dictionaries
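
One way the dictionaries could help bootstrap such a training corpus is to score candidate Sanskrit/Tibetan passage pairs by how many dictionary equivalents they share. The sketch below is only an assumption about how that might work: the three-entry `GLOSSARY` (Tibetan in Wylie transliteration), the function names, and the premise that the Sanskrit side is already split and lemmatized are all illustrative.

```python
# Toy glossary (assumption): Sanskrit lemma -> Tibetan equivalent (Wylie).
GLOSSARY = {
    "dharma": "chos",
    "citta": "sems",
    "buddha": "sangs rgyas",
}


def dictionary_score(skt_lemmas, tib_text, glossary=GLOSSARY):
    """Fraction of known Sanskrit lemmas whose equivalent occurs in tib_text."""
    known = [lemma for lemma in skt_lemmas if lemma in glossary]
    if not known:
        return 0.0
    found = sum(1 for lemma in known if glossary[lemma] in tib_text)
    return found / len(known)


def best_match(skt_lemmas, tib_candidates, glossary=GLOSSARY):
    """Return (index, score) of the best-scoring Tibetan candidate."""
    scores = [dictionary_score(skt_lemmas, t, glossary) for t in tib_candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


print(best_match(["buddha", "dharma"], ["sems", "sangs rgyas kyi chos"]))
```

Plain substring matching is crude (for example, "sems" also occurs inside "sems can"), so pairs proposed this way would still need manual checking before they serve as training data.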
What we’d like:
- Splitter for scriptio continua and compounds
- Tools to align text passages of various types:
  - “Lemmas”: Main text ==> Commentary
  - “Quotations/Paraphrases”: Text A ==> Text B
  - Translations
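
For the “Lemmas” case, a very simple baseline is to walk through the words of the main text in order and look for each one's next occurrence in the commentary. The sketch below assumes that both texts are already tokenized and that the commentary repeats the words in the same form and order, which real commentaries often do not; `align_lemmas` is a hypothetical name.

```python
def align_lemmas(root_words, commentary_tokens):
    """Map each main-text word to the position where the commentary repeats it.

    The search moves left to right, so matches keep the main text's order.
    Words that are not found are paired with None.
    """
    alignment = []
    position = 0
    for word in root_words:
        try:
            index = commentary_tokens.index(word, position)
        except ValueError:
            alignment.append((word, None))
            continue
        alignment.append((word, index))
        position = index + 1
    return alignment


# Placeholder tokens, not real text from the corpus.
root = ["w1", "w2", "w3"]
commentary = ["w1", "gloss", "gloss", "w3", "gloss"]
print(align_lemmas(root, commentary))
```

Exact matching is the weak point: a commentary may cite a word in a different inflection or resolve a compound, so in practice this step would depend on the splitter and on some form of approximate matching.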