AI for a corpus of Buddhist literature

Patrick McAllister http://www.ikga.oeaw.ac.at

Created: 2018-11-22 Thu 10:32

Materials to consider

  • Approx. 150 texts, by approx. 40 authors
  • Original language of composition: Sanskrit
  • Substantial number of texts survive only in Tibetan
  • Some early texts survive also, or only, in Chinese translation

NOT systematically marked up in any way

Problems for AI solutions?

Decomposition of Sanskrit euphonic combinations

  • tathā + bhāva ==> tathābhāva
  • “tathābhāva” can be analyzed as:
    • tathā + abhāva (possible)
    • OR tathā + bhāva (correct)

  • Example solution: https://github.com/OliverHellwig/sanskrit/ (ca. 15% of lines contain an error)
  • Possibly related class of problems: scriptio continua
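
The ambiguity in the “tathābhāva” example can be reproduced with a minimal sketch. It assumes a toy lexicon of three words and a single reverse rule (a surface “ā” may hide two merged vowels); the names REVERSE_SANDHI, LEXICON and candidate_splits are hypothetical, and this is not how the tool linked above works. The point is only that rule-based splitting yields several candidates, so choosing the correct one needs context.

```python
import unicodedata


def nfc(text):
    """Normalise to NFC so that 'ā' is always one code point."""
    return unicodedata.normalize("NFC", text)


# Toy reverse-sandhi table (assumption): only the rule for a merged "ā".
# A real system needs the full set of Sanskrit sandhi rules.
REVERSE_SANDHI = {
    nfc("ā"): [("a", "a"), ("a", nfc("ā")), (nfc("ā"), "a"), (nfc("ā"), nfc("ā"))],
}

# Hypothetical mini-lexicon, for illustration only.
LEXICON = {nfc(w) for w in ("tathā", "bhāva", "abhāva")}


def candidate_splits(surface, lexicon=LEXICON):
    """Return every (left, right) pair of lexicon words that can yield `surface`."""
    surface = nfc(surface)
    found = set()
    for i in range(1, len(surface)):
        left, right = surface[:i], surface[i:]
        # Case 1: plain juxtaposition, no sound change at the boundary.
        if left in lexicon and right in lexicon:
            found.add((left, right))
        # Case 2: the last character of `left` is a merged vowel.
        for end, start in REVERSE_SANDHI.get(surface[i - 1], []):
            left2 = surface[: i - 1] + end
            right2 = start + surface[i:]
            if left2 in lexicon and right2 in lexicon:
                found.add((left2, right2))
    return sorted(found)


print(candidate_splits("tathābhāva"))
```

Both analyses from the slide come back as candidates; the sketch has no way to rank them.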

Text alignment (editing assistance)

  1. Discovery of (approximately) similar text passages
    • parallels/quotations/silent reuse
    • possibly across languages (mainly Sanskrit <–> Tibetan)
  2. Alignment of main text ==> commentary ==> sub-commentary …
  3. Alignment of translations
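
Point 1 (same-language case only) can be sketched with character n-gram overlap. This is one common baseline, not a method proposed in these slides. Whitespace is stripped before comparison because word boundaries are unreliable in the material. The function names, the threshold of 0.5 and the sample strings are assumptions made up for illustration; cross-language discovery would need more than this.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, ignoring case and all whitespace."""
    text = "".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two n-gram sets (0.0 if either is empty)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def find_parallels(passages_a, passages_b, threshold=0.5):
    """Compare every passage pair; return (i, j, score) for pairs above threshold."""
    hits = []
    for i, a in enumerate(passages_a):
        for j, b in enumerate(passages_b):
            score = jaccard(a, b)
            if score >= threshold:
                hits.append((i, j, round(score, 2)))
    return hits


# Toy strings (assumption), written without diacritics for simplicity.
text_a = ["sarvam anityam"]
text_b = ["sarvamanityam iti", "tatha bhavah"]
print(find_parallels(text_a, text_b))
```

The all-pairs loop is quadratic, so a corpus of this size would probably need an index over n-grams before scoring.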

Realistic expectations?

  1. Many texts available digitally (not systematically marked up)
  2. Possible to create training corpus for Sanskrit <–> Tibetan alignment (mid-term)
  3. Fairly good Sanskrit <–> Tibetan dictionaries

What we’d like:

  1. Splitter for scriptio continua and compounds
  2. Tools to align text passages of various types:
    • “Lemmas”: Main text ==> Commentary
    • “Quotations/Paraphrases”: Text A ==> Text B
    • Translations
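
For the “Lemmas” case, a minimal sketch under strong assumptions: the commentary quotes the words of the main text in their original order, and a quoted word differs from the original only slightly (for example through an ending). The function align_lemmas, the cutoff of 0.8 and the placeholder words are hypothetical; the matching is greedy, so an early false match would shift everything after it.

```python
from difflib import SequenceMatcher


def align_lemmas(root_words, commentary_words, cutoff=0.8):
    """Greedy, order-preserving alignment of root-text words to commentary tokens.

    Returns a list of (root_word, commentary_index), with None where
    no sufficiently similar token is found after the previous match.
    """
    alignment = []
    cursor = 0  # never look back before the previous match
    for word in root_words:
        match = None
        for k in range(cursor, len(commentary_words)):
            ratio = SequenceMatcher(None, word, commentary_words[k]).ratio()
            if ratio >= cutoff:
                match = k
                cursor = k + 1
                break
        alignment.append((word, match))
    return alignment


# Placeholder words (assumption), not real text from the corpus.
root = ["alpha", "beta", "gamma"]
commentary = "alpha means x beta that is y gammas".split()
print(align_lemmas(root, commentary))
```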