This paper describes a collection of 20k ELAN annotation files harvested from five different endangered language archives. The ELAN files form a very heterogeneous set, but the hierarchical configuration of their tiers, in conjunction with the tier content, makes it possible to identify transcriptions, translations, and glosses, which can then be queried across archives. Small analyses of graphemes (transcription tier), grammatical and lexical glosses (gloss tier), and semantic concepts (translation tier) show the viability of the approach. The use of identifiers from OLAC, Wikidata, and Glottolog allows for better integration of the data from these archives into the Linguistic Linked Open Data Cloud.
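As an illustration of what "hierarchical configuration of tiers" refers to, the following minimal sketch reads the tier hierarchy and tier content from an ELAN (.eaf) document using only the Python standard library. It relies on the `TIER`, `TIER_ID`, `PARENT_REF`, and `ANNOTATION_VALUE` names from the EAF format; the sample document, the tier names `tx` and `ft`, and the helper functions `read_tiers` and `depth` are hypothetical and are not taken from the paper. The sketch does not reproduce the paper's heuristics for classifying tiers; it only exposes the information such heuristics could draw on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF document: a top-level tier "tx"
# and a dependent tier "ft" that points to it via PARENT_REF.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="tx" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>source-language text</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="ft" PARENT_REF="tx" LINGUISTIC_TYPE_REF="translation">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>free translation</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_xml):
    """Map each tier ID to its parent tier ID (or None) and its annotation values."""
    root = ET.fromstring(eaf_xml)
    tiers = {}
    for tier in root.iter("TIER"):
        values = [v.text or "" for v in tier.iter("ANNOTATION_VALUE")]
        tiers[tier.get("TIER_ID")] = {
            "parent": tier.get("PARENT_REF"),  # None for top-level tiers
            "values": values,
        }
    return tiers


def depth(tiers, tier_id):
    """Count how many ancestors a tier has; 0 means it is a top-level tier."""
    seen = {tier_id}
    steps = 0
    parent = tiers[tier_id]["parent"]
    # Stop at the root, at a dangling reference, or if a cycle is detected.
    while parent is not None and parent in tiers and parent not in seen:
        seen.add(parent)
        steps += 1
        parent = tiers[parent]["parent"]
    return steps


tiers = read_tiers(SAMPLE_EAF)
for tier_id, info in tiers.items():
    print(tier_id, depth(tiers, tier_id), info["values"])
```

A tier's depth and its content (for example, whether the values look like running text or like morpheme glosses) are the two kinds of evidence the abstract mentions; how they are combined is left open here.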

Author: Nordhoff, Sebastian (Leibniz-Zentrum Allgemeine Sprachwissenschaft-ZAS Berlin)

Conference: LREC 2020 - Proceedings of the Workshop about Language Resources for the SSH Cloud (LR4SSHOC)

Date: May 2020

Publisher: European Language Resources Association