Endangered Language Documentation in the Digital Age

Authors

DOI:

https://doi.org/10.11606/issn.2236-4242.v34i2p47-64

Keywords:

Digital Dictionaries, Natural Language Processing, Uralic Languages, Linguistic Documentation, Open Infrastructure

Abstract

We present our infrastructure for documenting Uralic languages. It consists of tools for writing dictionaries whose entries are structured in XML (Extensible Markup Language). From these XML dictionaries, we can generate code for morphological analysers, which are useful in a wide range of NLP tasks. In this article, we show the advantages of digital, machine-readable documentation and describe the system in the context of endangered Uralic languages.
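To make the step from XML dictionaries to analyser code concrete, the following is a minimal sketch and not the authors' implementation. It reads a small XML dictionary and emits lexc-style lexicon lines of the kind finite-state morphology toolkits consume. The element and attribute names (`entry`, `lemma`, `pos`, `stem`, `cont`), the continuation class names, and the function `entries_to_lexc` are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment: the element names, attributes, and
# continuation classes are illustrative assumptions, not the real schema.
SAMPLE_XML = """
<dictionary>
  <entry>
    <lemma pos="N" stem="kala" cont="N_KALA">kala</lemma>
    <translation lang="eng">fish</translation>
  </entry>
  <entry>
    <lemma pos="V" stem="tul" cont="V_TULLA">tulla</lemma>
    <translation lang="eng">to come</translation>
  </entry>
</dictionary>
"""


def entries_to_lexc(xml_text):
    """Group dictionary entries by part of speech and render lexc-style lines."""
    root = ET.fromstring(xml_text)
    lines_by_pos = {}
    for entry in root.iter("entry"):
        lemma = entry.find("lemma")
        # lexc entry shape: "lemma:stem ContinuationClass ;"
        line = f"{lemma.text}:{lemma.get('stem')} {lemma.get('cont')} ;"
        lines_by_pos.setdefault(lemma.get("pos"), []).append(line)

    output = []
    for pos in sorted(lines_by_pos):
        output.append(f"LEXICON {pos}_Stems")
        output.extend(lines_by_pos[pos])
        output.append("")  # blank line between lexicons
    return "\n".join(output).rstrip("\n")


if __name__ == "__main__":
    print(entries_to_lexc(SAMPLE_XML))
```

Keeping the dictionary as the single structured source means that the human-readable entries and the generated lexicon are derived from the same data, which is the main advantage of machine-readable documentation that the abstract points to.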

Published

2021-09-01

How to Cite

HAMALAINEN, Mika; RUETER, Jack; ALNAJJAR, Khalid. Endangered Language Documentation in the Digital Age. Linha D’Água, São Paulo, v. 34, n. 2, p. 47–64, 2021. DOI: 10.11606/issn.2236-4242.v34i2p47-64. Available at: https://revistas.usp.br/linhadagua/article/view/181446. Accessed: 23 Nov. 2024.