Endangered Language Documentation in the Digital Age
DOI:
https://doi.org/10.11606/issn.2236-4242.v34i2p47-64
Keywords:
Digital Dictionaries, Natural Language Processing, Uralic Languages, Linguistic Documentation, Open Infrastructure
Abstract
We present our infrastructure to document Uralic languages, which consists of tools to write dictionaries so that entries are structured in XML (Extensible Markup Language) format. From dictionaries in XML, we can generate code for morphological analysers useful for all kinds of NLP tasks. In this article, we show the advantages of digital and machine-readable documentation. We also describe the system in the context of endangered Uralic languages.
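To make the dictionary-to-analyser step concrete, the following is a minimal sketch of how XML entries could be turned into lexc-style lexicon lines of the kind a finite-state morphological analyser is compiled from. It is not the authors' implementation: the element and attribute names (`entry`, `lemma`, `pos`, `stem`, `cont`), the toy entries, the continuation-class names, and the function `entries_to_lexc` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment. The element names, attributes and
# continuation classes are illustrative assumptions, not the schema
# described in the article.
SAMPLE_XML = """
<dictionary>
  <entry>
    <lemma pos="N" stem="kudo" cont="N_KUDO">kudo</lemma>
    <translation lang="eng">house</translation>
  </entry>
  <entry>
    <lemma pos="V" stem="mora" cont="V_MORAMS">morams</lemma>
    <translation lang="eng">sing</translation>
  </entry>
</dictionary>
"""


def entries_to_lexc(xml_text):
    """Group entries by part of speech and emit one lexc-style lexicon per group."""
    root = ET.fromstring(xml_text)
    lines_by_pos = {}
    for entry in root.iter("entry"):
        lemma = entry.find("lemma")
        # lexc entry shape: "lemma:stem ContinuationClass ;"
        line = f"{lemma.text}:{lemma.get('stem')} {lemma.get('cont')} ;"
        lines_by_pos.setdefault(lemma.get("pos"), []).append(line)

    output = []
    for pos in sorted(lines_by_pos):
        output.append(f"LEXICON {pos}_Stems")
        output.extend(lines_by_pos[pos])
        output.append("")  # blank line between lexicons
    return "\n".join(output).rstrip() + "\n"


if __name__ == "__main__":
    print(entries_to_lexc(SAMPLE_XML))
```

Because the lexicon is derived from the structured entries, a correction made once in the dictionary would carry over to the generated analyser source the next time it is regenerated, which is the practical benefit of keeping the documentation machine-readable.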
License
Copyright (c) 2021 Mika Hämäläinen, Jack Rueter, Khalid Alnajjar
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The Editorial Board authorizes free access to and distribution of published content, provided that the source is cited, that is, granting credit to the authors and Linha D'Água and preserving the full text. The author is allowed to place the final version (postprint / editor’s PDF) in an institutional/thematic repository or personal page (site, blog), immediately after publication, provided that it is available for open access and comes without any embargo period. Full reference should be made to the first publication in Linha D'Água. Access to the paper should at least be aligned with the access the journal offers.
As a legal entity, the University of São Paulo at Ribeirão Preto School of Philosophy, Sciences and Languages owns and holds the copyright deriving from the publication. To use the papers, Paidéia adopts the Creative Commons Licence, CC BY-NC non-commercial attribution. This licence permits access, download, print, share, reuse and distribution of papers, provided that this is for non-commercial use and that the source is cited, giving due authorship credit to Linha D'Água. In these cases, neither authors nor editors need any permission.
Partial reproduction of other publications
Citations of more than 500 words, reproductions of one or more figures, tables or other illustrations should be accompanied by written permission from the copyright owner of the original work with a view to reproduction in Linha D'Água. This permission has to be addressed to the author of the submitted manuscript. Secondarily obtained rights will not be transferred under any circumstance.