Toward the compilation of C-ORAL-ANGOLA
an informal spontaneous speech corpus of Angolan Portuguese
DOI:
https://doi.org/10.11606/issn.2176-9419.v20iEspecialp139-157
Keywords:
Angolan Portuguese, Spontaneous speech, Corpus, Compilation
Abstract
The paper introduces the architecture and compilation criteria of an Angolan Portuguese spontaneous speech corpus. After a brief introduction to the linguistic situation in Angola, we describe in depth the recording modalities and the treatment of the multiple kinds of sociolinguistic variation documented, with special attention to diaphasic variation. The first twenty-seven recorded texts are then detailed. These will make up a minicorpus of at least 30,000 words, which will be prosodically segmented and text-to-speech aligned. The last part of the article is dedicated to the methodological steps taken in compiling the corpus: acoustic quality definition, transcription criteria, prosodic segmentation procedures, revision, alignment, and statistical validation.
License
Copyright is transferred to the journal for the online publication, with free access, and for the printing in paper documents. Copyright may be preserved for authors who wish to republish their work in collections.






