# AThEME Verona-Trento Corpus
## Description
The AThEME Verona-Trento Corpus is a spoken corpus composed of data collected during the AThEME (Advancing the European Multilingual Experience, 2014–2019) project, within Work Package 2 ‘Regional Languages’, by the units of Verona (Prof. Birgit Alber, Prof. Andrea Padovan, Prof. Stefan Rabanus, Prof. Alessandra Tomaselli) and Trento (Prof. Ermenegildo Bidese, Prof. Jan Casalicchio, Prof. Patrizia Cordin). The AThEME project was a large-scale European project that took “an integrated approach towards the study of multilingualism in Europe by incorporating and combining linguistic, cognitive and sociological perspectives; by studying multilingualism in Europe at three different levels of societal magnitude, viz. the individual multilingual citizen, the multilingual group, and the multilingual society; by using a palate of research methodologies, ranging from fieldwork methods to various experimental techniques and advanced EEG/ERP technologies” (see the project description at https://cordis.europa.eu/project/id/613465). The contribution of the Trento/Verona units was titled ‘Germanic-Romance language contact in the Southern-Central Alps’, and the data was collected through on-site linguistic fieldwork. This corpus is composed of the resulting audio files and transcriptions.
The corpus contains the responses to two questionnaires: a phonological questionnaire and a morpho-syntactical questionnaire. The phonological questionnaire targeted the obstruent inventory, final devoicing, /s/ retraction, and the realization of /r/. The morpho-syntactical questionnaire targeted adjectives (position and inflection of attributive, predicative, and adverbial adjectives; comparatives and superlatives), pronouns (personal-pronoun paradigm for case, number, and gender marking), nouns and articles (gender; proper names), the formation of movement verbs (prefix vs. locative particle), the syntax of subject and object pronouns and clitics (enclisis/proclisis), negative concord, pro-drop, complementizers, auxiliary selection, and restructuring.
The corpus also contains data on the Germanic minority languages of Timau and Sauris. The Timau data was collected during the PRIN 2017 ‘Models of language variation and change: new evidence from language contact’ project in the unit located at the University of Verona (Prof. Alessandra Tomaselli, Dott. Francesco Zuin). The Sauris data was collected in 2017 by Prof. Alessandra Tomaselli (University of Verona) and Prof. Ermenegildo Bidese (University of Trento). 
**Contact:** anne.kruijt@univr.it

## Authors
- Alessandra Tomaselli (University of Verona)
- Anne Kruijt (University of Verona)
- Birgit Alber (Free University of Bozen-Bolzano)
- Ermenegildo Bidese (University of Trento) 
- Jan Casalicchio (University of Trento)
- Patrizia Cordin (University of Trento)
- Joachim Kokkelmans (Free University of Bozen-Bolzano)
- Andrea Padovan (University of Verona)
- Stefan Rabanus (University of Verona)
- Francesco Zuin (University of Udine)

**Readme structure**
1. General
2. Abbreviations
3. Data Structure
4. Additional Information

## 1. General

* **Creator** Anne Kruijt

* **Date of creation** 2022-12

* **Size** 7,544 audio files (2.06 GB); 2,162 transcriptions

* **License** Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/ 

* **Acknowledgments**
- *Project PRIN 2017 “Models of language variation and change: new evidence from language contact”* (2017-2020). National coordinator: Maria Rita Manzini; University of Verona coordinator: Alessandra Tomaselli.
Project code: Prot. 2017K3NHHY
Funding organization: Ministero dell’Università e della Ricerca
- *Project AThEME (Advancing the European Multilingual Experience)*
Project grant: n. 613465
Funding organization: European Seventh Framework Programme for Research, Technological Development and Demonstration (EU)
- *General acknowledgments:* We would like to thank all participants who dedicated their time and effort to the project. Special thanks go to Lucia Protto (of the Centro etnografico “'s haus van der Zahre” in Sauris) and Emily Siviero (Free University of Bozen-Bolzano) for their invaluable help in acquiring the contacts and data consent forms for the creation of this corpus.

## 2. Abbreviations

**Languages**

* "cim" = Cimbrian
* "lldfa" = Fassan (Ladin)
* "lldfo" = Fodom (Ladin)
* "mhn" = Mòcheno
* "tre" = Trentino
* "tir" = Tyrolean
* "vec" = Venetan
* "tis" = Timavese/Tischlbongarisch
* "zah" = Saurano/Zahrisch


**Phonological phenomena**

* "obstr" = obstruent consonants 
* "sch" = /s/ retraction
* "r" = realization of /r/ 
* "V" = vowel

**Syntactic phenomena**

* "DP" = Determiner Phrase 
* "COMP" = Complementizer
* "V2" = Verb second
* "SG" = singular
* "NP" = Noun Phrase


## 3. Data Structure

File structure under each language variety is identical and organized as follows: 

```
AThEME Verona-Trento Corpus
¦   ReadMe.txt
¦
+--- Audio folders
¦      + cim
¦         | F0190_cim_U11.flac
¦         | F0194_cim_U11.flac
¦         | S026_cim_U11.flac
¦         | S085_cim_U11.flac
¦         | S085_cim_U12.flac
¦         | ...
¦      + lldfa
¦         | ...(equivalent to "cim")
¦      + lldfo
¦         | ...(equivalent to "cim")
¦      + mhn
¦         | ...(equivalent to "cim")
¦      + tre
¦         | ...(equivalent to "cim")
¦      + tir
¦         | ...(equivalent to "cim")
¦      + vec
¦         | ...(equivalent to "cim")
¦      + tis
¦         | ...(equivalent to "cim")
¦      + zah
¦         | ...(equivalent to "cim")
+--- Metadata
     +--- participants.ods
     +--- syntax.ods
     +--- phonology.ods
     +--- syntax_transcription.ods
```

As shown above, the AThEME Verona-Trento Corpus consists of two main folders. The audio folder contains the segmented audio recordings collected from the speakers. The metadata folder contains tables with the relevant linguistic information and sociolinguistic information about the speakers, as well as the transcriptions of the audio files where available. In addition, this readme file provides the main information about the corpus.
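As a rough illustration of this layout, the following Python sketch counts the `.flac` files in each language-variety subfolder. It assumes that `audio_root` points at the directory that directly contains the variety folders (`cim`, `lldfa`, ...); the function name `count_audio_files` is hypothetical and not part of the corpus.

```python
from pathlib import Path

# Language-variety codes as listed in the Abbreviations section.
VARIETIES = ("cim", "lldfa", "lldfo", "mhn", "tre", "tir", "vec", "tis", "zah")


def count_audio_files(audio_root):
    """Return {variety: number of .flac files} for each variety folder found.

    `audio_root` is assumed to be the directory that directly contains the
    variety subfolders; varieties without a folder are left out of the result.
    """
    counts = {}
    for variety in VARIETIES:
        folder = Path(audio_root) / variety
        if folder.is_dir():
            counts[variety] = sum(1 for _ in folder.glob("*.flac"))
    return counts
```

Calling `count_audio_files("path/to/audio")` on a local copy gives a quick check that the download is complete.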

**Audio folders**

There are audio recordings in nine language varieties for two different levels of linguistic analysis: phonology and syntax. The investigation of each linguistic domain involves a different type of linguistic stimulus: single words for phonology and entire sentences for syntax. 

The audio file name always mentions the stimulus ID, the language variety, and the user ID (e.g., S026_cim_U12). The first letter of the stimulus ID indicates the linguistic domain under investigation:

* "F" = Phonology (Fonologia)
* "S" = Syntax (Sintassi)

The following numbers indicate the associated stimulus word/sentence which can be found in the files ‘phonology.ods’ and ‘syntax.ods’ located in the metadata folder (e.g., S026 is stimulus 26 of the syntactical questionnaire). 

The second part of the ID is the code for the specific language variety (e.g., "cim" indicates that the language variety is Cimbrian):

* "cim" = Cimbrian
* "lldfa" = Fassan (Ladin)
* "lldfo" = Fodom (Ladin)
* "mhn" = Mòcheno
* "tre" = Trentino
* "tir" = Tyrolean
* "vec" = Venetan
* "tis" = Timavese/Tischlbongarisch
* "zah" = Saurano/Zahrisch

The last part of the audio ID is the User ID (e.g. U16) for which the sociolinguistic information can be found in the file ‘participants.ods’, located in the metadata folder. 

Some speakers recorded more than one audio file for the same stimulus. These files are reported in the corpus as follows: S042_cim_U13a, S042_cim_U13b, etc. 
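
The naming convention described above can be parsed programmatically. The following Python sketch is only an illustration: the regular expression is inferred from the example names in this readme (e.g., `F0190_cim_U11`, `S042_cim_U13a`), and the names `AudioName` and `parse_audio_name` are hypothetical helpers, not part of the corpus.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the examples in this readme; an assumption, not a spec.
AUDIO_NAME = re.compile(
    r"^(?P<domain>[FS])(?P<number>\d+)"  # stimulus ID: F/S plus digits
    r"_(?P<variety>[a-z]+)"              # language-variety code, e.g. cim
    r"_(?P<user>U\d+)"                   # user ID, e.g. U13
    r"(?P<take>[a-z]?)$"                 # optional letter for repeated recordings
)

DOMAINS = {"F": "phonology", "S": "syntax"}


class AudioName(NamedTuple):
    stimulus_id: str
    domain: str
    variety: str
    user_id: str
    take: Optional[str]


def parse_audio_name(name: str) -> AudioName:
    """Split an audio file name (with or without .flac) into its parts."""
    stem = name[:-5] if name.endswith(".flac") else name
    match = AUDIO_NAME.match(stem)
    if match is None:
        raise ValueError(f"unexpected audio file name: {name!r}")
    return AudioName(
        stimulus_id=match["domain"] + match["number"],
        domain=DOMAINS[match["domain"]],
        variety=match["variety"],
        user_id=match["user"],
        take=match["take"] or None,
    )
```

For example, `parse_audio_name("S042_cim_U13a.flac")` yields the stimulus ID `S042`, the domain `syntax`, the variety `cim`, the user ID `U13`, and the take `a`.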
 
**Metadata folder**

This folder contains four tables with the relevant information about the speakers and the linguistic stimuli:

* **Participants**

The speaker information includes:
- USER ID (included in the audio file name)
- Language variety
- Geographic location: name of community and GeoNames (https://www.geonames.org/) code
- Personal information: gender and age of participants
- Year of data collection
- Linguistic profile: language proficiency, frequency of use, and contexts of use (whether the language is used with family and/or friends)

* **Phonology**
The phonological questionnaire investigates three main phonological phenomena across language varieties: 
- the obstruent consonants
- /s/ retraction before consonant
- the realization of /r/ 

These phenomena have been investigated within specific contexts. 

The metadata for the phonology section includes the following information:
- STIMULUS ID (included in the audio file name)
- Language variety
- Graphical rendering of item (Graphy)
- Gloss in Standard German and/or Standard Italian
- Investigated phenomenon
- Word context (e.g., word initial, medial, final position)
- Target phoneme/specific context under investigation
- VinKo Corpus (http://hdl.handle.net/20.500.12124/46) reference for words elicited in both corpora
- Notes: indicates items that were realized differently from the target items and, for Trentino words, the locations in which the item was elicited, since not all items were elicited in every location.

* **Syntax**
This section investigates the following syntactic topics: 
- Syntax of the adjective within DP
- Syntax of clitics
- Negative concord
- Pro drop
- Complementizers
- Locative particles
- Auxiliary selection
- Pronouns
- Proper name syntax

The metadata for the syntax section includes the following information:
- STIMULUS ID (included in the audio file name)
- Sentence in Standard German and/or Standard Italian
- Verbal context of the stimulus in Standard German and/or Italian (where applicable).
- Syntactic topic
- Linguistic variable under investigation
- VinKo Corpus (http://hdl.handle.net/20.500.12124/46) ID for sentences elicited in both corpora
- ATHEME ID (for internal use only) 
- Questionnaire version/variety for which the sentence was elicited.

* **Transcriptions**

The transcription file includes:
- Sentence ID (identical to associated audio file name)
- Transcription of the audio file in local orthography
- Stimulus sentence and verbal context
- Language variety
- VinKo Corpus (http://hdl.handle.net/20.500.12124/46) ID for sentences elicited in both corpora
- ATHEME ID (for internal use only) 
- Audio: yes or no. ‘Yes’ indicates that the corpus contains an audio file to go with the transcription. ‘No’ indicates that no audio is available. 
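
As a hedged illustration of how the transcription table relates to the audio folders, the sketch below builds the expected audio path for every row marked as having audio. It assumes the table has already been loaded into a list of dictionaries (for instance with a spreadsheet library that reads `.ods`), that the column headers are literally `Sentence ID` and `Audio`, and that sentence IDs carry no file extension. These are assumptions about the file, and `expected_audio_paths` is a hypothetical helper.

```python
from pathlib import Path


def expected_audio_paths(rows, audio_root):
    """Return the expected .flac path for each row whose Audio value is 'yes'.

    `rows` is a list of dicts with the assumed keys "Sentence ID" and "Audio".
    The variety folder is taken from the second underscore-separated part of
    the sentence ID (e.g. "cim" in "S026_cim_U11").
    """
    paths = []
    for row in rows:
        if str(row["Audio"]).strip().lower() != "yes":
            continue  # transcription only, no recording in the corpus
        name = str(row["Sentence ID"]).strip()
        variety = name.split("_")[1]
        if not name.endswith(".flac"):
            name += ".flac"
        paths.append(Path(audio_root) / variety / name)
    return paths
```

Checking each returned path with `Path.exists()` is one way to verify a local copy against the table.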

## 4. Additional information

* **mhn**

* The original phonological questionnaire used Tyrolean stimuli to elicit data; however, the translations often did not provide the target phonemes. The questionnaire was therefore restructured by Birgit Alber and Joachim Kokkelmans to fit the target stimuli that were found.

* **vec**

* The Trentino phonology questionnaire was used to elicit the data.

* No audio files are available for the syntax questionnaires, only written transcriptions.

* **tis**

* Only a syntactical questionnaire was administered in Timau; therefore, no phonological items exist for this variety.

* **zah**

* Only a syntactical questionnaire was administered in Sauris; therefore, no phonological items exist for this variety.
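
The per-variety notes above can be encoded as a small lookup, for example to skip combinations that are known to have no audio. This Python sketch covers only the exceptions stated in this section and assumes every other domain/variety combination has audio; individual items may still lack a recording, as indicated in the transcription table. The names below are hypothetical.

```python
# Exceptions taken from the notes in this section; all other combinations
# are assumed (not guaranteed) to have audio.
NO_AUDIO = {("syntax", "vec")}  # written transcriptions only
NO_ITEMS = {("phonology", "tis"), ("phonology", "zah")}  # questionnaire not given


def audio_expected(domain, variety):
    """Return False for combinations documented above as lacking audio."""
    key = (domain, variety)
    return key not in NO_AUDIO and key not in NO_ITEMS
```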



**Scientific Publications** 

* Alber, Birgit, and Joachim Kokkelmans. 2022. “Typology and language change: The case of truncation”. In: O. Matushansky, L. Roussarie, M. Russo, E. Soare, & S. Wauquier (Eds.), Isogloss: Romance Languages and Linguistic Theory 17, Vol. 8: 1–17. 
https://doi.org/10.5565/rev/isogloss.124
* Alber, Birgit, Joachim Kokkelmans, and Stefan Rabanus. 2021. “Preconsonantal s-Retraction in the Alps: Germanic, Romance, Slavic.” STUF - Language Typology and Universals 74 (1): 17–38. https://doi.org/10.1515/stuf-2021-1022.
* Alber, Birgit, and Stefan Rabanus. 2018. “Die Sibilanten des Zimbrischen: Konservativität durch Sprachkontakt.” In: Stefan Rabanus (Ed.), Deutsch als Minderheitensprache in Italien: Theorie und Empirie kontaktinduzierten Sprachwandels. Hildesheim/Zürich/New York: Olms: 19–47.
* Bidese, Ermenegildo, Andrea Padovan and Claudia Turolla. 2018. “Mehrsprachigkeit in den zimbrischen Sprachinseln anhand einiger syntaktischen Phänomene”. In: Nicole Eller-Wildfeuer, Paul Rössler, Alfred Wildfeuer (Eds.), Alpindeutsch. Einfluss und Verwendung des Deutschen im alpinen Raum; Jahrbuch der Johann-Andreas-Schmeller-Gesellschaft 2017. Ed Vulpes. ISBN 978-3-939112-77-8
* Bidese, Ermenegildo, Andrea Padovan and Claudia Turolla. 2019. “Adjective Orders in Cimbrian DPs”. In: Andreas Trotzke and Eva Wittenberg (Eds.), LINGUISTICS, vol. 57 Special Issue: Adjective order through a Germanic lens: 373-394, ISSN: 1613-396X, doi: 10.1515/ling-2019-0004
* Casalicchio, Jan, and Patrizia Cordin. 2020. Grammar of Central Trentino: A Romance Dialect from North-East Italy. Brill. https://brill.com/view/title/55777. 
* Casalicchio, Jan, and Andrea Padovan. 2018. “Komplementierer im Zimbrischen”. In: Stefan Rabanus (Ed.), Deutsch als Minderheitensprache in Italien. Theorie und Empirie kontaktinduzierten Sprachwandels. Germanistische Linguistik: Themenheft.
* Casalicchio, Jan, and Andrea Padovan. 2019. “Contact-induced phenomena in the Alps”. In: S. Cruschina, A. Ledgeway and E.M. Remberger (Eds.), Italian Dialectology at the Interface. Amsterdam: John Benjamins [Linguistik Aktuell/Linguistics Today]: 237-255.
* Casalicchio, Jan. 2020. “Capitolo 4: Il ladino e i suoi idiomi”. In: Paul Videsott, Ruth Videsott, and Jan Casalicchio (Eds.), Manuale di Linguistica Ladina. Berlin: De Gruyter [Manuals of Romance Linguistics/Manuels de Linguistique Romane]: 144-201.
* Cordin, Patrizia, Stefan Rabanus, Birgit Alber, Antonio Mattei, Jan Casalicchio, Alessandra Tomaselli, Ermenegildo Bidese, and Andrea Padovan. 2018. “VinKo, Versione 2 (20.12.2018, 09:20).” In: Korpus Im Text, Serie A, 13739. http://www.kit.gwi.uni-muenchen.de/?p=13739&v=2.
* Kruijt, Anne, Patrizia Cordin, and Stefan Rabanus. 2023. "On the validity of crowdsourced data". In: Elissa Pustka, Carmen Quijada Van den Berghe, and Verena Weiland (eds.): Corpus Dialectology: from methods to theory (French, Italian, Spanish). Amsterdam / Philadelphia: John Benjamins Publishing Company, pp.
* Padovan, Andrea, Alessandra Tomaselli, Myrthe Bergstra, Norbert Corver, Ricardo Etxepare, and Simon Dold. 2016. “Minority languages in language contact situations: three case studies on language change”. Us Wurk 65 (3-4): 146-174.
* Rabanus, Stefan. 2018. “Varietà Alloglotte – Tedesco.” In: Thomas Krefeld and Roland Bauer (Eds.), Lo Spazio Comunicativo Dell’Italia e Delle Varietà Italiane. Korpus Im Text. Versione 15. http://www.kit.gwi.uni-muenchen.de/?p=13187&v=1.