- 6/3/2018: the corpus description paper was accepted at Interspeech’18! We will make the paper available online soon, and more speakers will be added over the following months.
- 3/26/2018: v1.0 was released
Welcome to the homepage of L2-ARCTIC, a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. This initial release includes recordings from ten non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, with one male and one female speaker for each L1. Each speaker recorded approximately one hour of read speech from CMU’s ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions. This makes the corpus a valuable resource not only for research in voice conversion and accent conversion but also for computer-assisted pronunciation training.
The corpus is a joint effort of researchers at Texas A&M University and Iowa State University. We are in the process of adding ten (10) more speakers to the corpus, in time for Interspeech’18 (September 2018). In the future, we may also include speakers from other L1s if we find them to be useful to the research community.
For each speaker, the corpus contains the following data:
- Speech recordings: over one hour of prompted recordings of phonetically-balanced short sentences (~1132)
- Word level transcriptions: orthographic transcription and forced-aligned word boundaries for each sentence
- Phoneme level transcriptions: forced-aligned phoneme transcription for each sentence
- Manual annotations: a selected subset of utterances (~150), including 100 sentences produced by all speakers and 50 sentences that include phonemes likely to be difficult according to each speaker’s L1, all annotated with corrected word and phone boundaries; phone substitution, deletion, and addition errors are also tagged
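As an illustration of the three tagged error types, the following is a minimal Python sketch that classifies annotated phone segments and tallies them. The `PhoneAnnotation` layout, its field names, and the sample segments are assumptions made up for this example; they do not describe the corpus's actual file format or contain real corpus data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PhoneAnnotation:
    """One annotated phone segment (hypothetical layout, not the corpus file format)."""
    start: float              # segment start time in seconds
    end: float                # segment end time in seconds
    expected: Optional[str]   # phone the prompt calls for; None if the speaker added a phone
    produced: Optional[str]   # phone the speaker produced; None if the phone was deleted


def error_type(seg: PhoneAnnotation) -> str:
    """Map a segment to 'addition', 'deletion', 'substitution', or 'correct'."""
    if seg.expected is None and seg.produced is None:
        raise ValueError("segment has neither an expected nor a produced phone")
    if seg.expected is None:
        return "addition"
    if seg.produced is None:
        return "deletion"
    if seg.expected != seg.produced:
        return "substitution"
    return "correct"


def count_errors(segments: Iterable[PhoneAnnotation]) -> Counter:
    """Tally how many segments fall into each category."""
    return Counter(error_type(seg) for seg in segments)


# Invented sample segments, for illustration only.
example = [
    PhoneAnnotation(0.00, 0.08, "DH", "D"),   # expected DH, produced D
    PhoneAnnotation(0.08, 0.15, "AH", "AH"),  # produced as expected
    PhoneAnnotation(0.15, 0.15, "T", None),   # expected T, nothing produced
    PhoneAnnotation(0.15, 0.22, None, "AH"),  # extra phone inserted
]
counts = count_errors(example)
```

Per-speaker or per-L1 error profiles could be built the same way by grouping segments before counting.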
Please first review the license terms of L2-ARCTIC. If you agree to those terms, please go to the download section below and fill in your name, email address, and affiliation, then click “Download.” An automated email will be sent to the address you provided with the direct download link. Please make sure to provide a correct email address.
Generally, the email will be sent to you immediately. If you haven’t received it within ten minutes, please check your spam box and look for an email titled “Access to L2-ARCTIC corpus” or add “gzhao
The corpus is released under the CC BY-NC 4.0 license. A summary of the license can be found here, and the full license can be found here.
Please cite “L2-ARCTIC: a non-native English speech corpus” (accepted at Interspeech’18) if you use L2-ARCTIC in any publication.
Guanlong Zhao (firstname.lastname@example.org), Department of Computer Science & Engineering @ TAMU
Sinem Sonsaat (email@example.com), Department of English @ Iowa State University
Alif Silpachai (firstname.lastname@example.org), Department of English @ Iowa State University
Ivana Lucic (email@example.com), Department of English @ Iowa State University
Evgeny Chukharev-Hudilainen (firstname.lastname@example.org), Department of English @ Iowa State University
John Levis (email@example.com), Department of English @ Iowa State University
Ricardo Gutierrez-Osuna (firstname.lastname@example.org), Department of Computer Science & Engineering @ TAMU
Curation of the L2-ARCTIC corpus was supported by NSF awards 1619212 and 1623750. We would like to thank the anonymous participants for recording the corpus. We also would like to thank Ziwei Zhou for his assistance with the annotations. We appreciate suggestions from Christopher Liberatore and Shaojin Ding.
CMU ARCTIC speech database
Speech Accent Archive
IDEA: The International Dialects of English Archive
Dr. John Levis’ homepage
Dr. Sinem Sonsaat’s homepage
Dr. Evgeny Chukharev-Hudilainen’s homepage