News
- 03/12/2020: v5.0 is available. We added the suitcase corpus, which contains unscripted speech and corresponding annotations from 22 of the 24 speakers
- 06/06/2019: v4.0 is available. We re-examined some of the annotations and replaced most of the “err” tags with more detailed and informative annotations that mark them as different kinds of deviations from standard English
- 04/08/2019: v3.0 is available. We added four (4) Vietnamese speakers!
- 09/28/2018: v2.0 is available. We added 10 new speakers!
- 08/20/2018: uploaded corpus description paper
- 06/03/2018: the corpus description paper was accepted at Interspeech’18! We will make the paper available online soon, and more speakers will be added over the following months
- 03/26/2018: v1.0 was released
Introduction
Welcome to the homepage of L2-ARCTIC, a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. This corpus includes recordings from twenty-four (24) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese, each L1 containing recordings from two male and two female speakers. Each speaker recorded approximately one hour of read speech from CMU’s ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions, making it a valuable resource not only for research in voice conversion and accent conversion but also in computer-assisted pronunciation training.
The corpus is a joint effort of researchers at Texas A&M University and Iowa State University. In the future, we may also include speakers from other L1s if we find them to be useful to the research community.
Overview
For each speaker, the corpus contains the following data:
- Speech recordings: over one hour of prompted recordings of phonetically-balanced short sentences (~1132)
- Word level transcriptions: orthographic transcription and forced-aligned word boundaries for each sentence
- Phoneme level transcriptions: forced-aligned phoneme transcription for each sentence
- Manual annotations: a selected subset of utterances (~150), including 100 sentences produced by all speakers and 50 sentences that include phonemes likely to be difficult according to each speaker’s L1, all annotated with corrected word and phone boundaries; phone substitution, deletion, and addition errors are also tagged
Dataset examples
Audio
Speaker | L1 | Gender | Audio |
---|---|---|---|
ABA | Arabic | M | |
SKA | Arabic | F | |
YBAA | Arabic | M | |
ZHAA | Arabic | F | |
BWC | Mandarin | M | |
LXC | Mandarin | F | |
NCC | Mandarin | F | |
TXHC | Mandarin | M | |
ASI | Hindi | M | |
RRBI | Hindi | M | |
SVBI | Hindi | F | |
TNI | Hindi | F | |
HJK | Korean | F | |
HKK | Korean | M | |
YDCK | Korean | F | |
YKWK | Korean | M | |
EBVS | Spanish | M | |
ERMS | Spanish | M | |
MBMPS | Spanish | F | |
NJS | Spanish | F | |
HQTV | Vietnamese | M | |
PNV | Vietnamese | F | |
THV | Vietnamese | F | |
TLV | Vietnamese | M | |
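The speaker table above can be checked against the stated design of two male and two female speakers per L1. The following is a minimal sketch: the speaker codes, L1s, and genders are copied from the table, while the names `SPEAKERS` and `gender_counts_by_l1` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter

# Speaker code -> (L1, gender), copied from the table above.
SPEAKERS = {
    "ABA": ("Arabic", "M"), "SKA": ("Arabic", "F"),
    "YBAA": ("Arabic", "M"), "ZHAA": ("Arabic", "F"),
    "BWC": ("Mandarin", "M"), "LXC": ("Mandarin", "F"),
    "NCC": ("Mandarin", "F"), "TXHC": ("Mandarin", "M"),
    "ASI": ("Hindi", "M"), "RRBI": ("Hindi", "M"),
    "SVBI": ("Hindi", "F"), "TNI": ("Hindi", "F"),
    "HJK": ("Korean", "F"), "HKK": ("Korean", "M"),
    "YDCK": ("Korean", "F"), "YKWK": ("Korean", "M"),
    "EBVS": ("Spanish", "M"), "ERMS": ("Spanish", "M"),
    "MBMPS": ("Spanish", "F"), "NJS": ("Spanish", "F"),
    "HQTV": ("Vietnamese", "M"), "PNV": ("Vietnamese", "F"),
    "THV": ("Vietnamese", "F"), "TLV": ("Vietnamese", "M"),
}


def gender_counts_by_l1(speakers):
    """Return {L1: Counter of genders} for a speaker table."""
    counts = {}
    for l1, gender in speakers.values():
        counts.setdefault(l1, Counter())[gender] += 1
    return counts


counts = gender_counts_by_l1(SPEAKERS)
for l1 in sorted(counts):
    # Print the per-L1 gender breakdown, one L1 per line.
    print(l1, dict(counts[l1]))
```

A lookup table like this can also be used to select subsets of the corpus, for example all female speakers of a given L1.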
Annotations
A TextGrid with manual annotations (LXC, arctic_a0018; wav file, TextGrid file). Top to bottom: speech waveform, spectrogram, words, phonemes, error tags, and comments from the annotator
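Once the labels of the error-tag tier have been read into a list of strings, the three error types can be tallied per utterance. The sketch below assumes that each tagged label has the form `canonical,perceived,type`, where `type` is `s`, `d`, or `a`. This layout is an assumption made for illustration and is not stated on this page; please check the corpus documentation for the actual tag format. The function names and the example labels are hypothetical.

```python
from collections import Counter

# ASSUMPTION: a tagged label looks like "canonical,perceived,type",
# where type is "s" (substitution), "d" (deletion), or "a" (addition).
# Untagged labels are plain phone symbols with no commas.
ERROR_TYPES = {"s": "substitution", "d": "deletion", "a": "addition"}


def classify_label(label):
    """Return the error type name for a label, or None if it is untagged."""
    parts = [p.strip() for p in label.split(",")]
    if len(parts) != 3:
        return None
    return ERROR_TYPES.get(parts[2].lower())


def tally_errors(labels):
    """Count substitutions, deletions, and additions in a list of labels."""
    counts = Counter()
    for label in labels:
        kind = classify_label(label)
        if kind is not None:
            counts[kind] += 1
    return counts


# Hypothetical labels for one utterance, not taken from the corpus.
example = ["DH", "AH0", "Z,S,s", "T,sil,d", "sil,AH0,a", "IY1"]
print(tally_errors(example))
```

Keeping the classification in its own function makes it easy to adapt if the actual tag format differs from the one assumed here.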
Access guidelines
- Please first review the license terms of L2-ARCTIC
- If you agree to the license terms, please go to the download section below and fill in your name, email address, and affiliation, then click “Download”
- An automated email will be sent to the address you provided with the direct download link. Please make sure to provide a correct email address
- Generally, the email will be sent to you immediately. If you haven’t received it within ten minutes, please check your spam folder for an email titled “Access to L2-ARCTIC corpus”, or add “adas@tamu.edu” to your whitelist and then submit the download form again
- We use reCAPTCHA to filter out spam and abuse. Based on our observations, submissions from some email providers (e.g., 163.com, qq.com, hotmail.com) are more likely than others to be misclassified as spam. Please take this into consideration when choosing the email address you submit
- The corpus is hosted on Google Drive. Please contact us if your organization requires an alternative way to access externally hosted data; we may be able to arrange alternative access at our discretion
- If you encounter any difficulty accessing L2-ARCTIC, please feel free to contact Anurag Das (adas@tamu.edu) for assistance
Documentation
Link to documentation and publication. Please note that the corpus description paper we published at Interspeech’18 was written when we made the v1.0 data release.
License
The corpus is released under the CC BY-NC 4.0 license. A summary of the license can be found here, and the full license can be found here. For any usage that is not covered by the CC BY-NC 4.0 license, please contact Dr. Ricardo Gutierrez-Osuna (rgutier@tamu.edu).
Please cite the corpus description paper (Interspeech’18) if you use L2-ARCTIC in any publication.
Download
Contact
Guanlong Zhao (gzhao@tamu.edu), Department of Computer Science & Engineering @ TAMU
Sinem Sonsaat (sonsaat@iastate.edu), Department of English @ Iowa State University
Alif Silpachai (alif@iastate.edu), Department of English @ Iowa State University
Ivana Lucic (ilucic@iastate.edu), Department of English @ Iowa State University
Evgeny Chukharev-Hudilainen (evgeny@iastate.edu), Department of English @ Iowa State University
John Levis (jlevis@iastate.edu), Department of English @ Iowa State University
Ricardo Gutierrez-Osuna (rgutier@tamu.edu), Department of Computer Science & Engineering @ TAMU
Acknowledgments
The curation of the L2-ARCTIC corpus was supported by NSF awards 1619212 and 1623750. We would like to thank the anonymous participants for recording the corpus. We also would like to thank Ziwei Zhou and Taylor Anne Barriuso for their assistance with the annotations. We appreciate suggestions from Christopher Liberatore, Shaojin Ding, and the reviewers at Interspeech’18.
External links
CMU ARCTIC speech database
Speech Accent Archive
IDEA: The International Dialects of English Archive
Kaldi-gop
Guanlong Zhao’s external homepage
Dr. John Levis’ homepage
Dr. Evgeny Chukharev-Hudilainen’s homepage