L2-ARCTIC: a non-native English speech corpus

News

03/12/2020: v5.0 is available. We added the suitcase corpus, which contains unscripted speech and corresponding annotations from 22 of the 24 speakers
06/06/2019: v4.0 is available. We re-examined some of the annotations and changed most of the “err” tags to more detailed (and informative) annotations that mark them as different deviations from standard English
04/08/2019: v3.0 is available. We added four (4) Vietnamese speakers!
09/28/2018: v2.0 is available. We added 10 new speakers!
08/20/2018: uploaded corpus description paper
06/03/2018: the corpus description paper was accepted at Interspeech’18! We will make the paper available online soon, and more speakers will be added over the following months
03/26/2018: v1.0 was released

Introduction

Welcome to the homepage of L2-ARCTIC, a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. This corpus includes recordings from twenty-four (24) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic and Vietnamese, each L1 containing recordings from two male and two female speakers. Each speaker recorded approximately one hour of read speech from CMU’s ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions, making it a valuable resource not only for research in voice conversion and accent conversion but also in computer-assisted pronunciation training.
The corpus is a joint effort of researchers at Texas A&M University and Iowa State University. In the future, we may also include speakers from other L1s if we find them to be useful to the research community.

Overview

For each speaker, the corpus contains the following data:

Speech recordings: over one hour of prompted recordings of phonetically-balanced short sentences (approximately 1,132 sentences)
Word level transcriptions: orthographic transcription and forced-aligned word boundaries for each sentence
Phoneme level transcriptions: forced-aligned phoneme transcription for each sentence
Manual annotations: a selected subset of utterances (~150), including 100 sentences produced by all speakers and 50 sentences that include phonemes likely to be difficult according to each speaker’s L1, all annotated with corrected word and phone boundaries; phone substitution, deletion, and addition errors are also tagged
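As a quick sanity check after downloading, the four data types listed above can be counted for one speaker. The following is a minimal Python sketch, not an official tool. The sub-folder names (wav, transcript, textgrid, annotation) and the file extensions are assumptions about how a release is unpacked and are not specified on this page, so adjust ASSUMED_LAYOUT to match the copy you have. The function name inventory_speaker is also hypothetical.

```python
from pathlib import Path

# ASSUMPTION: sub-folder name -> file extension for each data type.
# These names are illustrative and not taken from this page; edit them
# to match the layout of the release you downloaded.
ASSUMED_LAYOUT = {
    "wav": ".wav",              # speech recordings
    "transcript": ".txt",       # orthographic transcriptions
    "textgrid": ".TextGrid",    # forced-aligned word/phoneme tiers
    "annotation": ".TextGrid",  # manually annotated subset
}


def inventory_speaker(speaker_dir):
    """Count the files of each data type under one speaker's directory.

    Missing sub-folders are reported as 0 rather than raising an error.
    """
    speaker_dir = Path(speaker_dir)
    counts = {}
    for folder, extension in ASSUMED_LAYOUT.items():
        sub = speaker_dir / folder
        if sub.is_dir():
            counts[folder] = sum(1 for _ in sub.glob("*" + extension))
        else:
            counts[folder] = 0
    return counts
```

For example, inventory_speaker("L2-ARCTIC/LXC") (a hypothetical path) returns a dictionary with one count per data type. If the layout assumption holds, the annotation count should be close to 150 and the wav count close to the number of prompts the speaker recorded.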

Dataset examples

Audio

Speaker	L1	Gender
ABA	Arabic	M
SKA	Arabic	F
YBAA	Arabic	M
ZHAA	Arabic	F
BWC	Mandarin	M
LXC	Mandarin	F
NCC	Mandarin	F
TXHC	Mandarin	M
ASI	Hindi	M
RRBI	Hindi	M
SVBI	Hindi	F
TNI	Hindi	F
HJK	Korean	F
HKK	Korean	M
YDCK	Korean	F
YKWK	Korean	M
EBVS	Spanish	M
ERMS	Spanish	M
MBMPS	Spanish	F
NJS	Spanish	F
HQTV	Vietnamese	M
PNV	Vietnamese	F
THV	Vietnamese	F
TLV	Vietnamese	M
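
The speaker table above can also be kept as data when scripting experiments, for example to split speakers by L1 or gender. The sketch below copies the table into a Python list and tallies gender per L1. The names SPEAKERS and gender_counts_by_l1 are illustrative and not part of the corpus.

```python
from collections import Counter

# (speaker code, L1, gender), copied from the speaker table on this page.
SPEAKERS = [
    ("ABA", "Arabic", "M"), ("SKA", "Arabic", "F"),
    ("YBAA", "Arabic", "M"), ("ZHAA", "Arabic", "F"),
    ("BWC", "Mandarin", "M"), ("LXC", "Mandarin", "F"),
    ("NCC", "Mandarin", "F"), ("TXHC", "Mandarin", "M"),
    ("ASI", "Hindi", "M"), ("RRBI", "Hindi", "M"),
    ("SVBI", "Hindi", "F"), ("TNI", "Hindi", "F"),
    ("HJK", "Korean", "F"), ("HKK", "Korean", "M"),
    ("YDCK", "Korean", "F"), ("YKWK", "Korean", "M"),
    ("EBVS", "Spanish", "M"), ("ERMS", "Spanish", "M"),
    ("MBMPS", "Spanish", "F"), ("NJS", "Spanish", "F"),
    ("HQTV", "Vietnamese", "M"), ("PNV", "Vietnamese", "F"),
    ("THV", "Vietnamese", "F"), ("TLV", "Vietnamese", "M"),
]


def gender_counts_by_l1(speakers):
    """Return {L1: Counter({gender: number of speakers})}."""
    counts = {}
    for _code, l1, gender in speakers:
        counts.setdefault(l1, Counter())[gender] += 1
    return counts


if __name__ == "__main__":
    for l1, tally in sorted(gender_counts_by_l1(SPEAKERS).items()):
        print(l1, dict(tally))
```

The tally gives two male and two female speakers for each of the six L1s, matching the description in the introduction.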

Annotations

A TextGrid with manual annotations (LXC, arctic_a0018; wav file, TextGrid file). Top to bottom: speech waveform, spectrogram, words, phonemes, error tags, and comments from the annotator

Access guidelines

Please first review the license terms of L2-ARCTIC
If you agree to the license terms, please go to the download section below and fill in your name, email address, and affiliation, then click “Download”
An automated email will be sent to the address you provided with the direct download link. Please make sure to provide a correct email address
Generally, the email will be sent to you immediately. If you haven’t received it within ten minutes, please check your spam folder for an email titled “Access to L2-ARCTIC corpus”, or add “adas@tamu.edu” to your whitelist and then submit the download form again
We use reCAPTCHA to filter out attacks and spam. Based on our observations, submissions from some email providers (e.g., 163.com, qq.com, hotmail.com) are more likely than others to be misclassified as spam. Please take this into consideration when choosing the email address you submit
The corpus is hosted on Google Drive. Please contact us if your organization requires an alternative storage method for accessing external data; we may be able to arrange alternative access at our discretion
If you encounter any difficulty accessing L2-ARCTIC, please feel free to contact Anurag Das (adas@tamu.edu) for assistance

Documentation

Link to documentation and publication. Please note that the corpus description paper we published at Interspeech’18 was written when we made the v1.0 data release.

License

The corpus is released under the CC BY-NC 4.0 license. A summary of the license can be found here, and the full license can be found here. For any usage that is not covered by the CC BY-NC 4.0 license, please contact Dr. Ricardo Gutierrez-Osuna (rgutier@tamu.edu).
Please cite the following paper if you use L2-ARCTIC in any publication:

@inproceedings{zhao2018l2arctic,
  author={Guanlong {Zhao} and Sinem {Sonsaat} and Alif {Silpachai} and Ivana {Lucic} and Evgeny {Chukharev-Hudilainen} and John {Levis} and Ricardo {Gutierrez-Osuna}},
  title={L2-ARCTIC: A Non-native English Speech Corpus},
  year=2018,
  booktitle={Proc. Interspeech},
  pages={2783--2787},
  doi={10.21437/Interspeech.2018-1110},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1110}
}

Download

Contact

Guanlong Zhao (gzhao@tamu.edu), Department of Computer Science & Engineering @ TAMU
Sinem Sonsaat (sonsaat@iastate.edu), Department of English @ Iowa State University
Alif Silpachai (alif@iastate.edu), Department of English @ Iowa State University
Ivana Lucic (ilucic@iastate.edu), Department of English @ Iowa State University
Evgeny Chukharev-Hudilainen (evgeny@iastate.edu), Department of English @ Iowa State University
John Levis (jlevis@iastate.edu), Department of English @ Iowa State University
Ricardo Gutierrez-Osuna (rgutier@tamu.edu), Department of Computer Science & Engineering @ TAMU

Acknowledgments

The curation of the L2-ARCTIC corpus was supported by NSF awards 1619212 and 1623750. We would like to thank the anonymous participants for recording the corpus. We also would like to thank Ziwei Zhou and Taylor Anne Barriuso for their assistance with the annotations. We appreciate suggestions from Christopher Liberatore, Shaojin Ding, and the reviewers at Interspeech’18.

External links

CMU ARCTIC speech database
Speech Accent Archive
IDEA: The International Dialects of English Archive
Kaldi-gop
Guanlong Zhao’s external homepage
Dr. John Levis’ homepage
Dr. Evgeny Chukharev-Hudilainen’s homepage