L2-ARCTIC: a non-native English speech corpus

News

  • 9/28/2018: v2.0 is available, adding 10 new speakers!
  • 8/20/2018: uploaded corpus description paper
  • 6/3/2018: the corpus description paper was accepted at Interspeech’18! We will make the paper available online soon and will add more speakers over the following months
  • 3/26/2018: v1.0 was released

Introduction

Welcome to the homepage of L2-ARCTIC, a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. This corpus includes recordings from twenty (20) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, and Arabic, with two male and two female speakers per L1. Each speaker recorded approximately one hour of read speech from CMU’s ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation errors: substitutions, deletions, and additions. These annotations make the corpus a valuable resource not only for research in voice conversion and accent conversion but also for computer-assisted pronunciation training.
The corpus is a joint effort of researchers at Texas A&M University and Iowa State University. In the future, we may also include speakers from other L1s if we find them to be useful to the research community.

Overview

For each speaker, the corpus contains the following data:

  • Speech recordings: over one hour of prompted recordings of phonetically-balanced short sentences (~1132)
  • Word-level transcriptions: orthographic transcription and forced-aligned word boundaries for each sentence
  • Phoneme-level transcriptions: forced-aligned phoneme transcription for each sentence
  • Manual annotations: a selected subset of utterances (~150), including 100 sentences produced by all speakers and 50 sentences that include phonemes likely to be difficult according to each speaker’s L1, all annotated with corrected word and phone boundaries; phone substitution, deletion, and addition errors are also tagged
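
The structure of the annotated subset (sentences shared by all speakers plus sentences specific to a speaker's L1) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the function name `split_annotated` is hypothetical, and every utterance ID in the toy data other than arctic_a0018 is made up for the example rather than taken from the corpus. Note that speakers with the same L1 may also share their L1-specific sentences, so the intersection is most meaningful across speakers of different L1s.

```python
def split_annotated(annotated_by_speaker):
    """Split annotated utterance IDs into a shared set and per-speaker extras.

    annotated_by_speaker maps a speaker code to an iterable of utterance IDs.
    Returns (shared, extras): `shared` holds the IDs annotated for every
    speaker, and `extras` maps each speaker to the IDs outside that set.
    """
    id_sets = [set(ids) for ids in annotated_by_speaker.values()]
    shared = set.intersection(*id_sets) if id_sets else set()
    extras = {
        speaker: set(ids) - shared
        for speaker, ids in annotated_by_speaker.items()
    }
    return shared, extras


# Toy data (hypothetical IDs, for illustration only).
TOY_ANNOTATIONS = {
    "LXC": ["arctic_a0001", "arctic_a0018", "arctic_b0100"],
    "ABA": ["arctic_a0001", "arctic_a0018", "arctic_b0200"],
}

if __name__ == "__main__":
    shared, extras = split_annotated(TOY_ANNOTATIONS)
    print("shared:", sorted(shared))
    for speaker, ids in extras.items():
        print(speaker, "only:", sorted(ids))
```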

Dataset examples

Audio

Speaker   L1         Gender
ABA       Arabic     M
SKA       Arabic     F
YBAA      Arabic     M
ZHAA      Arabic     F
BWC       Mandarin   M
LXC       Mandarin   F
NCC       Mandarin   F
TXHC      Mandarin   M
ASI       Hindi      M
RRBI      Hindi      M
SVBI      Hindi      F
TNI       Hindi      F
HJK       Korean     F
HKK       Korean     M
YDCK      Korean     F
YKWK      Korean     M
EBVS      Spanish    M
ERMS      Spanish    M
MBMPS     Spanish    F
NJS       Spanish    F
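
For scripts that need speaker metadata, the table above can be transcribed into a small lookup structure. The following Python sketch is one possible encoding, not an official metadata file shipped with the corpus; the names `SPEAKERS` and `count_by_l1_and_gender` are assumptions made for this example. It also checks the balance stated in the introduction (two male and two female speakers per L1).

```python
from collections import Counter

# Speaker code -> (L1, gender), transcribed from the speaker table.
SPEAKERS = {
    "ABA": ("Arabic", "M"), "SKA": ("Arabic", "F"),
    "YBAA": ("Arabic", "M"), "ZHAA": ("Arabic", "F"),
    "BWC": ("Mandarin", "M"), "LXC": ("Mandarin", "F"),
    "NCC": ("Mandarin", "F"), "TXHC": ("Mandarin", "M"),
    "ASI": ("Hindi", "M"), "RRBI": ("Hindi", "M"),
    "SVBI": ("Hindi", "F"), "TNI": ("Hindi", "F"),
    "HJK": ("Korean", "F"), "HKK": ("Korean", "M"),
    "YDCK": ("Korean", "F"), "YKWK": ("Korean", "M"),
    "EBVS": ("Spanish", "M"), "ERMS": ("Spanish", "M"),
    "MBMPS": ("Spanish", "F"), "NJS": ("Spanish", "F"),
}


def count_by_l1_and_gender(speakers):
    """Count speakers for each (L1, gender) pair."""
    return Counter(speakers.values())


if __name__ == "__main__":
    counts = count_by_l1_and_gender(SPEAKERS)
    for (l1, gender), n in sorted(counts.items()):
        print(f"{l1:<9} {gender} {n}")
```
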
Annotations

A TextGrid with manual annotations (LXC, arctic_a0018; wav file, TextGrid file). Top to bottom: speech waveform, spectrogram, words, phonemes, error tags, and comments from the annotator
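
TextGrid files are plain text and can be read without Praat. The Python sketch below is a minimal reader for interval tiers, assuming the file is saved in Praat's long text format and has already been decoded to a string (Praat may write UTF-8 or UTF-16, so check the encoding before reading). The tier names and labels in `SAMPLE` are invented for illustration and are not taken from the corpus; the function name `parse_textgrid` is likewise hypothetical. Point tiers and the short TextGrid format are not handled.

```python
import re

# Start of one tier block, e.g. "item [1]:" (the header line "item []:" does not match).
TIER_RE = re.compile(r'item \[\d+\]:')
# Tier name line, e.g. name = "words".
NAME_RE = re.compile(r'name = "([^"]*)"')
# One interval entry: start time, end time, and label (Praat doubles inner quotes).
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.eE+-]+)\s*'
    r'xmax = ([\d.eE+-]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def parse_textgrid(content):
    """Return {tier name: [(xmin, xmax, text), ...]} for each interval tier."""
    tiers = {}
    # Chunk 0 is the file header; every later chunk is one tier.
    for chunk in TIER_RE.split(content)[1:]:
        name_match = NAME_RE.search(chunk)
        if name_match is None:
            continue
        tiers[name_match.group(1)] = [
            (float(xmin), float(xmax), text.replace('""', '"'))
            for xmin, xmax, text in INTERVAL_RE.findall(chunk)
        ]
    return tiers


# Made-up example content in Praat's long text format.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.0
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "hello"
        intervals [2]:
            xmin = 0.4
            xmax = 1.0
            text = "world"
    item [2]:
        class = "IntervalTier"
        name = "phones"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.1
            text = "HH"
        intervals [2]:
            xmin = 0.1
            xmax = 1.0
            text = ""
'''

if __name__ == "__main__":
    for tier_name, intervals in parse_textgrid(SAMPLE).items():
        print(tier_name, intervals)
```

To read a file from disk, pass the decoded file contents to `parse_textgrid`, for example the result of `open(path, encoding="utf-8").read()` when the file is UTF-8.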

Access guidelines

Please first review the license terms of L2-ARCTIC. If you agree to those terms, please go to the download section below and fill in your name, email address, and affiliation, then click “Download.” An automated email will be sent to the address you provided with the direct download link. Please make sure to provide a correct email address.
Generally, the email will be sent to you immediately. If you haven’t received it within ten minutes, please check your spam folder for an email titled “Access to L2-ARCTIC corpus,” or add “gzhao@tamu.edu” to your whitelist and then submit the download form again. If you encounter any difficulty accessing L2-ARCTIC, please feel free to contact Guanlong Zhao (gzhao@tamu.edu) for assistance.

Documentation

Link to documentation and publication.

License

The corpus is released under the CC BY-NC 4.0 license. A summary of the license can be found here, and the full license can be found here. For any usage that is not covered by the CC BY-NC 4.0 license, please contact Dr. Ricardo Gutierrez (rgutier@tamu.edu).
Please cite the following paper if you use L2-ARCTIC in a publication:

@inproceedings{zhao2018l2arctic,
    author={Guanlong {Zhao} and Sinem {Sonsaat} and Alif {Silpachai} and Ivana {Lucic} and Evgeny {Chukharev-Hudilainen} and John {Levis} and Ricardo {Gutierrez-Osuna}},
    title={L2-ARCTIC: A Non-native English Speech Corpus},
    year=2018,
    booktitle={Proc. Interspeech 2018},
    pages={2783--2787},
    doi={10.21437/Interspeech.2018-1110},
    url={http://dx.doi.org/10.21437/Interspeech.2018-1110}
}

Download




Contact

Guanlong Zhao (gzhao@tamu.edu), Department of Computer Science & Engineering @ TAMU
Sinem Sonsaat (sonsaat@iastate.edu), Department of English @ Iowa State University
Alif Silpachai (alif@iastate.edu), Department of English @ Iowa State University
Ivana Lucic (ilucic@iastate.edu), Department of English @ Iowa State University
Evgeny Chukharev-Hudilainen (evgeny@iastate.edu), Department of English @ Iowa State University
John Levis (jlevis@iastate.edu), Department of English @ Iowa State University
Ricardo Gutierrez-Osuna (rgutier@tamu.edu), Department of Computer Science & Engineering @ TAMU

Acknowledgments

Curation of the L2-ARCTIC corpus was supported by NSF awards 1619212 and 1623750. We would like to thank the anonymous participants for recording the corpus. We would also like to thank Ziwei Zhou and Taylor Anne Barriuso for their assistance with the annotations. We appreciate suggestions from Christopher Liberatore, Shaojin Ding, and the reviewers at Interspeech’18.

External links

CMU ARCTIC speech database
Speech Accent Archive
IDEA: The International Dialects of English Archive
Kaldi-gop
Dr. John Levis’ homepage
Dr. Sinem Sonsaat’s homepage
Dr. Evgeny Chukharev-Hudilainen’s homepage
Alif Silpachai’s homepage