L2-ARCTIC: a non-native English speech corpus

News

  • 03/12/2020: v5.0 is available. We added the suitcase corpus, which contains unscripted speech and corresponding annotations from 22 of the 24 speakers
  • 06/06/2019: v4.0 is available. We re-examined some of the annotations and replaced most of the “err” tags with more detailed and informative annotations that mark them as specific deviations from standard English
  • 04/08/2019: v3.0 is available. We added four (4) Vietnamese speakers!
  • 09/28/2018: v2.0 is available. We added 10 new speakers!
  • 08/20/2018: uploaded corpus description paper
  • 06/03/2018: the corpus description paper was accepted at Interspeech’18! We will make the paper available online soon, and more speakers will be added over the following months
  • 03/26/2018: v1.0 was released

Introduction

Welcome to the homepage of L2-ARCTIC, a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. The corpus includes recordings from twenty-four (24) non-native speakers of English whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic, and Vietnamese, with two male and two female speakers per L1. Each speaker recorded approximately one hour of read speech from CMU’s ARCTIC prompts, from which we generated orthographic and forced-aligned phonetic transcriptions. In addition, we manually annotated 150 utterances per speaker to identify three types of mispronunciation: substitutions, deletions, and additions. These annotations also make the corpus a valuable resource for research in computer-assisted pronunciation training.
The corpus is a joint effort of researchers at Texas A&M University and Iowa State University. In the future, we may also include speakers from other L1s if we find them to be useful to the research community.

Overview

For each speaker, the corpus contains the following data:

  • Speech recordings: over one hour of prompted recordings of phonetically-balanced short sentences (~1,132 sentences)
  • Word level transcriptions: orthographic transcription and forced-aligned word boundaries for each sentence
  • Phoneme level transcriptions: forced-aligned phoneme transcription for each sentence
  • Manual annotations: a selected subset of utterances (~150) annotated with corrected word and phone boundaries and tagged for phone substitution, deletion, and addition errors. The subset comprises 100 sentences produced by all speakers and 50 sentences that include phonemes likely to be difficult given each speaker’s L1
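As a rough illustration of how the three tagged error types might be tallied, here is a minimal Python sketch. It assumes each annotated phone label has the form "CORRECT,PERCEIVED,TYPE", with TYPE being s, d, or a. That layout is an assumption and is not stated on this page, so please check the corpus documentation before relying on it. The names parse_error_tag and tally_errors are hypothetical helpers, not part of any corpus tooling.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Assumed tag layout (not confirmed on this page): "CORRECT,PERCEIVED,TYPE"
ERROR_TYPES = {"s": "substitution", "d": "deletion", "a": "addition"}


class PhoneError(NamedTuple):
    correct: str    # phone the speaker was expected to produce
    perceived: str  # phone the annotator heard
    kind: str       # "substitution", "deletion", or "addition"


def parse_error_tag(label: str) -> Optional[PhoneError]:
    """Return a PhoneError for a tagged label, or None for an untagged phone."""
    parts = [p.strip() for p in label.split(",")]
    if len(parts) != 3:
        return None
    correct, perceived, code = parts
    kind = ERROR_TYPES.get(code.lower())
    if kind is None:
        return None
    return PhoneError(correct, perceived, kind)


def tally_errors(labels: Iterable[str]) -> Counter:
    """Count tagged errors by type, ignoring untagged labels."""
    parsed = (parse_error_tag(label) for label in labels)
    return Counter(err.kind for err in parsed if err is not None)


# Hypothetical labels for illustration only, not taken from the corpus.
example_labels = ["DH", "Z,S,s", "T,sil,d", "sil,AH,a", "IY"]
print(tally_errors(example_labels))
```

Untagged labels such as "DH" are skipped rather than treated as errors, so the tally only reflects phones the annotators explicitly marked.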

Dataset examples

Audio
Speaker  L1          Gender
ABA      Arabic      M
SKA      Arabic      F
YBAA     Arabic      M
ZHAA     Arabic      F
BWC      Mandarin    M
LXC      Mandarin    F
NCC      Mandarin    F
TXHC     Mandarin    M
ASI      Hindi       M
RRBI     Hindi       M
SVBI     Hindi       F
TNI      Hindi       F
HJK      Korean      F
HKK      Korean      M
YDCK     Korean      F
YKWK     Korean      M
EBVS     Spanish     M
ERMS     Spanish     M
MBMPS    Spanish     F
NJS      Spanish     F
HQTV     Vietnamese  M
PNV      Vietnamese  F
THV      Vietnamese  F
TLV      Vietnamese  M
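
For scripts that need speaker metadata, the table can be mirrored as plain Python data. The sketch below copies the rows listed above and checks the stated balance of two male and two female speakers per L1. The names SPEAKERS and gender_counts_by_l1 are illustrative and are not part of the corpus distribution.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (speaker ID, L1, gender) rows copied from the speaker table on this page.
SPEAKERS: List[Tuple[str, str, str]] = [
    ("ABA", "Arabic", "M"), ("SKA", "Arabic", "F"),
    ("YBAA", "Arabic", "M"), ("ZHAA", "Arabic", "F"),
    ("BWC", "Mandarin", "M"), ("LXC", "Mandarin", "F"),
    ("NCC", "Mandarin", "F"), ("TXHC", "Mandarin", "M"),
    ("ASI", "Hindi", "M"), ("RRBI", "Hindi", "M"),
    ("SVBI", "Hindi", "F"), ("TNI", "Hindi", "F"),
    ("HJK", "Korean", "F"), ("HKK", "Korean", "M"),
    ("YDCK", "Korean", "F"), ("YKWK", "Korean", "M"),
    ("EBVS", "Spanish", "M"), ("ERMS", "Spanish", "M"),
    ("MBMPS", "Spanish", "F"), ("NJS", "Spanish", "F"),
    ("HQTV", "Vietnamese", "M"), ("PNV", "Vietnamese", "F"),
    ("THV", "Vietnamese", "F"), ("TLV", "Vietnamese", "M"),
]


def gender_counts_by_l1(speakers: List[Tuple[str, str, str]]) -> Dict[str, Counter]:
    """Group speakers by L1 and count how many are male and female."""
    counts: Dict[str, Counter] = {}
    for _speaker_id, l1, gender in speakers:
        counts.setdefault(l1, Counter())[gender] += 1
    return counts


for l1, counts in gender_counts_by_l1(SPEAKERS).items():
    print(l1, dict(counts))
```
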
Annotations

A TextGrid with manual annotations (LXC, arctic_a0018; wav file, TextGrid file). Top to bottom: speech waveform, spectrogram, words, phonemes, error tags, and comments from the annotator
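
The tiers in such a file can be read programmatically. The following is a minimal sketch that assumes the TextGrid is saved in Praat's long text format with interval tiers. The tier name "phones" and the sample content are hypothetical, and read_interval_tiers is an illustrative helper rather than part of any official tooling. A dedicated TextGrid library may be more robust for real use.

```python
import re
from typing import Dict, List, Tuple

Interval = Tuple[float, float, str]  # (start time, end time, label)

# Patterns for Praat's long text format (assumed; short format is not handled).
_TIER_SPLIT = re.compile(r"item \[\d+\]:")
_NAME = re.compile(r'name = "([^"]*)"')
_INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.eE+-]+)\s*xmax = ([\d.eE+-]+)'
    r'\s*text = "((?:[^"]|"")*)"'
)


def read_interval_tiers(textgrid: str) -> Dict[str, List[Interval]]:
    """Map each tier name to its list of (xmin, xmax, text) intervals."""
    tiers: Dict[str, List[Interval]] = {}
    # The first chunk is the file header, so it is skipped.
    for chunk in _TIER_SPLIT.split(textgrid)[1:]:
        name_match = _NAME.search(chunk)
        if name_match is None:
            continue
        tiers[name_match.group(1)] = [
            # Praat escapes a double quote inside a label by doubling it.
            (float(lo), float(hi), text.replace('""', '"'))
            for lo, hi, text in _INTERVAL.findall(chunk)
        ]
    return tiers


# Hypothetical two-interval file for illustration, not taken from the corpus.
sample = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phones"
        xmin = 0
        xmax = 0.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.2
            text = "DH"
        intervals [2]:
            xmin = 0.2
            xmax = 0.5
            text = "Z,S,s"
'''

for name, intervals in read_interval_tiers(sample).items():
    print(name, intervals)
```

To read a file from disk, pass its decoded contents to read_interval_tiers. The file encoding is not stated on this page, so it may need to be checked first.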

Access guidelines

  • Please first review the license terms of L2-ARCTIC
  • If you agree to the license terms, please go to the download section below and fill in your name, email address, and affiliation, then click “Download”
  • An automated email will be sent to the address you provided with the direct download link. Please make sure to provide a correct email address
  • Generally, the email will be sent immediately. If you haven’t received it within ten minutes, please check your spam folder for an email titled “Access to L2-ARCTIC corpus”, or add “adas@tamu.edu” to your whitelist and submit the download form again
  • We use reCAPTCHA to filter out spam and abuse. Based on our observations, addresses from some email providers (e.g., 163.com, qq.com, hotmail.com) are more likely than others to be misclassified as spam. Please take this into consideration when choosing the email address to submit
  • The corpus is hosted on Google Drive. Please contact us if your organization REQUIRES an alternative storage method for accessing external data; we may be able to arrange alternative access at our discretion
  • If you encounter any difficulty accessing L2-ARCTIC, please feel free to contact Anurag Das (adas@tamu.edu) for assistance

Documentation

Link to documentation and publication. Please note that the corpus description paper we published at Interspeech’18 was written when we made the v1.0 data release.

License

The corpus is released under the CC BY-NC 4.0 license. A summary of the license can be found here, and the full license can be found here. For any usage that is not covered by the CC BY-NC 4.0 license, please contact Dr. Ricardo Gutierrez-Osuna (rgutier@tamu.edu).
Please cite the following paper if you use L2-ARCTIC in any publication:

@inproceedings{zhao2018l2arctic,
    author={Guanlong {Zhao} and Sinem {Sonsaat} and Alif {Silpachai} and Ivana {Lucic} and Evgeny {Chukharev-Hudilainen} and John {Levis} and Ricardo {Gutierrez-Osuna}},
    title={L2-ARCTIC: A Non-native English Speech Corpus},
    year=2018,
    booktitle={Proc. Interspeech},
    pages={2783--2787},
    doi={10.21437/Interspeech.2018-1110},
    url={http://dx.doi.org/10.21437/Interspeech.2018-1110}
}

Download





    Contact

    Guanlong Zhao (gzhao@tamu.edu), Department of Computer Science & Engineering @ TAMU
    Sinem Sonsaat (sonsaat@iastate.edu), Department of English @ Iowa State University
    Alif Silpachai (alif@iastate.edu), Department of English @ Iowa State University
    Ivana Lucic (ilucic@iastate.edu), Department of English @ Iowa State University
    Evgeny Chukharev-Hudilainen (evgeny@iastate.edu), Department of English @ Iowa State University
    John Levis (jlevis@iastate.edu), Department of English @ Iowa State University
    Ricardo Gutierrez-Osuna (rgutier@tamu.edu), Department of Computer Science & Engineering @ TAMU

    Acknowledgments

    The curation of the L2-ARCTIC corpus was supported by NSF awards 1619212 and 1623750. We would like to thank the anonymous participants for recording the corpus. We also would like to thank Ziwei Zhou and Taylor Anne Barriuso for their assistance with the annotations. We appreciate suggestions from Christopher Liberatore, Shaojin Ding, and the reviewers at Interspeech’18.

    External links

    CMU ARCTIC speech database
    Speech Accent Archive
    IDEA: The International Dialects of English Archive
    Kaldi-gop
    Guanlong Zhao’s external homepage
    Dr. John Levis’ homepage
    Dr. Evgeny Chukharev-Hudilainen’s homepage