The L2-ARCTIC corpus is a speech corpus of non-native English intended for research in voice conversion, accent conversion, and mispronunciation detection. In total, the corpus contains 26,867 utterances from 24 non-native speakers with a balanced gender and L1 distribution. Most speakers recorded the full CMU ARCTIC set. The total duration of the corpus is 27.1 hours, with an average of 67.7 minutes (std: 8.6 minutes) of speech per L2 speaker. On average, each utterance is 3.6 seconds long, and the pause before and after each utterance is generally no longer than 100 ms. Using the forced-alignment results, we estimate a speech-to-silence ratio of 7:1 across the whole dataset. The dataset contains over 238,702 word segments, giving an average of around nine words per utterance, and over 851,830 phone segments (excluding silence).
Human annotators manually examined 3,599 utterances, annotating 14,098 phone substitutions, 3,420 phone deletions, and 1,092 phone additions.
Some speakers did not read all of the sentences, and for some speakers a few sentences were removed because the recordings did not meet our quality requirements.
About the suitcase corpus (added on March 12, 2020)
This portion of the L2-ARCTIC corpus contains spontaneous speech. We include recordings and annotations from 22 of the 24 speakers who recorded the sentences; speakers SKA and ASI did not participate in this task. Each speaker retold a story from a picture narrative used in applied linguistics research on comprehensibility, accentedness, and intelligibility; the pictures are generally known as the suitcase story. Before retelling the narrative, each speaker looked over the story and could ask the researchers questions about what was happening; few participants had questions. The annotations were carried out by two research assistants trained in phonetic transcription. Each did half of the transcriptions, then checked the half done by the other. Finally, all transcriptions were checked by John Levis, a co-PI for the project. This project was funded by National Science Foundation award 1618953, titled “Developing Golden Speakers for Second-Language Pronunciation.”
The total duration of this subset is 26.1 minutes, with an average of 1.2 minutes (std: 41.5 seconds) per speaker. Using the manual annotation results, we estimate a speech-to-silence ratio of 2.3:1 across the whole dataset. The dataset contains around 3,083 word segments, giving an average of 140 words per recording, and around 9,458 phone segments (excluding silence). The manual annotations include 1,673 phone substitutions, 456 phone deletions, and 90 phone additions.
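The speech-to-silence ratios above can be reproduced from the interval tiers by summing speech and silence durations. Below is a minimal sketch, assuming silence intervals are labeled “sil” (the function name and the toy intervals are our own, not from the corpus):

```python
# Sketch: estimating a speech-to-silence ratio from labeled time intervals.
# Assumes silence intervals carry the label "sil"; all other labels count
# as speech. Interval data here is a toy example, not corpus data.

def speech_to_silence_ratio(intervals):
    """intervals: list of (label, start_sec, end_sec) tuples."""
    speech = sum(end - start for label, start, end in intervals if label != "sil")
    silence = sum(end - start for label, start, end in intervals if label == "sil")
    return speech / silence if silence > 0 else float("inf")

# Toy example: 2.3 s of speech vs. 1.0 s of silence
intervals = [("sil", 0.0, 0.5), ("HH", 0.5, 1.5), ("AH", 1.5, 2.8), ("sil", 2.8, 3.3)]
print(round(speech_to_silence_ratio(intervals), 2))  # -> 2.3
```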
Each manually annotated TextGrid file always has a “words” tier and a “phones” tier, but some files may have an additional tier containing comments from the annotators. Each phone segment was tagged in the “phones” tier following the conventions below:
- Correctly pronounced: the existing forced-alignment label was unchanged
- Phone substitution error: we changed the label to “CPL,PPL,s”, where “CPL” is the correct phoneme label (i.e., what should have been produced), “PPL” is the perceived phoneme label (i.e., what was actually produced), and “s” stands for substitution. If the perceived phoneme label was hard to judge, we used the “err” tag as its phoneme label, i.e., tagged it as “CPL,err,s”. If the perceived phoneme sounded like a deviation from the standard American English pronunciation, we marked it with a “deviation” symbol “*”. For example, if the correct label for a phone segment is “AH” and the speaker pronounced it as an “AO” with a foreign accent, we marked this error as “AH,AO*,s”.
- Phone addition error: we created an empty interval in the “phones” tier, adjusted its boundaries, and changed its label to “sil,PPL,a”, where “sil” stands for silence and “a” stands for addition (insertion). If the perceived phoneme label was not in the American English phoneme set, we used the “err” and “*” tags as noted above.
- Phone deletion error: we found the silent segment where the deleted phone should have been, and annotated it as “CPL,sil,d”, where “d” stands for deletion.
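The labeling conventions above can be decoded mechanically. Here is a minimal sketch of a parser for one phone-tier label; the function name and the returned dictionary layout are our own, while the “CPL,PPL,tag” format and the “*”/“err” conventions come from the description above:

```python
# Sketch: parsing a phone-tier label from the manual annotations into its
# parts, following the "CPL,PPL,tag" convention described above.

def parse_phone_label(label):
    """Return a dict describing one phone-tier label."""
    parts = label.split(",")
    if len(parts) != 3:
        # A plain forced-alignment label (e.g. "AH") means a correct pronunciation.
        return {"correct": label, "perceived": label, "error": None, "deviation": False}
    correct, perceived, tag = parts
    deviation = perceived.endswith("*")          # "*" marks an accented deviation
    error = {"s": "substitution", "a": "addition", "d": "deletion"}[tag]
    return {"correct": correct,
            "perceived": perceived.rstrip("*"),
            "error": error,
            "deviation": deviation}

print(parse_phone_label("AH,AO*,s"))
# -> {'correct': 'AH', 'perceived': 'AO', 'error': 'substitution', 'deviation': True}
```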
Data for each speaker is organized in its own subdirectory under the root folder; the root folder also contains a README.md file, a README.pdf file (the README.md converted to PDF), a LICENSE file, and a PROMPTS file (containing the original text prompts from CMU ARCTIC). Each speaker’s directory is structured as follows:
- /wav: containing audio files in WAV format, sampled at 44.1 kHz
- /transcript: containing orthographic transcriptions, saved in TXT format
- /textgrid: containing phoneme transcriptions generated from forced-alignment, saved in TextGrid format
- /annotation: containing manual annotations, saved in TextGrid format
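Given this layout, the four files belonging to one utterance can be located by path construction alone. The sketch below assumes the speaker code “ASI” and the CMU ARCTIC-style utterance id “arctic_a0001”, plus `.wav`/`.txt`/`.TextGrid` file extensions; treat the extensions and the root folder name as assumptions to verify against your copy of the corpus:

```python
# Sketch: building the expected file paths for one utterance of one speaker,
# following the per-speaker directory layout above. The root folder name,
# utterance id, and file extensions are illustrative assumptions.
from pathlib import Path

def utterance_paths(root, speaker, utt_id):
    base = Path(root) / speaker
    return {"wav": base / "wav" / f"{utt_id}.wav",
            "transcript": base / "transcript" / f"{utt_id}.txt",
            "textgrid": base / "textgrid" / f"{utt_id}.TextGrid",
            "annotation": base / "annotation" / f"{utt_id}.TextGrid"}

paths = utterance_paths("l2arctic_release", "ASI", "arctic_a0001")
print(paths["wav"].as_posix())  # l2arctic_release/ASI/wav/arctic_a0001.wav
```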
The suitcase corpus is stored in a subfolder named suitcase_corpus under the root folder and follows a similar directory structure, except that we did not include the force-aligned TextGrid files, since the manual annotations contain more accurate alignments. All files in the suitcase corpus are named by speaker code.
| Speaker | # Wav Files | # Annotations |
| --- | --- | --- |
Below are some useful tools we used to access TextGrid files:
- Praat: visualizing and modifying TextGrid files in a GUI
- mPraat: read/write TextGrid files in Matlab
- TextGridTools: read/write TextGrid files in Python
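If none of these tools fits your environment, the long (verbose) TextGrid text format is simple enough to read directly. The following is a minimal sketch, without external dependencies, that extracts the intervals of one named IntervalTier; it assumes the long format (not Praat's short format) and makes no attempt at full Praat compatibility, so prefer the tools above for real work:

```python
# Minimal sketch of reading intervals from a long-format TextGrid without
# external libraries. Handles only the verbose ("long") text format.
import re

def read_tier(textgrid_text, tier_name):
    """Return [(xmin, xmax, label), ...] for the named IntervalTier."""
    intervals = []
    in_tier = False
    xmin = xmax = None
    for line in textgrid_text.splitlines():
        line = line.strip()
        if line.startswith("name ="):
            in_tier = re.search(r'"(.*)"', line).group(1) == tier_name
        elif in_tier:
            if line.startswith("xmin ="):
                xmin = float(line.split("=")[1])
            elif line.startswith("xmax ="):
                xmax = float(line.split("=")[1])
            elif line.startswith("text ="):
                label = re.search(r'"(.*)"', line).group(1)
                intervals.append((xmin, xmax, label))
    return intervals

# Toy fragment of a long-format TextGrid, not a real corpus file.
sample = '''item [1]:
    class = "IntervalTier"
    name = "phones"
    xmin = 0
    xmax = 1.0
    intervals: size = 2
    intervals [1]:
        xmin = 0
        xmax = 0.4
        text = "sil"
    intervals [2]:
        xmin = 0.4
        xmax = 1.0
        text = "AH,AO*,s"
'''
print(read_tier(sample, "phones"))
# -> [(0.0, 0.4, 'sil'), (0.4, 1.0, 'AH,AO*,s')]
```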
- 03/12/2020: v5.0, add the suitcase corpus, which contains unscripted speech and corresponding annotations from 22 of the 24 speakers
- 06/06/2019: v4.0 is available. We re-examined some of the annotations and changed most of the “err” tags to more detailed (and informative) annotations — marking them as different deviations from standard English
- 04/08/2019: v3.0, add 4 Vietnamese speakers to the corpus
- 09/28/2018: v2.0, add 10 new speakers to the corpus
- 03/26/2018: v1.0, the initial release