Lancaster-Los Angeles Spoken Chinese Corpus

Taxonomy :

The Lancaster-Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese. It consists of dialogues (55%) and monologues (45%), covering both spontaneous (57%) and scripted (43%) speech.


- Other info -

Language(s) :

spoken Mandarin Chinese

Types : monolingual corpus
Domain : face-to-face conversation; telephone conversation between overseas Chinese and their family in China; play/movie scripts; TV talk show transcripts; formal debates between university students recorded between 1993 and 2002; spontaneous oral narratives of native Beijing residents; edited oral narratives
Size : 1,002,151 words, corresponding to 73,976 sentences and 49,670 utterance units (paragraphs)
Developer : Hongyin Tao, Richard Xiao
Availability : Free
Update : 2006