Speech corpus  

From The Art and Popular Culture Encyclopedia



A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models, which can then be used with a speech recognition engine. In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.
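
As a rough illustration of this definition, a corpus can be modelled as a collection of utterances, each pairing an audio file with its transcription. The names `Utterance` and `SpeechCorpus` and the file paths below are hypothetical, chosen only for this sketch; real corpora use a variety of layouts and carry much richer metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its text transcription."""
    audio_path: str      # hypothetical path to a recording
    transcription: str   # text of what was said


@dataclass
class SpeechCorpus:
    """A speech corpus sketched as a plain list of audio/transcription pairs."""
    name: str
    utterances: list = field(default_factory=list)

    def add(self, audio_path: str, transcription: str) -> None:
        """Append one audio/transcription pair to the corpus."""
        self.utterances.append(Utterance(audio_path, transcription))

    def vocabulary(self) -> set:
        """Lower-cased set of whitespace-separated words in all transcriptions."""
        return {w.lower() for u in self.utterances for w in u.transcription.split()}


corpus = SpeechCorpus("demo")
corpus.add("audio/0001.wav", "one two three")
corpus.add("audio/0002.wav", "Three four")
print(len(corpus.utterances), sorted(corpus.vocabulary()))
```

A word list of this kind is the sort of by-product a speech technology pipeline might derive from the transcriptions; the audio files themselves are only referenced here, not read.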

A corpus is a single such database; corpora is the plural of corpus and refers to several such databases.

There are two types of speech corpora:

  • (1) Read speech, which includes:
      • Book excerpts
      • Broadcast news
      • Lists of words
      • Sequences of numbers
  • (2) Spontaneous speech, which includes:
      • Dialogs: between two or more people (including meetings)
      • Narratives: a person telling a story (one such corpus is the Buckeye Corpus)
      • Map-tasks: one person explains a route on a map to another
      • Appointment-tasks: two people try to find a common meeting time based on their individual schedules
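
The classification above can be sketched as a small lookup table. The enum, the function name `classify` and the short label strings are assumptions made for illustration; they are not a standard annotation scheme.

```python
from enum import Enum


class SpeechType(Enum):
    """The two top-level types of speech corpora named in the list above."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


# Subtypes from the list above, keyed by hypothetical short labels.
SUBTYPES = {
    "book_excerpt": SpeechType.READ,
    "broadcast_news": SpeechType.READ,
    "word_list": SpeechType.READ,
    "number_sequence": SpeechType.READ,
    "dialog": SpeechType.SPONTANEOUS,
    "narrative": SpeechType.SPONTANEOUS,
    "map_task": SpeechType.SPONTANEOUS,
    "appointment_task": SpeechType.SPONTANEOUS,
}


def classify(subtype: str) -> SpeechType:
    """Return the top-level speech type for a subtype label."""
    return SUBTYPES[subtype]


print(classify("map_task").value)
```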

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.

See also


  • Edwards, Jane / Lampert, Martin (eds.) (1992): Talking Data – Transcription and Coding in Discourse Research. Hillsdale: Erlbaum.
  • Leech, Geoffrey / Myers, Greg / Thomas, Jenny (eds.) (1995): Spoken English on Computer: Transcription, Markup and Application. Harlow: Longman.

Unless indicated otherwise, the text in this article is either based on Wikipedia article "Speech corpus" or another language Wikipedia page thereof used under the terms of the GNU Free Documentation License; or on original research by Jahsonic and friends. See Art and Popular Culture's copyright notice.
