Simon/Contribute Data: Difference between revisions

    From KDE UserBase Wiki
    (Created page with "To build a speech recognition system, several types of data files are required: * A phonetic dictionary to learn how words are pronounced * Transcribed audio samples to learn ...")
     

    Revision as of 14:29, 10 August 2013

    To build a speech recognition system, several types of data files are required:

    • Transcribed audio samples, to learn how people pronounce the sounds (phonemes) that make up the words in the dictionary
    • Large corpora of written text, to learn which word sequences commonly occur together (this provides context for the recognizer)
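
To illustrate what "learning which words commonly occur together" can mean in practice, here is a minimal Python sketch that counts adjacent word pairs (bigrams) in a tiny text corpus. It is only an illustration under stated assumptions: the function name `bigram_counts` and the three sample sentences are hypothetical, and this is not necessarily how Simon or its underlying toolkits build their language models.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs (bigrams) across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase and split on whitespace.
        words = sentence.lower().split()
        # Pair each word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical sample corpus; a real corpus would be far larger.
corpus = [
    "open the file",
    "close the file",
    "open the window",
]

counts = bigram_counts(corpus)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs that appear often in the corpus (here "open the" and "the file") are the kind of context a recognizer can use to prefer likely word sequences over unlikely ones, which is why larger and more varied text collections are helpful.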

    For us to develop a system that you can use out-of-the-box, we depend on large collections of such data. If you can contribute to any of these collections (regardless of the language), please consider getting in touch with us!

    If your corpus contains data from people other than yourself, please make sure that you have the permissions required to distribute the data set.

    Recordings

    Text