Simon/Contribute Data
Revision as of 14:45, 10 August 2013
To build a speech recognition system, several types of data files are required:
- Transcribed audio samples, to learn how a human pronounces the sounds (phonemes) that make up the words in the dictionary
- Large corpora of written text, to learn which word structures commonly co-occur (this provides context for the recognizer)
For us to develop a system that you can use out-of-the-box, we depend on large collections of such data. If you can contribute to any of these collections (regardless of the language), please consider getting in touch with us (see below).
Please keep in mind that in order to facilitate free and open collaboration, the data you share with us needs to be distributable under a permissive license! Should your corpus contain data from other people besides yourself, please make sure that you have the required permissions.
Recordings
Do you have access to a large set of audio recordings of human speech? Do you have (at least rough) transcripts of these recordings?
If you share this data with us, we can use the recordings to train what is called the acoustic model: the part that tells Simon how words actually sound. To submit the data, please contact us. We can help you sort out the data format and licensing, and provide you with web space to publish the data. Chances are that we'll also thank you profusely in the process.
Text
Do you have text from, for example, your blog, the newspaper you manage, or a book you've been writing?
If you share this data with us, we can help Simon better understand what "normal" sentences look like. To give us access to your texts, please drop us a line. We can then guide you through uploading the machine-readable plain text to our open database.
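To give a rough idea of why plain text helps, here is a minimal sketch of one common way to capture which words tend to follow each other: counting adjacent word pairs (bigrams). This is only an illustration and not necessarily how Simon processes submitted text; the function name `bigram_counts` and the three example sentences are made up for this sketch.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of plain-text sentences.

    Hypothetical helper for illustration only; not part of Simon.
    """
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace; real corpora need more careful cleanup.
        words = sentence.lower().split()
        # Pair each word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts


# Made-up example corpus.
corpus = [
    "turn on the light",
    "turn off the light",
    "turn on the radio",
]
counts = bigram_counts(corpus)
```

With counts like these, a recognizer can prefer word sequences that actually occur in written text over ones that do not. The larger and more varied the corpus, the more reliable such counts become, which is why large text collections are so valuable.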