Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them.
The goal of this chapter is to answer the following questions. Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the lifecycle of a corpus.
Don't leave error reporting to the browser's auto-generated default messages. Explain to your users what is wrong and suggest solutions.
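As a sketch of this principle (shown here in Python as a server-side analogue, with a hypothetical `describe_error` helper and an assumed `age` field), a validator can return a message that names the exact problem and proposes a fix instead of a generic failure:

```python
def describe_error(field, value):
    """Return a specific, actionable message for a bad input, or None if valid.

    Hypothetical example: a field that must contain a whole number.
    """
    if value.strip() == "":
        return f"{field} is required - please enter a whole number."
    if not value.strip().lstrip("-").isdigit():
        return (f"'{value}' is not a whole number - "
                f"please enter {field} using digits only, e.g. 34.")
    return None  # input is acceptable
```

For instance, `describe_error("age", "abc")` yields a message telling the user both what went wrong and what a correct entry looks like, while `describe_error("age", "34")` returns `None`.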
Here you can see screen captures from Firefox and Chrome: the alert messages are generated entirely by the browser and are even translated automatically into different languages, something that would be almost impossible using JavaScript alone.
Additionally, the design strikes a balance between having multiple speakers say the same sentences, which permits comparison across speakers, and covering a large range of sentences, which maximizes coverage of diphones.
As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.
The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization.
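That organization shows up in TIMIT's utterance identifiers, which pack the dialect region, the speaker, and the sentence into a single string such as `dr1-fvmh0/sx206` (this is the form used, for example, by NLTK's sample of the corpus). A minimal decoding sketch; the field meanings follow TIMIT's standard naming conventions:

```python
def parse_timit_id(utterance_id):
    """Split a TIMIT utterance id like 'dr1-fvmh0/sx206' into its parts.

    Field meanings follow TIMIT's naming conventions:
    dialect region (dr1-dr8), speaker (sex initial + initials + index),
    sentence type (sa = dialect, si = diverse, sx = compact).
    """
    speaker_part, sentence = utterance_id.split("/")
    region, speaker = speaker_part.split("-")
    return {
        "region": region,               # e.g. 'dr1'
        "sex": speaker[0],              # 'f' or 'm'
        "speaker": speaker,             # e.g. 'fvmh0'
        "sentence_type": sentence[:2],  # 'sa', 'si', or 'sx'
        "sentence": sentence,           # e.g. 'sx206'
    }
```

Because every identifier is built this way, simple string operations are enough to group utterances by region, speaker, or sentence type.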
Even if users correctly enter an integer, for example, you might need to make sure that the value falls within a certain range.
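A minimal sketch of such a range check (the bounds and the two-value return shape are assumptions for illustration): convert the input first, then test the bounds explicitly, reporting each failure separately.

```python
def parse_in_range(text, low, high):
    """Parse text as an integer and confirm it lies in [low, high].

    Returns (value, None) on success, or (None, error_message) on failure.
    """
    try:
        value = int(text.strip())
    except ValueError:
        return None, f"'{text}' is not a whole number."
    if not low <= value <= high:
        return None, (f"{value} is out of range - "
                      f"enter a number from {low} to {high}.")
    return value, None
```

Separating the two checks lets the caller tell the user whether the problem is the format ("not a whole number") or the value ("out of range").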
Important: validating user input is also critical for security.
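Client-side checks can be bypassed, so the same validation must be repeated on the server. A common pattern, sketched here with an assumed username policy, is to accept only input that matches an explicit allowlist rather than trying to strip out dangerous characters:

```python
import re

# Assumed policy for illustration: usernames are 3-20 characters,
# letters, digits, and underscores only.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,20}")

def is_valid_username(text):
    """Server-side check: allow only what the policy permits, reject the rest."""
    return bool(USERNAME_RE.fullmatch(text))
```

Allowlisting is safer than blocklisting because it rejects anything unanticipated, including injection payloads that a blocklist's author never thought of.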