An Internet agent for language model construction
We describe a software agent that takes a seed (reference) corpus specified by the user, searches the Internet for documents that are sufficiently similar to the seed corpus (as defined by a set of similarity metrics operating at a number of levels in the text), and augments the seed corpus with these documents. The corpus thus grows progressively and, it is hoped, the quality of the derived language model improves with it. The seed corpus may be quite a small collection of transcripts from the application domain, such as may be collected with minimal effort. Preliminary results are given for the perplexity of language models constructed using this approach. The method potentially has applications well beyond speech recognition, in corpus-based language processing in general and in document retrieval.
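The augmentation loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity metrics, the acceptance threshold, or the language model type, so everything below is an assumption made for illustration: a single word-level cosine similarity stands in for the paper's multi-level metrics, the `threshold` value is arbitrary, candidate documents are passed in as plain strings rather than fetched from the Internet, and perplexity is computed with an add-one smoothed unigram model. The function names (`tokenize`, `cosine_similarity`, `augment_corpus`, `unigram_perplexity`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def cosine_similarity(a, b):
    # Cosine similarity between two word-count vectors (Counter objects).
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment_corpus(seed_docs, candidate_docs, threshold=0.3):
    # Start from the seed corpus and keep a running word-count profile of it.
    corpus = list(seed_docs)
    profile = Counter()
    for doc in corpus:
        profile.update(tokenize(doc))

    # Accept each candidate that is similar enough to the current corpus;
    # accepted documents also update the profile, so the corpus grows
    # progressively.
    for doc in candidate_docs:
        counts = Counter(tokenize(doc))
        if cosine_similarity(profile, counts) >= threshold:
            corpus.append(doc)
            profile.update(counts)
    return corpus


def unigram_perplexity(train_docs, test_doc):
    # Add-one smoothed unigram model; a deliberately simple stand-in for
    # whatever language model is actually derived from the corpus.
    counts = Counter()
    for doc in train_docs:
        counts.update(tokenize(doc))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words
    tokens = tokenize(test_doc)
    if not tokens:
        raise ValueError("test_doc must contain at least one token")
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))
```

Calling `augment_corpus` with a few in-domain seed sentences and a mixed list of candidates returns the seed documents followed by the accepted candidates, and `unigram_perplexity` can then be evaluated on held-out in-domain text before and after augmentation. Whether perplexity actually drops depends on the data and the threshold; the sketch makes no claim about the results reported in the paper.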
| Field | Value |
|---|---|
| Item Type | Conference or Workshop Item (Paper) |
| Departments, Centres and Research Units | Computing |
| Date Deposited | 04 Jun 2021 13:23 |
| Last Modified | 10 Jun 2021 03:23 |