An Internet agent for language model construction

Wyard, Peter and Rose, Tony. 1997. 'An Internet agent for language model construction'. In: Recent Advances in Natural Language Processing. Bulgaria. [Conference or Workshop Item]


A software agent is described that takes a seed (reference) corpus specified by the user, searches the Internet for documents sufficiently similar to the seed corpus (as judged by a set of similarity metrics operating at several levels of the text), and augments the seed corpus with these documents. The size of the corpus and, it is hoped, the quality of the derived language model thus increase progressively. The seed corpus may be quite small, for example a collection of transcripts from the application domain gathered with minimal effort. Preliminary results are given for the perplexity of language models constructed using this approach. The method potentially has applications well beyond speech recognition, in corpus-based language processing in general and in document retrieval.
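
The loop described in the abstract (score candidate documents against the seed corpus, keep the similar ones, then measure perplexity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the similarity metrics or the language model, so this sketch assumes a single word-level cosine similarity over unigram counts and an add-one-smoothed unigram model. The function names, the whitespace tokenizer, and the threshold value are all hypothetical, and candidate documents are passed in as strings rather than fetched from the Internet.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def cosine_similarity(a, b):
    # Cosine similarity between two word-count vectors (Counter objects).
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment_corpus(seed_docs, candidate_docs, threshold=0.3):
    # Hypothetical acceptance rule: keep a candidate when its similarity to the
    # current corpus profile reaches the threshold, then fold it into the profile.
    corpus = list(seed_docs)
    profile = Counter()
    for doc in corpus:
        profile.update(tokenize(doc))
    for doc in candidate_docs:
        counts = Counter(tokenize(doc))
        if cosine_similarity(profile, counts) >= threshold:
            corpus.append(doc)
            profile.update(counts)
    return corpus


def unigram_perplexity(train_docs, test_doc):
    # Perplexity of an add-one-smoothed unigram model on held-out text.
    # Assumption: the vocabulary is the union of training and test words.
    counts = Counter()
    for doc in train_docs:
        counts.update(tokenize(doc))
    test_tokens = tokenize(test_doc)
    if not test_tokens:
        raise ValueError("test_doc contains no tokens")
    vocab_size = len(set(counts) | set(test_tokens))
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in test_tokens
    )
    return math.exp(-log_prob / len(test_tokens))


if __name__ == "__main__":
    seed = ["book a flight to paris", "book a train to london"]
    candidates = [
        "book a flight to london please",
        "stock prices fell sharply today",
    ]
    grown = augment_corpus(seed, candidates)
    print(len(grown), "documents after augmentation")
    print("perplexity:", unigram_perplexity(grown, "book a flight to london"))
```

A lower perplexity on held-out in-domain text would indicate a better-fitting model, but on toy data like this the figure is not meaningful; the sketch only shows where the similarity filter and the evaluation step sit relative to each other.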

Item Type	Conference or Workshop Item (Paper)
Departments, Centres and Research Units	Computing
Date Deposited	04 Jun 2021 13:23
Last Modified	10 Jun 2021 03:23