Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words

Crawford, Tim; Badkobeh, Golnaz; and Lewis, David. 2018. 'Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words'. In: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018. [Conference or Workshop Item]

We define three retrieval tasks requiring efficient search of the musical content of a collection of ~32k page images of 16th-century music to find: duplicates; pages with the same musical content; pages of related music. The images are subjected to Optical Music Recognition (OMR), introducing inevitable errors. We encode pages as strings of diatonic pitch intervals, ignoring rests, to reduce the effect of such errors. We extract indices comprising lists of two kinds of ‘word’. Approximate matching is done by counting the number of common words between a query page and those in the collection. The two word-types are (a) normal ngrams and (b) minimal absent words (MAWs). The latter have three important properties for our purpose: they can be built and searched in linear time, the number of MAWs generated tends to be smaller, and they preserve the structure and order of the text, obviating the need for expensive sorting operations. We show that retrieval performance of MAWs is comparable with ngrams, but with a marked speed improvement. We also show the effect of word length on retrieval. Our results suggest that an index of MAWs of mixed length provides a good method for these tasks which is scalable to larger collections.
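The indexing and matching idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the letter alphabet used for intervals, the function names, and the brute-force MAW enumeration are all assumptions made for illustration. In particular, the enumeration below is a naive one and does not achieve the linear-time construction the abstract refers to. A word is treated as a minimal absent word of a string if it does not occur in the string while both its longest proper prefix and its longest proper suffix do.

```python
def diatonic_intervals(steps):
    """Encode a sequence of diatonic step numbers (rests already removed)
    as a string of intervals. Hypothetical alphabet: '-' for a repeated
    note, 'A', 'B', ... for rising steps, 'a', 'b', ... for falling steps."""
    out = []
    for prev, cur in zip(steps, steps[1:]):
        d = cur - prev
        if d == 0:
            out.append("-")
        else:
            size = min(abs(d), 26)  # clamp very large leaps to one symbol
            base = ord("A") if d > 0 else ord("a")
            out.append(chr(base + size - 1))
    return "".join(out)


def factors(s, k):
    """All substrings of s with length exactly k."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def ngrams(s, n):
    """Ordinary ngram 'words': every length-n substring of s."""
    return factors(s, n)


def minimal_absent_words(s, max_len):
    """Naive enumeration of MAWs of length 2..max_len over the symbols
    that occur in s (illustrative only, not linear time)."""
    alphabet = set(s)
    maws = set()
    for k in range(2, max_len + 1):
        shorter = factors(s, k - 1)
        present = factors(s, k)
        for prefix in shorter:          # the longest proper prefix occurs in s
            for symbol in alphabet:
                word = prefix + symbol
                # absent from s, but its longest proper suffix occurs in s
                if word not in present and word[1:] in shorter:
                    maws.add(word)
    return maws


def shared_words(query_words, page_words):
    """Approximate-match score: number of words common to both pages."""
    return len(query_words & page_words)
```

Under these assumptions, a query page and a collection page would each be reduced to a set of words (ngrams or MAWs), and `shared_words` would rank collection pages by how many words they have in common with the query.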


Published Version
Available under Creative Commons: Attribution 4.0
