Blending Acoustic and Language Model Predictions for Automatic Music Transcription
Adrien Ycart, Andrew McLeod, Emmanouil Benetos and Kazuyoshi Yoshii - ISMIR 2019
This page provides the supplementary material for:
Adrien Ycart, Andrew McLeod, Emmanouil Benetos and Kazuyoshi Yoshii. "Blending Acoustic and Language Model Predictions for Automatic Music Transcription" 20th International Society for Music Information Retrieval Conference (ISMIR), November 2019, Delft, Netherlands.
The code to reproduce the experiments presented in the paper can be found at this address: https://github.com/adrienycart/MLM_decoding
The splits we used to train the acoustic model and language model are available here:
Examples of results
Here, we present two selected examples to illustrate our model's performance: one where it succeeds and one where it fails. In both cases, we display a figure comparing:
- The ground truth
- The transcription obtained by thresholding the posteriogram at 0.5
- The transcription obtained with HMM smoothing
- The transcription obtained with our system in PM+S configuration
- The posteriogram
- The predictions made by the language model
All comparisons are shown using a 16th-note timestep, which is why the number of frames differs between the two examples. The images can be displayed at full size by right-clicking the image and selecting "Open image in new tab".
We also provide the ground truth, thresholded posteriogram, HMM-smoothed posteriogram and PM+S transcriptions converted to MIDI format (download all MIDI files here).
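As a rough illustration of the two baselines listed above, the following Python sketch binarises a posteriogram at 0.5 and, alternatively, smooths one pitch's activations with a two-state (off/on) HMM decoded with the Viterbi algorithm. It is not the code from the repository: the function names, the `p_stay` transition probability, and the use of the posterior directly as the emission likelihood are assumptions made for illustration only.

```python
import math


def threshold(posteriogram, thresh=0.5):
    """Binarise a posteriogram given as a list of frames,
    each frame being a list of per-pitch probabilities."""
    return [[1 if p >= thresh else 0 for p in frame] for frame in posteriogram]


def hmm_smooth(probs, p_stay=0.9, eps=1e-9):
    """Viterbi-decode one pitch's activations with a two-state (off/on) HMM.

    probs:  per-frame activation probabilities for a single pitch.
    p_stay: hypothetical self-transition probability (not taken from the paper).
    """
    if not probs:
        return []
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)

    def emit(p):
        # Assumption: the posterior is used directly as the emission likelihood.
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))  # (off, on)

    score = list(emit(probs[0]))  # uniform prior over the initial state
    back = []                     # back-pointers for frames 1..T-1
    for p in probs[1:]:
        e = emit(p)
        new_score, ptr = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            cands = [
                score[prev] + (log_stay if prev == s else log_switch)
                for prev in (0, 1)
            ]
            ptr[s] = 0 if cands[0] >= cands[1] else 1
            new_score[s] = cands[ptr[s]] + e[s]
        score = new_score
        back.append(ptr)

    # Backtrack from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

For instance, `hmm_smooth([0.9, 0.9, 0.4, 0.9, 0.9])` keeps the note active through the dip at the third frame, whereas thresholding the same values at 0.5 splits it into two notes.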
Here, our model successfully detects the short notes around frame 250, while the two baseline models do not. This is mostly due to our language model: the MLM predictions show that it recognises the pattern of short, repeated notes.
Here, our model does not improve much over the two baseline models. If anything, it over-fragments notes (e.g. around frame 80) and even adds some false positives (e.g. around frame 160). The MLM predictions are very blurry, failing to predict any note with confidence.