MIDI2vec: Learning MIDI Embeddings for Reliable Prediction of Symbolic Music Metadata

Tracking #: 2725-3939

Pasquale Lisena
Albert Meroño-Peñuela
Raphael Troncy

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
An important problem in large symbolic music collections is the low availability of high-quality metadata, which is essential for various information retrieval tasks. Traditionally, systems have addressed this by relying either on costly human annotations or on rule-based systems at limited scale. Recently, embedding strategies have been exploited for representing latent factors in graphs of connected nodes. In this work, we propose MIDI2vec, a new approach for representing MIDI files as vectors based on graph embedding techniques. Our strategy consists of representing the MIDI data as a graph, including the information about tempo, time signature, programs and notes. Next, we run and optimise node2vec for generating embeddings using random walks in the graph. We demonstrate that the resulting vectors can successfully be employed for predicting the musical genre and other metadata such as the composer, the instrument or the movement. In particular, we conduct experiments using those vectors as input to a Feed-Forward Neural Network, and we report accuracy scores comparable with those of other approaches that rely purely on symbolic music, while avoiding feature engineering and producing highly scalable and reusable models with low dimensionality. Our proposal has real-world applications in automated metadata tagging for symbolic music, for example in digital libraries for musicology, datasets for machine learning, and knowledge graph completion.
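The pipeline sketched in the abstract (MIDI features → graph → random walks → embeddings) can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the paper uses node2vec with biased walks and a learned skip-gram model, whereas the toy code below uses uniform walks (node2vec with p = q = 1) and stops at walk generation; all function names and the feature encoding are hypothetical.

```python
import random
from collections import defaultdict

def midi_to_graph(midi_features):
    """Build an undirected graph linking each MIDI file node to the
    feature-value nodes it contains (tempo, program, time signature...).
    Two MIDI files become connected through shared feature nodes."""
    graph = defaultdict(set)
    for midi_id, features in midi_features.items():
        for kind, value in features:
            feature_node = f"{kind}:{value}"
            graph[midi_id].add(feature_node)
            graph[feature_node].add(midi_id)
    return graph

def random_walks(graph, num_walks=10, walk_length=8, seed=42):
    """Generate uniform random walks over the graph. In the full
    approach, such walks are fed as 'sentences' to a skip-gram
    model (e.g. word2vec) to obtain one vector per node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in graph:
            walk = [start]
            for _ in range(walk_length - 1):
                # every node has at least one neighbour by construction
                walk.append(rng.choice(sorted(graph[walk[-1]])))
            walks.append(walk)
    return walks

# Toy example: two files sharing a tempo node become reachable
# from one another, so they co-occur on walks and end up with
# similar embeddings downstream.
features = {
    "midi:a": [("tempo", 120), ("program", 0)],
    "midi:b": [("tempo", 120), ("program", 40)],
}
g = midi_to_graph(features)
walks = random_walks(g)
```

The vectors kept for classification are only those of the MIDI file nodes; the feature nodes exist to create paths between files that share properties.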

Major Revision

Solicited Reviews:
Review #1
By Lyndon Nixon submitted on 22/Mar/2021
Minor Revision
Review Comment:

The paper presents an interesting approach to produce graph embeddings of MIDI music files, and use those embeddings for tasks such as predicting genre or automatic metadata extraction. The entire paper is well written, easy to read and I appreciate the additional explanations which make the paper very accessible also to readers without a strong background in areas like graph embeddings or knowledge of MIDI files.

Introduction: The advantage of embedding approaches, which discover latent features in the data, over the need to pre-select features in a feature engineering approach rounds off the contribution of the introduction.
This argumentation is repeated in the Related Work section, with some unnecessary duplication (page 3, right column, lines 17-33). Consider shortening the argumentation in the introduction, which is based on others' work, mentioning only the conclusions you follow in your own work and leaving the detail to the Related Work section.
Following the longer introduction, the Related Work section feels too brief. You focus on work with vectors (in machine or deep learning). I suggest also adding an overview of work on 'music knowledge graphs', since you also need to represent the MIDI files in a graph model before creating the embeddings. Contrast others' ontological decisions with yours (choice of properties, format of property values, etc.).

In section 3, the bold face for DeepWalk is unnecessary. I would cover the MIDI2graph mapping first and then the approaches to graph embedding, so that the choice of embedding approach can be justified by the structure and content of the graph (e.g. you could explain your reasons for the particular configuration of node2vec you report).

In section 4.2, I was not sure why you created an embedding vector for each node in the graph and then excluded every vector generated for a node that was not a MIDI file. What is the difference from using only the nodes representing MIDI files as input to the vector computation?
The test set is admitted to have been created by an "approach ... not commonly applied". Can you provide additional justification for why this approach was chosen?
Section 5.2. Do you consider it a limitation that your metadata prediction experiment was limited to music from a single genre?
'The interlinking gives access to precise metadata' > how did you come to this conclusion?
p. 12 right column line 22: loose > lose (or lost)
'All those results should be analysed with a grain of salt, given the absence of balance between classes in the dataset.' - this could be better phrased, as it currently reads as if the whole experiment is questionable in terms of any useful findings. Either the experiment needs to be repeated in a different form, or you could phrase this more constructively, e.g. while encouraging, further tests are needed with datasets that have a better balance between classes.
Section 5.3 introduces a much larger and mixed MIDI dataset, why wasn't this used for artist or instrument prediction?

In Section 6 the mention of temporal graphs makes sense for time-based media like an audio track. Also I am not sure if the current approach can work with fragments of the audio, i.e. find that a certain segment of a MIDI file is very similar in some way to some other segment (as opposed to a description of the entire MIDI file).
The availability of larger datasets would be ideal, since deep learning approaches are supposed to be more effective at scale. Maybe crowdsourcing could be considered as a means to annotate a large dataset such as the one mentioned (300k MIDI files). Finally, the only disappointment I can mention is that you emphasised at the beginning the value of embeddings for latent feature extraction, but the rest of the work does not further consider this. Are there no plans to use MIDI2vec in this way?

The paper is worthy of publication though I would hope it can be revised to answer the questions that this review raises.

Review #2
Anonymous submitted on 22/Apr/2021
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

In the paper "MIDI2vec: Learning MIDI Embeddings for Reliable Prediction of Symbolic Music Metadata", the authors present a new method called MIDI2vec. It represents MIDI data as vector-space embeddings for automated metadata classification. MIDI files are first expressed as graphs, and two MIDI files are connected if they share the same resources. Graph embedding techniques (e.g. node2vec) are employed to traverse the MIDI graphs with random walks and represent the information of the traversed paths as numeric vectors.
The paper is well written and structured and presents innovative ideas. A number of questions and issues should be addressed before the paper may be accepted.

Within the Evaluation section, the authors comment on Table 1, indicating that their approach outperforms the baseline when only symbolic data are used as input, and report an accuracy of 86% for 5-class and 67% for 10-class prediction. It is not clear to me (and the authors should point this out) why they consider their ALL case as the one to be compared against the baseline that uses just the symbolic music. Wouldn't it be fairer to compare the S row of the baseline in Table 1 against the N row of their approach in the same table? The same doubts hold for the remainder of section 5.1. Also, for section 5.1, what was the split between training and test data (i.e. how many MIDI files were used for training and how many for testing)?
In section 5.2, a question the authors should address relates to Table 2. Can the authors comment on why, for instrument prediction, their method achieves 48.6% in the 4-class setting, which is the lowest accuracy of the four class settings? The settings with 9 and 10 classes have higher accuracy, which is counter-intuitive. One would naively expect a behaviour similar to jSymbolic's (where the accuracy for 6 classes is better than for 9 and 10 classes and lower than for 4 classes).

One question relates to section 4.1. Has the information about Tempo, Programs, Time signature and Notes been manually annotated for the input dataset? If so, the process should be explained as well, given that it is not easy and likely hides some complexities.

I see that the embeddings and experiment data are available in public notebooks. What about the generated raw graphs? Are they available in some public repository too?

The authors should also comment on what happens if they use their method on a different domain (not songs, but something like soundtrack music from video games). For example, for the dataset provided at https://github.com/chrisdonahue/nesmdb, would the proposed method be effective, or would something need to be carefully redesigned or observed?

The paper needs proofreading and there are typos to be fixed. Some of them are:
- In the introduction section: "Most of this systems".
- In section 5.3 "classifier build on top of jSym-bolic"
- In section 5.2 "The consistent number of classes do not let us detect"

Review #3
Anonymous submitted on 21/Jun/2021
Minor Revision
Review Comment:

In the article "MIDI2vec: Learning MIDI Embeddings for Reliable Prediction of Symbolic Music Metadata", the authors present a new approach for representing MIDI files as vectors based on graph embedding techniques. The authors have evaluated the MIDI2vec representation by comparing prediction results on different datasets, testing MIDI2vec + Feed-Forward Neural Network against other well-known approaches in the domain and obtaining competitive results.

I think the paper is appropriate for the journal and should be accepted with only minor revisions. Below I highlight some of the points of improvement I detected in my review.

- In the introduction the authors state that Recurrent Neural Networks have been used in the literature for the task of Music Generation. However, the authors decided to use a Feed-Forward Neural Network without explaining why they chose this instead of other neural networks.
- In Page 3. What do the authors mean with “feature engineering - common in pre-deep machine learning”?
- In section 4.1 the authors describe the content of the graph and how to convert the MIDI file and all its content to nodes. However, it is not fully clear what the content of a group of notes is, nor how the authors address the fact that a group of notes can appear more than once in the MIDI files. Since the nodes are unique, will a new node (identical to the previous one) be created, or do you just add a counter to the node storing the number of occurrences of the group of notes in the MIDI file?
- Do the authors connect the notes to the program used to produce them?
- What is the size of the classes in MuseData dataset?

Minor Fixes:
Page 1. “Most of this systems” -> “Most of these systems”
Page 9. Figure 6 is cited before Figure 5
Figure 6b. Labels are not readable in an A4 printed page
Page 10. Figure 8 is cited before Figure 7
Page 13. “build on top of” -> “built on top of”
Figure 11a. X and Y axis labels are not readable