Review Comment:
The paper presents an interesting approach to producing graph embeddings of MIDI music files and using those embeddings for tasks such as genre prediction or automatic metadata extraction. The paper is well written and easy to read, and I appreciate the additional explanations, which make it very accessible also to readers without a strong background in areas such as graph embeddings or knowledge of MIDI files.
Introduction: The advantage of embedding approaches, which discover latent features in the data, over feature engineering approaches, which require pre-selecting features, rounds out the contribution of the introduction.
This argument is repeated in the Related Work section, partly with unnecessary repetition (page 3, right column, lines 17-33). Consider shortening the argumentation in the introduction that is based on others' work, mentioning only the conclusions you follow in your own work, and leaving the details to the Related Work section.
Following the long introduction, the Related Work section feels too brief. It focuses on work with vectors (in machine or deep learning). I suggest also adding an overview of work on 'music knowledge graphs', since you also need to represent the MIDI files in a graph model before creating the embeddings. Contrast others' ontological decisions with yours (choice of properties, format of property values, ...).
In Section 3, the bold face for DeepWalk is unnecessary. I would cover the MIDI2graph mapping first and then the approaches to graph embedding, as this ordering lets you justify the choice of embedding approach by the structure and content of the graph (e.g., you could explain the reasons for the particular node2vec configuration you report).
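To illustrate the kind of justification I have in mind, here is a minimal sketch using the Python node2vec package, with made-up hyperparameter values rather than the configuration reported in the paper; each value embodies an assumption about the graph that the text could motivate:

```python
import networkx as nx
from node2vec import Node2Vec

# Toy stand-in for the MIDI-derived graph (hypothetical).
graph = nx.fast_gnp_random_graph(n=100, p=0.05, seed=42)

node2vec = Node2Vec(
    graph,
    dimensions=100,  # embedding size: why 100 rather than, say, 64 or 300?
    walk_length=10,  # walk length relative to the graph's diameter
    num_walks=40,    # number of walks started per node
    p=1.0,           # return parameter: bias towards revisiting nodes
    q=1.0,           # in-out parameter: BFS-like vs. DFS-like exploration
    workers=4,
)
model = node2vec.fit(window=5, min_count=1)  # gensim Word2Vec under the hood
vector = model.wv["0"]  # embedding of node 0 (node ids are stringified)
```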
In Section 4.2, I was not sure why you created an embedding vector for each node in the graph and then excluded every vector generated for a node that does not represent a MIDI file. What is the difference compared to using only the nodes representing MIDI files as input to the vector computation?
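If I understand the pipeline correctly, it amounts to something like the following (continuing the sketch above; `midi_node_ids` is a hypothetical set of identifiers for the nodes that represent MIDI files):

```python
# Hypothetical identifiers of the nodes that represent MIDI files.
midi_node_ids = {"0", "1", "2"}

# After training, an embedding exists for every node in the graph...
all_vectors = {node: model.wv[node] for node in model.wv.index_to_key}

# ...but only the vectors of MIDI-file nodes are kept downstream.
midi_vectors = {node: vec for node, vec in all_vectors.items()
                if node in midi_node_ids}
```

My guess is that the non-MIDI nodes cannot simply be dropped before training, because the random walks that provide the context for each MIDI node traverse them; if that is the reason, stating it explicitly would answer the question.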
The test set is admittedly created by an "approach ... not commonly applied". Can you provide additional justification for why this approach was nonetheless chosen?
Section 5.2. Do you consider it a limitation that your metadata prediction experiment was limited to music from a single genre?
'The interlinking gives access to precise metadata' > how did you come to this conclusion?
p. 12 right column line 22: loose > lose (or lost)
'All those results should be analysed with a grain of salt, given the absence of balance between classes in the dataset.' - this could be phrased better, as it currently reads as if the whole experiment is questionable in terms of any useful findings. Either the experiment needs to be repeated in a different form, or you could phrase this more constructively, e.g. 'while encouraging, these results call for further tests on datasets with a better balance between classes'. Reporting imbalance-robust metrics would also help; see the sketch below.
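As a purely illustrative sketch (made-up labels, not your data), metrics such as macro-averaged F1 or balanced accuracy remain informative under class imbalance, whereas plain accuracy does not:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Made-up labels: a degenerate classifier that always predicts "rock".
y_true = ["rock", "rock", "rock", "jazz"]
y_pred = ["rock", "rock", "rock", "rock"]

# Plain accuracy would be 0.75, hiding the completely missed "jazz" class.
print(balanced_accuracy_score(y_true, y_pred))    # 0.5
print(f1_score(y_true, y_pred, average="macro"))  # ~0.43
```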
Section 5.3 introduces a much larger and more mixed MIDI dataset; why was this not used for artist or instrument prediction?
In Section 6, the mention of temporal graphs makes sense for time-based media such as an audio track. I am also not sure whether the current approach can work with fragments, i.e. find that a certain segment of a MIDI file is very similar in some way to another segment (as opposed to producing a description of the entire MIDI file).
The availability of larger datasets would be ideal, since deep learning approaches are supposed to be more effective at scale. Maybe crowdsourcing could be considered as a means to annotate a large dataset such as the one mentioned (300k MIDI files). Finally, the only disappointment I can mention is that you emphasized the value of embeddings for latent feature extraction at the beginning, but the rest of the work does not consider this further. Are there no plans to use MIDI2vec in this way?
The paper is worthy of publication, though I hope it can be revised to answer the questions this review raises.