Review Comment:
The paper discusses the task of assigning topical labels to RDF datasets. The authors look at both types of this task: the single-label and the multi-label classification task. In Section 1, the authors motivate the task; in Section 2, they define the term "topic" and describe the LOD cloud dataset the paper focuses on. Section 3 describes requirements for the benchmark. Section 4 describes how the benchmark is created, while Section 5 gives detailed insights into the data the benchmark relies on. Section 6 describes the different feature vectors, classification algorithms, sampling techniques and normalization techniques the benchmarked approaches rely on. In Section 7, the approaches are benchmarked with both available benchmarks: the already existing single-topic benchmark and the newly developed multi-topic benchmark. In Section 8, the authors discuss the results of Section 5. Section 9 presents related work and Section 10 summarizes the paper.
I was happy to see that the quality of the paper has been improved. Most issues have been addressed by the authors, either by improving the paper or by giving convincing arguments why this is not possible. However, I see some minor issues that are still left. Given these, together with the writing style problems and the issues in the references, I cannot accept the paper in its current form.
=== Minor issues
- On Page 5, the authors describe the requirements for a benchmark. They describe "Scalability" with "The benchmark should be scalable and not have bias towards a specific technique". I am not convinced that being free of bias is related to scalability. Note that [44] mentions being free of bias under portability. After that, the authors describe the requirement "Solvability" with "Running the benchmark and measuring its performance is not difficult". From my point of view, this does not follow the definition in [44] and should be part of Clarity. If I misinterpret that sentence, the authors should make it clearer.
- On Page 13, the authors describe that Table 7 "shows the results of ALL feature vector and the combination of CURI, PURI, LCN and LPN". However, the table shows ALL and two pairs, namely PURI&CURI and LPN&LCN.
- On Page 19, there is still a strange short paragraph at the end of the related work section that does not make sense to me. It refers to [38] and claims that it "uses LDA for the topical extraction of RDF datasets". However, the reference describes "Probabilistic latent semantic indexing" and does not do that. In the response, the authors wrote that it has been fixed, but it is still wrong and, from my point of view, should be removed.
=== Writing Style
The paper still has a high number of grammatical errors and typos. I think that it has been improved compared to the first version. However, these errors should be avoided.
Again, I would suggest that the paper be proofread by somebody with strong English skills, since my English skills are limited and I cannot guarantee that my list below is complete.
From my point of view, the paper has three major grammar issues:
1. In general, the authors should stick to a single tense. The major part of the paper is written in simple present but some parts are written in past tense.
2. When writing in simple present, the authors forget to attach the ‘s’ when using the third person (singular).
3. Sometimes, the authors mix up singular and plural.
- Page 2 "Especially when the dataset do not" --> either "datasets do not" or "dataset does not"
- Page 4 "strengths and weakness" --> "strengths and weaknesses"
- Page 4 "benchmarks results" --> "benchmark results"
- Page 4 "required library" --> "required libraries"
- Page 6 "we consider as a gold standard" --> "we consider it as a gold standard"
- Page 6 "features extraction" --> "feature extraction"
- Page 7 "because want verify" --> "because we want to verify"
- Page 7 "We then summed up" --> "We then sum up"
- Page 7 "we extracted all" --> "we extract all"
- Page 7 (2x) "and generated" --> "and generate"
- Page 7 "we collected" --> "we collect"
- Page 7 (3x) "extracted" --> "extract"
- Page 7 "We used" --> "We use"
- Page 7 "lower-cased" AND "lowercased" --> "lowercase"
- Page 7 (2x) "tokenized" --> "tokenize"
- Page 7 "calculated" --> "calculate"
- Page 7 "This resulted" --> "This results"
- Page 8 "extracted" --> "extract"
- Page 8 "We were able" --> "We are able"
- Page 8 "Decision Trees are a powerful classification algorithms" --> "Decision Trees are a set of powerful classification algorithms" (or something similar)
- Page 8 "The decision tree is a tree with decision nodes which has two or more branches and leaf nodes that represents a classification or a decision" --> "The decision tree is a tree with decision nodes which have two or more branches and leaf nodes that represent a classification or a decision"
- Page 9 "Bayes's theorem" --> "Bayes' theorem"
- Page 9 "Naive Bayes need" --> "Naive Bayes needs"
- Page 9 "10 equal size" --> "10 equal sized"
- Page 9 "The 10 results from the folds can after be averaged" --> "The 10 results from the folds can be averaged"
- Page 9 "but by creating the same entity many times can result" --> "but creating the same entity many times can result"
- Page 10 "feature vectors in separation 7.1.1." --> "feature vectors in separation in section 7.1.1."
- Page 10 "we learned" --> "we train"
- Page 10 "we considered" --> "we consider"
- Page 10 "we trained" --> "we train"
- Page 12 "as we wanted to measure" --> "as we want to measure"
- Page 12 (2x) "attributes from all feature" --> "attributes from all features"
- Page 12 "computational complexity of algorithms especially when" --> "computational complexity of algorithms. Especially when"
- Page 12 "One of the biggest challenge" --> "One of the biggest challenges"
- Page 12 "In [26] is given an overview" --> "In [26], an overview ... is given"
- Page 12 "Although BR have" --> "Although BR has"
- Page 13 "are complementary one each other" --> "are complementary to each other"
- Page 13 "we also applied" --> "we also apply"
- Page 13 "performed better" --> "perform better"
- Page 13 "f-measure" OR "F-measure" but not both
- Page 13 "... while for the best results for the harmonic mean between precision and recall are achieved..." --> bad grammar, please rephrase.
- Page 13 "feature vector in input" --> "feature vector used as input"
- Page 13 "taking in input a combination of features" --> "taking a combination of features as input"
- Page 13 "LCN and LPN binary vector" --> "LCN and LPN binary vectors"
- Page 14 "input for Naive Bayes on no sampling data P = 0.42, R = 0.48 and F = 0.45." --> "input for Naive Bayes with no sampling (P = 0.42, R = 0.48 and F = 0.45)."
- Page 16 "In addition, for example the http://mcu.es/ dataset the Ministry of Culture in Spain" --> "In addition, for example the http://mcu.es/ dataset of the Ministry of Culture in Spain"
- Page 18 "topics in entity-relationship" --> "topics in entity relationships"
- Page 19 "with similar topic which are assumed to be good candidate for" --> "with similar topics which are assumed to be good candidates for"
=== Paper References
- The authors fixed some of the reference problems. However, there are still 3 references left which are listed twice: [14]=[38], [25]=[30], [29]=[47].
=== Comments (the authors may take them into account but don’t have to)
- The word “data” is a mass noun (the same holds for "metadata"). These nouns are typically not used in the plural (https://en.oxforddictionaries.com/grammar/countable-nouns). Therefore, I would suggest changing the following formulations (although they are not mistakes):
- Page 2 "metadata that describe" --> "metadata that describes"
- Page 2 "metadata are completely missing" --> "metadata is completely missing"
- Page 8 "all data are" --> "all data is"
- Page 9 "emphasizing those particular data" --> "emphasizing this particular data"
- On Page 4, the authors write "It helps organizations understand strengths and weakness". I assume that the authors are describing strengths and weaknesses of systems/approaches. However, it would be easier to understand if this were stated explicitly, e.g., "strengths and weaknesses of solutions".
- On Page 5, the authors start a paragraph with "In one hand". Maybe I am simply not aware of this formulation being used, but I assume that the authors want to use the common idiom "On the one hand".
- On Page 7, the authors formulate "and filtered out all values shorter than 3 characters or longer than 25 characters". Using "tokens" instead of "values" would make the sentence easier to understand.
- On Pages 10 and 12, the authors use the formulation "We first report". I am not sure whether this is correct and would suggest rephrasing it to something like "First, we report".
- In the tables, the feature vectors are called "Curi" and "Puri" while in the text they are written "CURI" and "PURI".