Topic profiling benchmarks: issues and lessons learned

Tracking #: 1809-3022

Blerina Spahiu
Andrea Maurino
Robert Meusel

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Topical profiling of the datasets contained in the Linking Open Data cloud diagram (LOD cloud) has been of interest for a longer time. Different automatic classification approaches have been presented, in order to overcome the manual task of assigning topics for each and every individual new dataset. Although the quality of those automated approaches is comparably sufficient, it has been shown, that in most cases, a single topical label for one dataset is not sufficient to understand the content of a dataset. Therefore, within the following study, we present a machine-learning based approach in order to assign a single, as well as multiple topics for one LOD dataset and evaluate the results. As part of this work, we present the first multi-topic classification benchmark for the LOD cloud, which is freely accessible and discuss the challenges and obstacles which needs to be addressed when building such benchmark datasets.

Minor Revision

Solicited Reviews:
Review #1
By Antonis Koukourikos submitted on 25/Feb/2018
Minor Revision
Review Comment:

In the revised version, the authors generally responded sufficiently to the comments provided in the initial review.

Regarding the validity of the presented results with respect to the LOD Cloud version used, it is understandable that the effort for carrying out the experiments over an updated version would require an amount of time prohibitive for presenting the results in time.

However, given that the expansion of the LOD Cloud is known and significant, I would still like to see some informed/quantitative estimation on how the proposed method will scale.

There are some typos still present in the revised version; some further proofreading would be welcome (also, the title change proposed in another review is not visible in the version available to reviewers).

Review #2
By Nikolay Nikolov submitted on 27/Feb/2018
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The current version of the paper provides a positive improvement over the initial version in terms of focus and general quality of the work. The authors provide more details for the rationale of the decisions made, the evaluation criteria of such benchmarks in general, as well as more detail on how the work fits within that framework of requirements.

Formatting and structure:
The paper generally uses language and grammar at a better level, its structure and formatting have been improved, although there are still issues. I have further identified more minor grammatical/phrasing errors that I list below.

* Formatting issues:
(from the previous review; not addressed despite the authors saying that they were)
- Fig. 1 is not possible to read (both the legend and the labels of the nodes of the LOD graph). I suggest removing them and leaving the coloured graph nodes, which are referred to in the text
- Fig. 3 is very difficult to read - I suggest making this picture larger - the labels are an eye test
- Page 14 - last part of a sentence from the previous page is not formatted properly
- Page 4 - text goes outside of the column margin

* Grammatical/phrasing errors:
The use of commas with relative clauses is inconsistent throughout the paper. Other issues with commas include their usage with connecting words (thus, therefore, etc.). The use of articles ("a" vs. "the" vs. none) is also inconsistent throughout the paper. I suggest that the authors have the paper thoroughly reviewed by a native speaker for the camera-ready version if the paper gets accepted. Apart from the aforementioned issues, below is a list of others that I identified while reviewing:

(from the previous review; not addressed despite the authors saying that they were)
(Section 1) "Up till now, topical categories were" - should be "Up until now, topical categories have been"
(Section 2) in the descriptions of each of the topical categories, there should be a full article in front of the name of each category - e.g., "THE government category contains Linked Data published by [...]"
(Sub-section 4.1) "if people in a dataset are annotated with foaf:knows statements or if her professional affiliation is provided" - should be "their" instead of "her"
(Section 8) "The multitopic benchmark is heavy imbalanced" should be "The multitopic benchmark is heaviLY imbalanced"

(newly identified errors)
(Abstract) "and obstacles which needs" => "and obstacles, which need"
(Section 1) "The adoption of Linked Data over the last fey years have" => "The adoption of Linked Data over the last few years has" (the adoption ... has)
(Section 1) "Especially when the dataset do not" => "Especially when the dataset does not"
(Section 1) "Managing large and rapidly increasing" => "Managing the large and rapidly increasing"
(Section 1) "The high volume of data demands data consumers to develop" => "The high volume of data demands that data consumers develop"
(Section 3) "characteristics of the benchmarc is to be easy" => "characteristics of a good benchmark is to make it easy"
(Section 3) "covers all the topics that already exist" => "covers only topics that already exist" (I don't imagine the benchmark covers ALL topics in the LOD cloud, so I am guessing that the authors mean ONLY topics within the LOD cloud instead)
(Section 4) "In one hand" => "On the one hand"
(Section 4) "existent" => "existing"
(Section 5) "We first report the results of the experiments for single-topic classification algorithms as in [48] to which extent" - there seems to be a missing part of the sentence after the citation
(Section 6.1) "All values were lowercased, tokenized at space characters and filtered out all values shorter than 3 characters or longer than 25 characters" => "All values were lowercased and tokenized at space characters, after which all values shorter than 3 characters or longer than 25 characters were filtered out"
(Section 6.1) "In difference from" => "In contrast to"
(Section 6.2) "tons of alternative" is too informal => "a large number of alternatives"
(Section 6.2) "Classification and Regression Trees and ID3 and C4.5" => "Classification Trees, Regression Trees, ID3, and C4.5"
(Section 7.1.1) "eight categories described in 5" => "eight categories described in Section 5"
(Section 7.2.1) "In [26] is given" => "[26] gives"/"[26] provides"
(Section 8.2) "RDFS (781), OWL (134) and PURL (14) times" - "times" should be removed
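
For reference, the preprocessing described in the Section 6.1 correction above can be sketched as follows. This is only a minimal illustration of the rephrased sentence, not the authors' implementation: the function name `tokenize_values` and its parameters are hypothetical, and the sketch assumes that the length filter applies to the individual tokens produced by the split.

```python
def tokenize_values(values, min_len=3, max_len=25):
    """Lowercase each value, split it at space characters, and keep
    only tokens whose length is between min_len and max_len (inclusive)."""
    tokens = []
    for value in values:
        # Lowercase first, then tokenize at space characters.
        for token in value.lower().split(" "):
            # Drop tokens shorter than 3 or longer than 25 characters
            # (this also discards empty strings from repeated spaces).
            if min_len <= len(token) <= max_len:
                tokens.append(token)
    return tokens


# Hypothetical input; "of" and "in" are dropped for being too short.
print(tokenize_values(["The Ministry of Culture in Spain"]))
# ['the', 'ministry', 'culture', 'spain']
```

Reading the filter as operating on tokens rather than on whole values is an assumption; the manuscript should state which of the two is meant.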

(from the previous review; not addressed despite the authors saying that they were)
* In my opinion, since the major contribution of the paper is about topic profiling benchmarks specifically in LOD (other domains are discussed in related work), the authors should consider changing the title accordingly - e.g., "Topic profiling benchmarks in the Linked Open Data Cloud: issues and lessons learned"

Overall, I think the paper has been appropriately edited, for the most part in accordance with the comments I provided in the first review. I urge the authors to THOROUGHLY address the issues identified by myself and other reviewers. I still think that the work is important and highly relevant to the topic of the journal and should be accepted with the minor adjustments I list above.

Review #3
By Michael Röder submitted on 01/Mar/2018
Minor Revision
Review Comment:

The paper discusses the task of assigning topical labels to RDF datasets. The authors look at both types of this task - the single- and the multi-label classification task. In Section 1, the authors motivate the task before they define the term "topic" and describe the LOD cloud dataset the paper is focusing on, in Section 2. Section 3 describes requirements for the benchmark. Section 4 describes the way of creating the benchmark while Section 5 gives detailed insights into the data the benchmark relies on. Section 6 describes the different feature vectors, classification algorithms, sampling techniques and normalization techniques the benchmarked approaches are relying on. In Section 7, the approaches are benchmarked with both available benchmarks - the already existing single-topic benchmark and the newly developed multi-topic benchmark. In Section 8, the authors discuss the results of Section 5. Section 9 presents related work and Section 10 summarizes the paper.

I was happy to see that the quality of the paper has been improved. Most issues have been addressed by the authors, either by improving the paper or by giving convincing arguments why this is not possible. However, I see some minor issues that are still left. Together with the writing style problems and the issues in the references, I cannot accept the paper in its current form.

=== Minor issues
- On Page 5, the authors describe the requirements for a benchmark. They describe "Scalability" with "The benchmark should be scalable and not have bias towards a specific technique". I am not convinced that being free of a bias is related to scalability. Note that [44] mentions being free of a bias for portability. After that, the authors describe the requirement "Solvability" with "Running the benchmark and measuring its performance is not difficult". From my point of view, this does not follow the definition in [44] and should be part of Clarity. If I misinterpret that sentence the authors should make it clearer.
- On Page 13, the authors describe that Table 7 "shows the results of ALL feature vector and the combination of CURI, PURI, LCN and LPN". However, the table shows ALL and two pairs, namely PURI&CURI and LPN&LCN.
- On Page 19, there still is this strange short paragraph at the end of the related work section which does not make any sense to me. It refers to [38] and claims that it "uses LDA for the topical extraction of RDF datasets". However, the reference describes "Probabilistic latent semantic indexing" and does not do that. In the response the authors wrote that it has been fixed, but it is still wrong and should be removed from my point of view.

=== Writing Style
The paper still has a high number of grammatical errors and typos. I think that it has been improved compared to the first version. However, these errors should be avoided.
Again, I would suggest that the paper should be proofread by somebody with high English skills since my English skills are limited and I cannot guarantee that my list below is complete.
From my point of view, the paper has three major grammar issues:
1. In general, the authors should stick to a single tense. The major part of the paper is written in simple present but some parts are written in past.
2. When writing in simple present, the authors forget to attach the ‘s’ when using the third person (singular).
3. Sometimes, the authors mix up singular and plural.

- Page 2 "Especially when the dataset do not" --> either "datasets do not" or "dataset does not"
- Page 4 "strengths and weakness" --> "strengths and weaknesses"
- Page 4 "benchmarks results" --> "benchmark results"
- Page 4 "required library" --> "required libraries"
- Page 6 "we consider as a gold standard" --> "we consider it as a gold standard"
- Page 6 "features extraction" --> "feature extraction"
- Page 7 "because want verify" --> "because we want to verify"
- Page 7 "We then summed up" --> "We then sum up"
- Page 7 "we extracted all" --> "we extract all"
- Page 7 (2x) "and generated" --> "and generate"
- Page 7 "we collected" --> "we collect"
- Page 7 (3x) "extracted" --> "extract"
- Page 7 "We used" --> "We use"
- Page 7 "lower-cased" AND "lowercased" --> "lowercase"
- Page 7 (2x) "tokenized" --> "tokenize"
- Page 7 "calculated" --> "calculate"
- Page 7 "This resulted" --> "This results"
- Page 8 "extracted" --> "extract"
- Page 8 "We were able" --> "We are able"
- Page 8 "Decision Trees are a powerful classification algorithms" --> "Decision Trees are a set of powerful classification algorithms" (or something similar)
- Page 8 "The decision tree is a tree with decision nodes which has two or more branches and leaf nodes that represents a classification or a decision" --> "The decision tree is a tree with decision nodes which have two or more branches and leaf nodes that represent a classification or a decision"
- Page 9 "Bayes's theorem" --> "Bayes' theorem"
- Page 9 "Naive Bayes need" --> "Naive Bayes needs"
- Page 9 "10 equal size" --> "10 equal sized"
- Page 9 "The 10 results from the folds can after be averaged" --> "The 10 results from the folds can be averaged"
- Page 9 "but by creating the same entity many times can result" --> "but creating the same entity many times can result"
- Page 10 "feature vectors in separation 7.1.1." --> "feature vectors in separation in section 7.1.1."
- Page 10 "we learned" --> "we train"
- Page 10 "we considered" --> "we consider"
- Page 10 "we trained" --> "we train"
- Page 12 "as we wanted to measure" --> "as we want to measure"
- Page 12 (2x) "attributes from all feature" --> "attributes from all features"
- Page 12 "computational complexity of algorithms especially when" --> "computational complexity of algorithms. Especially when"
- Page 12 "One of the biggest challenge" --> "One of the biggest challenges"
- Page 12 "In [26] is given an overview" --> "In [26], an overview ... is given"
- Page 12 "Although BR have" --> "Although BR has"
- Page 13 "are complementary one each other" --> "are complementary to each other"
- Page 13 "we also applied" --> "we also apply"
- Page 13 "performed better" --> "perform better"
- Page 13 "f-measure" OR "F-measure" but not both
- Page 13 "... while for the best results for the harmonic mean between precision and recall are achieved..." --> bad grammar, please rephrase.
- Page 13 "feature vector in input" --> "feature vector used as input"
- Page 13 "taking in input a combination of features" --> "taking a combination of features as input"
- Page 13 "LCN and LPN binary vector" --> "LCN and LPN binary vectors"
- Page 14 "input for Naive Bayes on no sampling data P = 0.42, R = 0.48 and F = 0.45." --> "input for Naive Bayes with no sampling (P = 0.42, R = 0.48 and F = 0.45)."
- Page 16 "In addition, for example the dataset the Ministry of Culture in Spain" --> "In addition, for example the dataset of the Ministry of Culture in Spain"
- Page 18 "topics in entity-relationship" --> "topics in entity relationships"
- Page 19 "with similar topic which are assumed to be good candidate for" --> "with similar topics which are assumed to be good candidates for"
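
As a side note on the Page 14 item above, the quoted values are consistent with F being the harmonic mean of precision and recall, as the Page 13 item describes it. A minimal sketch of that check, assuming the balanced F-measure (the function name `f_measure` is hypothetical and not taken from the paper):

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted for Naive Bayes with no sampling: P = 0.42, R = 0.48.
print(round(f_measure(0.42, 0.48), 2))  # 0.45
```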

=== Paper References
- The authors fixed some of the reference problems. However, there are still 3 references left which are listed twice: [14]=[38], [25]=[30], [29]=[47].

=== Comments (the authors may take them into account but don’t have to)
- The word “data” is a mass noun (the same holds for "metadata"). These nouns are typically not used as plurals. Therefore, I would suggest changing the following formulations (although this is not a mistake):
- Page 2 "metadata that describe" --> "metadata that describes"
- Page 2 "metadata are completely missing" --> "metadata is completely missing"
- Page 8 "all data are" --> "all data is"
- Page 9 "emphasizing those particular data" --> "emphasizing this particular data"

- On Page 4, the authors write "It helps organizations understand strengths and weakness". I assume that the authors are describing strengths and weaknesses of systems/approaches. However, it would be easier to understand if this were stated explicitly, e.g., "strengths and weaknesses of solutions".
- On Page 5, the authors start a paragraph with "In one hand". Maybe I am simply not aware of this formulation being used but I assume that the authors want to use the common idiom "On the one hand".
- On Page 7, the authors formulate "and filtered out all values shorter than 3 characters or longer than 25 characters". Using "tokens" instead of "values" would make the sentence easier to understand.
- On Pages 10 and 12, the authors use the formulation "We first report". I am not sure whether this is correct and would suggest rephrasing it to something like "First, we report".
- In the tables, the feature vectors are called "Curi" and "Puri" while in the text they are written "CURI" and "PURI".