Review Comment:
The paper discusses the task of assigning topical labels to RDF datasets. The authors address both variants of this task: single-label and multi-label classification. In Section 1, the authors motivate the task before describing, in Section 2, the LOD cloud dataset they rely on. Section 3 describes the development of their multi-topic benchmark. Section 4 describes the different feature vectors, classification algorithms, sampling techniques and normalization techniques the benchmarked approaches rely on. In Section 5, the approaches are benchmarked with both available benchmarks: the already existing single-topic benchmark and the newly developed multi-topic benchmark. In Section 6, the authors discuss the results of Section 5. Section 7 presents related work and Section 8 summarizes the paper.
=== Positive aspects
+ This task is very important when working with a growing LOD cloud.
+ The creation of a benchmark is a very important step for this research field, since it a) eases the comparison of approaches and b) lowers the entry barrier for other researchers (they do not have to invest a lot of time into creating their own ground truth).
+ The novelty of the work is good since there does not seem to be another benchmark like this.
+ The usage of the majority class classification as baseline is a good choice.
=== Major issues
- The paper seems to be an extension of [45], but the authors tried to change the focus of the paper to fit the special issue. While [45] describes the classification approaches and their evaluation, the authors try to focus this extended version on the benchmark dataset they created for the multi-topic classification task. However, the authors have not applied this strategy consistently throughout the paper and created a paper that "pretends" to describe a benchmark but still focuses heavily on the approaches and their evaluation. The following points explain why I came to this conclusion.
- Section 4 is called "Benchmark Settings" but comprises "4.1. Feature Vectors", "4.2. Classification Approaches", "4.3. Sampling techniques" and "4.4. Normalization techniques", which describe the benchmarked approaches but are not part of the benchmark. The main components of the benchmark should be a) the set of RDF datasets, b) the ground truth (i.e., the topical labels for the single datasets) and c) a metric to measure the success of a benchmarked system based on its output. For me, only 4.1 is related to the benchmark itself, since it is interesting to see the statistics about datasets that have values for the single features. The other parts of Section 4 are clearly parts of a system that tries to tackle the task. For example, it is up to the system whether it uses sampling or normalization techniques; these choices have no influence on the benchmark itself.
- In the introduction, the authors state that they want to "[...] discuss the choke points which influence the performance of [multi-topic profiling] systems". Although they find some interesting choke points, they could present them in a better way. For example, the bbc.co.uk/music example discussed in Section 6.2 shows an important choke point: vocabularies that are only used in a single dataset. Why do the authors not combine that with the statistics they already have about the datasets? They could report, for each vocabulary, how many of the benchmark datasets use it (see the sketch after this list). Another choke point could be the size of the datasets, for which the authors provide a statistic for the complete crawl but not for the datasets that are part of the benchmark.
- Another step to transform this paper more into a benchmark paper (and move it away from simply benchmarking the approaches published in [45]) would be the benchmarking of other approaches. I am aware of the problem that some approaches are very specialized or might not be available as open-source programs. However, [R1] presents a simple "topical aspect" for their search engine that could be used as a similarity measure for two datasets. Based on that, an easy baseline could be defined that was not part of [45].
- Another hint that the paper still focuses mainly on the approaches instead of the benchmark can be found in the description of the related work. The authors compare nearly all other approaches with their own approaches from Section 4 regarding the features they use. If the paper focused on the benchmark itself, the authors might instead have compared their benchmark dataset with the data used for the evaluation of the other approaches, e.g., the data from [8], or explained why they chose precision, recall and F-measure instead of the normalised discounted cumulative gain used in [8].
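To make the choke-point suggestion above more concrete, the following is a minimal sketch of the vocabulary statistic I have in mind. The input format (a mapping from benchmark datasets to the vocabularies they use) and all names in it are my own assumptions, not taken from the paper; the real data would come from the authors' VOC feature extraction.

    from collections import Counter

    # Hypothetical input: for each benchmark dataset, the set of vocabularies it uses.
    dataset_vocabularies = {
        "bbc.co.uk/music": {"http://purl.org/ontology/mo/", "http://xmlns.com/foaf/0.1/"},
        "example.org/geo": {"http://www.w3.org/2003/01/geo/wgs84_pos#", "http://xmlns.com/foaf/0.1/"},
        # ... remaining benchmark datasets ...
    }

    # Count in how many datasets each vocabulary occurs.
    usage = Counter(v for vocabs in dataset_vocabularies.values() for v in vocabs)

    # Vocabularies occurring in only one dataset are a likely choke point,
    # since they cannot help a classifier generalize to unseen datasets.
    singletons = [v for v, n in usage.items() if n == 1]
    print(f"{len(singletons)} of {len(usage)} vocabularies are used by a single dataset only")

A table with this distribution (and the analogous dataset-size statistic restricted to the benchmark datasets) would make the discussed choke points much more tangible.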
Together with the long list of minor issues, the mistakes in writing and the problems in the references, I think this paper needs a major revision before it can be published in the Semantic Web journal.
=== Minor issues
- In the abstract, the authors state "it has been shown, that in most cases, a single topical label for one datasets does not reflect the variety of topics covered by the contained content". However, the authors neither prove this statement nor cite a source for it.
- The definitions on page 4 are confusing. Why is a topic T a set of labels {l_1, ..., l_k} when the single-topic classification chooses a single label l_j from the set of labels {l_1, ..., l_p}? Why is L defined in Definition 3 and not in Definition 1, so that it could have been reused in the other definitions? It also looks like l_k plays two different roles in Definitions 1 and 3, which should be avoided (see the notation sketch after this list).
- The description of the gold standard presented in [35] is wrong (pages 5 and 17). The authors state that it is a gold standard for multi-topic classification. This is wrong because the gold standard from [35] has been created for finding topically similar RDF datasets and does not contain any topical labels or classifications.
- On page 6, the authors write "rdfs:classes and owl:classes". Shouldn't this be rdfs:Class and owl:Class?
- On page 7, the authors write "... as described in the VOC feature vector there are 1 453 different vocabularies. From 1 438 vocabularies in LOD, ..." Why are these two numbers different?
- Footnote 13 on page 7 is not helpful, since the reader does not know when the authors executed their experiments.
- The description of "overfitting" on page 7 is wrong.
- On page 9, the authors state that "Classification models based on the attributes of the LAB feature vector perform on average (without sampling) around 20% above the majority baseline, but predict still in half of all cases the wrong category". Taking Table 3 into account, this sentence seems to be wrong: if LAB achieves 51.85% + 20 percentage points (71.85%) or 51.85% * 1.2 (about 62.2%), the resulting accuracy is in either case well above 50%, i.e., clearly better than predicting the wrong category in half of all cases.
- On page 15, the authors cite [34] but I assume that they wanted to cite [35] because [34] does not fit in there.
- On page 17, the authors start a paragraph with "Some approaches propose to model the documents ...". As a reader who is not familiar with [35], it is hard to understand what a "document" is, since it has not been defined before. While I understand why the authors cite [32], it is not clear to me why they cite [33] and [34]. Neither Pachinko Allocation [33] nor Probabilistic Latent Semantic Analysis [34] is related to their work or the related work they describe in this paragraph.
- On page 17, the authors state that "approaches that use LDA are very challenging to adapt in cases when a dataset has many topics". The authors neither substantiate this claim nor cite a publication that does.
- On page 17, the authors write "These approaches are very hard to be applied in LOD datasets because of the lack of the description in natural language of the content of the dataset" when discussing the application of LDA. However, they contradict this statement by citing [35], which is not bound to natural-language descriptions of datasets (LAB or COM) but can also make use of LPN or CPN.
- At the end of Section 7, they briefly repeat the description of [35], but with a faulty reference to [34].
- The authors state that the benchmark will be made publicly available. However, I could not find a link to the benchmark in the paper (there is only a link to the LOD cloud data crawled with LDSpider). Since this is not a blind submission, I do not see a reason why the authors would not allow the reviewers to have a look at the benchmark itself.
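As a constructive suggestion for the definition issue above (page 4), the notation could be made consistent along the following lines. This is only a sketch of one possible formulation based on my reading of the definitions, not the authors' wording:

    \begin{itemize}
      \item \textbf{Definition 1 (label set).} $L = \{\,l_1, \dots, l_p\,\}$ is the fixed set of topical labels.
      \item \textbf{Definition 2 (single-topic classification).} Assign to a dataset $d$ exactly one label $l_j \in L$.
      \item \textbf{Definition 3 (multi-topic classification).} Assign to a dataset $d$ a topic $T \subseteq L$, $T \neq \emptyset$;
            writing $T = \{\,l_{i_1}, \dots, l_{i_k}\,\}$ makes explicit that $k = |T|$ and avoids reusing $l_k$ in two different roles.
    \end{itemize}

Introducing L once in Definition 1 and reusing it afterwards would resolve all three points raised above.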
=== Writing Style
The paper has a high number of grammatical errors and typos, making some parts of the paper hard to read. In the following, I list some of the errors (I gave up collecting all of them on page 7). However, it is not sufficient to fix only the errors listed here. A check of the complete paper (maybe by somebody who is not one of the authors) is highly recommended.
- Page 1 "for one datasets" --> "for one dataset"
- Page 6 "We extracted ten feature vectors because want to" --> "... because we want ..."
- Page 6 "We lowercase all values and tokenize them at space characters and filtered out all values shorter than 3 characters and longer that 25 characters" --> "We lowercase all values, tokenize them at space characters and filtered out all values shorter than 3 characters or longer that 25 characters"
- Page 6 "This because"
- Page 6 "In the LOV website, there exist 581 different vocabularies." In this sentence, "in" seems to be the wrong preposition. There are a lot of discussions, whether "on" or "at" are correct when talking about things that can be found on (or at) a website (e.g., https://english.stackexchange.com/questions/8226/on-website-or-at-website).
- Page 7 "Among different metadata, it is also given the description in natural language for each vocabulary." --> "Among different metadata, the description in natural language for each vocabulary is given."
- Page 7 "581^13" --> a footnote shouldn't be added to a number. Otherwise the number of the footnote can be confusing.
- Page 7 "While in LOD as described in the VOC feature vector there are 1 453 different vocabularies" ?
- The paper shows some minor formatting problems that need to be fixed before publishing it.
- Several words are written into the margin (i.e., hyphenation rules should have been applied). This can be seen on pages 3, 4, 6, 9 and 16.
- Tables 3 and 8 are too wide.
- While two feature sets are called PURI and CURI in the text, they are called "PUri" and "CUri" in the tables.
=== Paper References
- The paper has 45 references. However, it seems the authors do not have a good strategy for handling these references, because several references are listed twice ([9] = [34], [21] = [26], [24] = [35], [25] = [44], [36] = [38]) and [16] simply seems to be a newer version of [17].
- [28] has a character encoding problem in the title.
- [30] has only authors and title. It is missing additional data, e.g., the conference, publisher or year. At least I couldn't find it with the given information.
- In [45], the title does not seem to have been entered correctly, since it is not formatted as a title.
=== Comments
- It might be better to use the F-measure for the inter-rater agreement, see [R2]; a minimal sketch of how this could be computed is given below.
- On page 5, "but this work was done before" should be replaced with "but our work was done before", since "this" could refer to different papers.
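To illustrate the [R2] suggestion, here is a minimal sketch of a pairwise F-measure between the label sets of two annotators for the same dataset; the function and the example labels are my own, not taken from the paper:

    # Treat one annotator's labels as "gold" and the other's as "predicted".
    # The resulting F-measure is symmetric in the two annotators, so the choice does not matter.
    def pairwise_f_measure(labels_a, labels_b):
        labels_a, labels_b = set(labels_a), set(labels_b)
        if not labels_a and not labels_b:
            return 1.0  # both annotators assigned nothing: treat as full agreement
        overlap = len(labels_a & labels_b)
        if overlap == 0:
            return 0.0
        precision = overlap / len(labels_b)
        recall = overlap / len(labels_a)
        return 2 * precision * recall / (precision + recall)

    # Example: two annotators labelling the same dataset
    print(pairwise_f_measure({"media", "geography"}, {"media", "life_sciences"}))  # 0.5

Averaging this value over all datasets annotated by both raters would give one possible agreement score for the multi-label setting, in the spirit of [R2].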
[R1] Kunze, S., Auer, S.: "Dataset retrieval". IEEE Seventh International Conference on Semantic Computing (ICSC), 2013.
[R2] George Hripcsak and Adam S Rothschild: "Agreement, the f-measure, and reliability in information retrieval". Journal of the American Medical Informatics Association, 12(3):296–298, 2005.