Review Comment:
I have read a previously submitted version of this article. The breadth of the work surveyed was impressive but I had a rather negative assessment of it. The new version is much better. The updates in the taxonomy, the removal of the authors’ RDF vocabulary work, the extra explanations in e.g. sections 4 and 6, are helpful.
There are some points that must be fixed still, see below. I would urge the senior authors of the paper to triple-check it before resubmission, as some of the comments (and the most notable ones) apply to the parts that have been most heavily re-worked, thus raising some doubts about the seriousness of the writing process. As a small but revealing example, references [5] and [6] are identical. How come?!?
It is only because I have seen how the authors have changed their paper after previous comments that I am ready to trust they can make these final enhancements and produce an acceptable survey paper.
Notable problems:
- intro: “the authors are aware that domain-specific approaches to profile and annotate datasets exist. However, to ensure high relevance and applicability, this survey addresses exclusively cross-domain approaches, which are agnostic to the domain of the profiled data.”.
I agree with the choice of narrowing the scope of the survey. This answer is generally appropriate to my earlier comment. I am still very surprised that for a survey paper (especially one that has reviewed so many references) no example is given. Readers would surely benefit from a couple of examples, to get the opportunity to realize the difference between what is in focus for this paper and what is not, when the objects seem similar.
- the new section on methodology is useful. However it lacks the listing of workshop papers. This is quite surprising!
Even more importantly it shows a new problem, especially for a journal like SWJ. The paper includes 86 references: removing [6] (see above) and [66] (a general reference) leads to 84, so the article’s bibliography cannot contain all the references (85) the authors claim to have used for the survey. I’m willing to accept that the authors have found references that are not necessarily useful to report in the article, but there should at least be an online annex that gives them all. The list of keywords used to find them could also be good to see, for further assessing the methodology.
- In 4, there is a mismatch between the sections and the main feature categories. In fact this section keeps the structure of the previous version of the paper, without sections corresponding to provenance, licensing and links. And it still uses the old ‘semantic features’ and ‘temporal features’ terms, which have been replaced by ‘general features’ and ‘dynamics features’ in the new version!
- in the intro of 5, “general-purpose vocabularies such as Dublin Core often provide useful terms also for dataset-specific metadata, but are not discussed in detail here to ensure sufficient focus on vocabularies of more particular relevance for RDF dataset profiling”. I am sorry but I can’t buy the argument. And in fact the authors don’t even buy it, it seems: they end up describing DC in 5.6. And fig 3 mentions DC in the ‘General’ category. So please refer to DC as early as in 5.1. If only because DCAT is partly built on it. DC is also worth being mentioned in licensing (dct:License, dc:rights, etc.). And make sure that figure 3 is generally aligned with the content of section 5.
- EDOAL has been added in 5.2, which is good. The analysis is less good though. In fact this is a wrong sentence: “the typical use case for generating EDOAL statements is the manual formalisation of mapping statements, while less expressive SKOS and VoL statements can be at least partially generated from the output of automated linking and mapping algorithms.” EDOAL has been created in the context of the community behind, whose purpose is to evaluate and compare automatic alignment tools. And SKOS happens to be used to represent on the Semantic Web controlled vocabularies that have most often been built manually (even though it can also be used well for representing the result of automatic alignments).
- daQ has a paper reference, not just a URI. The reference given for DQV made me laugh a bit. I’m really not sure how the authors actually found a working draft from 2015. New versions of DQV have been published until December 2016 ( It is no surprise that I have big doubts about the analysis of DQV made by the authors, which is not substantiated anyway (‘several concerns about practical issues are raised as part of the DQV working draft documentation.’ - which concerns were they?)
- in general section 5 should really be rationalized wrt. space given to the explanations of the various vocabularies. For example in 5.6 I don’t understand why vocabularies coming from other domains like FOAF and SIOC are given as much (or more!) space together than PROV-O. FOAF and SIOC happen to have some elements relevant for provenance, while PROV-O is a quite complex vocabulary which is exclusively devoted to representing provenance facts.
- in 5.6 I still don’t understand why there is such a long introduction. If the authors want to define what provenance is, this should be done in another, earlier section.
- the authors have added a note on 5.8 trying to motivate that the statistics on general-purpose vocabularies may bring useful insight even if the statistics are on the parts specific to datasets that are used. I agree. However, the problem that I had tried to explain in my earlier review is not “we were unable to filter the instances of dataset profiling-specific terms from our suggested vocabularies while examining their usage statistics in LOD2”. My wish was rather for filtering on LOD2stats datasets, not vocabularies: I would have liked statistics on datasets that describe datasets (i.e. data catalogues) rather than global statistics for any dataset. In fact I’m worried that LOD2stats has few if any datasets that are about datasets, which would undermine the usefulness of the study.
- in 6.6 I disagree with “Whereas some applications rely on the existing metadata, many applications choose generating dataset profile features as a part of their own processing pipelines. This can be attributed to missing dataset profile features in many cases.” Many applications listed in section 6 (especially the data quality assessment tools) are designed to generate dataset profile features. It is their goal. So even if there were pre-existing profile data, they would still compute profile features again!
The conclusion has a similar sentence that should be removed.
- in 6.6. “we think that availability of dataset profiles including a wide range of features can potentially facilitate a new generation of applications in the distributed LOD settings”. This sentence is not really substantiated. Yes, one can say that more data will lead to more applications, but that’s not really groundbreaking, when no idea of these new kinds of applications is given. In the conclusion, “This leads us to a conclusion that a-priori availability of dataset profiles could facilitate a broader use of profiles and datasets in a variety of application domains.” is not very impressive either, especially when “this” (the previous sentence) is very debatable (see previous comment).
Smaller comments:
- p3: ‘adopted methodology to’ -> ‘methodology adopted to’
- section 3.3 really needs a reference for the many notions introduced there. There was at least one in the previous paper which seems appropriate ([6]). Or was it not?
- I don’t understand what makes licensing, provenance and links ‘orthogonal’, even though I’m willing to accept they can be separate (and recommended this, at least for licensing). What does ‘orthogonal in the distribution of profiles’ mean?
- the figures in table 2 should be given a date
- footnote 49 ( is not an appropriate reference for the Creative Commons licenses. Please use something else, e.g. And please give a specific reference for CCrel ( or something like this) as it’s a different vocabulary.
- please make sure that the figures given for 5.8 reflect the latest update (Jan 17). Right now we don’t know.
- I’m still unsure why one needs so many references in the second paragraph of 6.5.