RDF Dataset Profiling - a Survey of Features, Methods, Applications and Vocabularies

Tracking #: 1606-2818

Mohamed Ben Ellefi
Zohra Bellahsene
John Breslin
Elena Demidova
Stefan Dietze
Julian Szymanski
Konstantin Todorov

Responsible editor: 
Lora Aroyo

Submission type: 
Survey Article
The Web of Data, and in particular Linked Data, has seen tremendous growth over the past years. However, reuse and take-up of these rich data sources is often limited and focused on a few well-known and established RDF datasets. This can be partially attributed to the lack of reliable and up-to-date information about the characteristics of available datasets. While RDF datasets vary heavily with respect to the features related to quality, coverage, dynamics and currency, reliable information about such features is essential to enable dataset discovery in tasks such as entity linking, distributed query, search or question answering. Even though there exists a wealth of works contributing to the problem of dataset profiling in general, these works are spread across a wide range of communities. In this survey, we provide a first comprehensive survey of the RDF dataset profile features, methods, tools and vocabularies. We organize these building blocks of dataset profiling in a taxonomy and emphasize the links between the dataset profiling and feature extraction approaches and several application domains. The survey is aimed towards data practitioners, data providers and scientists, spanning a large range of communities and drawing from different fields such as dataset profiling, assessment, summarization and characterization. Ultimately, this work is intended to facilitate the reader to identify and locate the relevant features for building a dataset profile for intended applications together with the tools capable of extracting these features from the data.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 13/Apr/2017
Minor Revision
Review Comment:

The revision addresses many of the previous comments. For this survey article the body of knowledge is more or less given now, and the presentation and motivation have improved. No doubt, one could identify many items to improve, but I believe that the text as such could be accepted.
There are still a few minor textual aspects to repair, e.g. lines running over the margin, line spacing at the beginning of 3.7.

Review #2
Anonymous submitted on 03/May/2017
Minor Revision
Review Comment:

I have read a previously submitted version of this article. The breadth of the work surveyed was impressive but I had a rather negative assessment of it. The new version is much better. The updates in the taxonomy, the removal of the authors’ RDF vocabulary work, the extra explanations in e.g. sections 4 and 6, are helpful.
There are some points that must be fixed still, see below. I would urge the senior authors of the paper to triple-check it before resubmission, as some of the comments (and the most notable ones) apply to the parts that have been most heavily re-worked, thus raising some doubts about the seriousness of the writing process. As a small but revealing example, references [5] and [6] are identical. How come?!?
It is only because I have seen how the authors have changed their paper after previous comments that I am ready to trust they can make these final enhancements and produce an acceptable survey paper.

Notable problems:

- intro: “the authors are aware that domain-specific approaches to profile and annotate datasets exist. However, to ensure high relevance and applicability, this survey addresses exclusively cross-domain approaches, which are agnostic to the domain of the profiled data.”
I agree with the choice of narrowing the scope of the survey. This answer is generally appropriate to my earlier comment. I am still very surprised that for a survey paper (especially one that has reviewed so many references) no example is given. Readers would surely benefit from a couple of examples, to get the opportunity to realize the difference between what is in focus for this paper and what is not, when the objects seem similar.

- the new section on methodology is useful. However it lacks the listing of workshop papers. This is quite surprising!
Even more importantly it shows a new problem, especially for a journal like SWJ. The paper includes 86 references: removing [6] (see above) and [66] (a general reference) leads to 84, so the article’s bibliography cannot contain all the references (85) the authors claim to have used for the survey. I’m willing to accept that the authors have found references that are not necessarily useful to report in the article, but there should at least be an online annex that gives them all. The list of keywords used to find them could also be good to see, for further assessing the methodology.

- In 4, there is a mismatch between the sections and the main feature categories. In fact this section keeps the structure of the previous version of the paper, without sections corresponding to provenance, licensing and links. And it still uses the old ‘semantic features’ and ‘temporal features’ terms, which have been replaced by ‘general features’ and ‘dynamics features’ in the new version!

- in the intro of 5, “general-purpose vocabularies such as Dublin Core often provide useful terms also for dataset-specific metadata, but are not discussed in detail here to ensure sufficient focus on vocabularies of more particular relevance for RDF dataset profiling”. I am sorry but I can’t buy the argument. And in fact the authors don’t even buy it, it seems: they end up describing DC in 5.6. And fig 3 mentions DC in the ‘General’ category. So please refer to DC as early as in 5.1, if only because DCAT is partly built on it. DC is also worth being mentioned in licensing (dct:License, dc:rights, etc.). And make sure that figure 3 is generally aligned with the content of section 5.

- EDOAL has been added in 5.2, which is good. The analysis is less good though. In fact this is a wrong sentence: “the typical use case for generating EDOAL statements is the manual formalisation of mapping statements, while less expressive SKOS and VoL statements can be at least partially generated from the output of automated linking and mapping algorithms.” EDOAL has been created in the context of the community behind OntologyMatching.org, whose purpose is to evaluate and compare automatic alignment tools. And SKOS happens to be used on the Semantic Web to represent controlled vocabularies that have most often been built manually (even though it can also be used well for representing the result of automatic alignments).

- daQ has a paper reference, not just a URI. The reference given for DQV made me laugh a bit. I’m really not sure how the authors actually found a working draft from 2015. New versions of DQV have been published until December 2016 (https://www.w3.org/TR/vocab-dqv/). It is no surprise that I have big doubts about the analysis of DQV made by the authors, which is not substantiated anyway (‘several concerns about practical issues are raised as part of the DQV working draft documentation.’ - which concerns were they?)

- in general section 5 should really be rationalized wrt. space given to the explanations of the various vocabularies. For example in 5.6 I don’t understand why vocabularies coming from other domains like FOAF and SIOC are given as much (or more!) space together than PROV-O. FOAF and SIOC happen to have some elements relevant for provenance, while PROV-O is a quite complex vocabulary which is exclusively devoted to representing provenance facts.

- in 5.6 I still don’t understand why there is such a long introduction. If the authors want to define what provenance is, this should be done in another, earlier section.

- the authors have added a note on 5.8 trying to motivate that the statistics on general-purpose vocabularies may bring useful insight even if the statistics are on the parts specific to datasets that are used. I agree. However, the problem that I had tried to explain in my earlier review is not “we were unable to filter the instances of dataset profiling-specific terms from our suggested vocabularies while examining their usage statistics in LOD2”. My wish was rather for filtering on LOD2stats datasets, not vocabularies: I would have liked statistics on datasets that describe datasets (i.e. data catalogues) rather than global statistics for any dataset. In fact I’m worried that LOD2stats has few if any datasets that are about datasets, which would undermine the usefulness of the study.

- in 6.6 I disagree with “Whereas some applications rely on the existing metadata, many applications choose generating dataset profile features as a part of their own processing pipelines. This can be attributed to missing dataset profile features in many cases.” Many applications listed in section 6 (especially the data quality assessment tools) are designed to generate dataset profile features. It is their goal. So even if there was profile data pre-existing, they would still compute profile features again!
The conclusion has a similar sentence that should be removed.

- in 6.6. “we think that availability of dataset profiles including a wide range of features can potentially facilitate a new generation of applications in the distributed LOD settings”. This sentence is not really substantiated. Yes, one can say that more data will lead to more applications, but that’s not really groundbreaking, when no idea of these new kinds of applications is given. In the conclusion, “This leads us to a conclusion that a-priori availability of dataset profiles could facilitate a broader use of profiles and datasets in a variety of application domains.” is not very impressive either, especially when “this” (the previous sentence) is very debatable (see previous comment).

Smaller comments:

- p3: ‘adopted methodology to’ -> ‘methodology adopted to’
- section 3.3 really needs a reference for the many notions introduced there. There was at least one in the previous paper which seems appropriate ([6]). Or was it not?
- I don’t understand what makes licensing, provenance and links ‘orthogonal’, even though I’m willing to accept they can be separate (and recommended this, at least for licensing). What does ‘orthogonal in the distribution of profiles’ mean?
- the figures in table 2 should be given a date
- footnote 49 (http://creativecommons.org/licenses/by/3.0/) is not an appropriate reference for the Creative Commons licenses. Please use something else, e.g. http://creativecommons.org/licenses/. And please give a specific reference for CCrel (https://wiki.creativecommons.org/wiki/CC_REL or something like this) as it’s a different vocabulary.
- please make sure that the figures given for 5.8 reflect the latest update (Jan 17). Right now we don’t know.
- I’m still unsure why one needs so many references in the second paragraph of 6.5.

Review #3
By Heiko Paulheim submitted on 04/May/2017
Minor Revision
Review Comment:

The authors have taken considerable efforts to rework this paper. The interpretations and conclusions of the results of the individual sections make the paper much more valuable.

There are, however, a few (mainly minor) issues I would like to see addressed.

Table 1 lists tools that are either available online or as open source. This indicates that the intersection is empty, i.e., there are no tools that are both available as open source as well as public endpoints. Is that really the case?

Table 3 depicts some interesting trends that require an interpretation. For about half of the vocabularies with non-zero values, the trends for triples and datasets are contrary, i.e., there is an increase in triples and a decrease in datasets, or vice versa. The authors should comment on that. PROV-O is a very drastic example, with the number of datasets increasing from 1 to 17, while the number of triples decreases from 4,537 to 577.

The analysis of provenance and licensing vocabularies is a bit oversold. In section 5.8, the authors mention this in a few sentences themselves: they do not distinguish between using the vocabulary for provenance/licensing vs. using the vocabulary for something else, hence, the quantitative evaluation depicted in table 3 is a bit shaky. Here, the authors should be more careful in discussing their method of measurement. Furthermore, refinements might be possible here. For example, Hogan et al. [1] discuss an approach for finding license information which goes beyond purely looking at vocabularies (although still a bit hacky). Likewise, in [2], we also applied some further filters that go beyond merely spotting vocabularies.

Apart from those issues, which I assume the authors can fix, the paper provides some valuable insights and therefore should be published.

[1] Aidan Hogan, Jürgen Umbrich, Andreas Harth, Richard Cyganiak, Axel Polleres and Stefan Decker. "An empirical survey of Linked Data conformance". In the Journal of Web Semantics 14: pp. 14–44, 2012.
[2] Schmachtenberg et al.: Adoption of the Linked Data Best Practices in Different Topical Domains, In: ISWC 2014