Dataset Profiling - a Guide to Features, Methods, Applications and Vocabularies

Tracking #: 1296-2508

Mohamed Ben Ellefi
Zohra Bellahsene
John Breslin
Elena Demidova
Stefan Dietze
Julian Szymanski
Konstantin Todorov

Responsible editor: 
Lora Aroyo

Submission type: 
Survey Article
The Web of data, in particular Linked Data, has seen tremendous growth over the past years. However, reuse and take-up are limited and focused on a few well-known and established knowledge bases. This can be attributed in part to the lack of reliable and up-to-date information about the characteristics of available datasets. While datasets vary heavily with respect to features related to quality, coverage, dynamics and currency, reliable information about such features is essential for enabling data and dataset discovery in tasks such as entity retrieval or distributed search. Even though there exists a wealth of works contributing to this central problem of dataset profiling, these are spread across a range of communities and disciplines. Here, we provide a first comprehensive survey of dataset profiling features, methods, tools and vocabularies and also provide an RDF vocabulary for unambiguously identifying dataset features.

Major Revision

Solicited Reviews:
Review #1
By Heiko Paulheim submitted on 17/Mar/2016
Major Revision
Review Comment:

The paper gives a comprehensive overview of works on profiling linked datasets. While the sheer number of works in this area collected and organized in the survey is impressive, the paper would benefit from presenting its findings and conclusions in a more concise manner.

First, the title is misleading, as it talks about "Dataset Profiling" in general, but the survey is limited to RDF/Linked Data. I suggest making the title more specific here (instead of increasing the scope of the survey to match the title).

The collection of features and profiling characteristics in section 2 is really comprehensive and interesting. It would be, however, even more concise if measurements for the different characteristics were introduced, where applicable. I would also expect that for some of the characteristics, different measurements are proposed in different works, so that they could be contrasted and discussed.

The section about tools is also quite impressive, and I was surprised that so many practical profiling tools exist. The authors might consider adding a summary table with the basic characteristics of the tools (open source/commercial, Web/standalone app, etc.).

While I appreciate the section about applications, it feels like this selection is a bit arbitrary. Applications that could be listed as well include, e.g., (natural text) question answering, information visualization, or semantic annotation, so it is not clear why the five examples shown in the paper have been chosen. Furthermore, it is not clear why the list of features presented for each application should be exhaustive.

As far as the listing of vocabularies is concerned, I would also like to see a summary table (similar to the tools section), showing the coverage of the vocabularies w.r.t. different groups of features. This would enable the readers to pick and/or combine vocabularies that fit their needs.

Finally, I miss some more conclusions. For a survey, it is always interesting to close with a judgement of the findings, and some outlook on a future research agenda. Relevant questions could be: what aspects are currently underrepresented in Linked Data profiling? Which combinations of features would be desirable, but are not supported by any tool? Is there a match or a mismatch between theory, tools, and the requirements of applications that exploit data profiles?

In summary, I appreciate the very comprehensive nature of the survey, but I would like to see the findings being presented more concisely in a revised version.

* At the beginning of section 2, it is mentioned that the notion of an "atomic feature" is introduced, but it is only described as a leaf in the hierarchy. It is unclear what those atomic features are in the end, why it is clear that they cannot be further subdivided, and why this is important.
* Please use a blank before a reference (i.e., "reference [1]", not "reference[1]").
* p.3: "set of RDF triples shared" - I would rather suspect that two datasets share entities, but it is unlikely that they share exactly the same triples about those entities.

Review #2
Anonymous submitted on 12/Apr/2016
Review Comment:

This paper presents an extensive state of the art on dataset profiling.

The state-of-the-art is impressive. Unfortunately there are too many problems. This paper is not really mature enough for a journal publication.

First, there is a lack of a precise framework defining the notions used. The paper actually does not define what a dataset is. This is not very strict, while catalogues and the main vocabularies in use, like DCAT, try to make crucial distinctions such as dataset vs distribution. “Descriptive metadata, i.e. profiles” on p1 is also disturbing. For many, profiles are not limited to what is called descriptive metadata (e.g., access metadata is not descriptive metadata). In fact, for several communities working with descriptive metadata, the notion of “application profile” could conflict with the quite different notion of “dataset profile”.

The state-of-the-art is very extensive, and constitutes a very useful resource for would-be readers. This is probably indeed the first time this is attempted at such a scale! But there are two problems with it:

1. Even though it is extensive, it is incomplete with respect to dataset profiling. The authors focus on gathering references for dataset (quality) analysis. This is good, but important efforts on creating frameworks for expressing profiles, especially by providing (or re-using) vocabularies, are not mentioned. These could have been compared with what the authors propose. DCAT and VoID are referred to, but the analysis is very cursory. More domain-specific efforts have been ignored: for example, the Health Care and Life Science community has researched a dataset profile ( ). In the geographic domain, the EU initiative GeoDCAT-AP should also be studied. The state-of-the-art on profiling and quality also misses references to relevant ISO standards: domain-specific ones, such as the geo standards that influence GeoDCAT-AP, or more general ones like the ISO 25000 family (esp. 25012).

2. There are inconsistencies in the way the references are used. First, in section 2 many references are given for different profile features. Most of them are data analysis papers. For a gathering of features, simple references to a vocabulary or inventory (e.g. a dataset catalogue) that exhibits the features would be enough to give a requirement for the topic. Actually, this would be more convincing than a piece of academic work, possibly very technical, which may fall short of giving a practical motivation for what it does (as the focus would be on an algorithm or an experiment). For example, DCAT has properties for representing the domain/topic of a dataset. There is no need to refer to [59], which extracts such topics. On the other hand, section 3, which gathers methods to extract profile features, doesn’t refer to the papers cited in section 2 that present such methods. This is quite a missed opportunity.
Furthermore, some of the references in section 2 seem superfluous for explaining what specific features are: I am not sure one would need both [6] and [35] ([6] looks more relevant) or both [52] and [53] or both [36] and [37].

In 2.2 there is a problem with the reference for the classification of quality characteristics, which is presented as coming partly from [97]. This is only an ‘under review’ paper, without a URL or publication context, so it’s unclear what the source is. As a matter of fact, the authors mentioned in [97] have published a very recent paper in this very journal, “Quality Assessment for Linked Data: A Survey: A Systematic Literature Review and Conceptual Framework”, so I’ve used this one. When I did the comparison, I found many differences, even some features that are classified in different dimensions, which I guess would be found in any recent work of the authors of [97]. For example, licensing is in “accessibility” in [97], but it is in Trust in the paper. It is possible that [97] (and other references like [92]) has shortcomings. But the choices made here are debatable (I would argue that licensing is orthogonal to data quality). And in any case, such deviations from the state of the art should be explained. They are currently not even flagged as such, so it’s very difficult for the reader to guess what is happening in this section. Section 2 actually reads as if the authors propose a new quality framework, which could be interesting but is arguably not what the section had embarked on.

In section 3.2 the authors should make more explicit whether they re-use the material of [97] for a subset of the systems there, or whether they extend that material. If that section contains original material, this should be made more explicit. If not, then it can be considerably shortened.

In section 3.3 it’s unclear whether all systems mentioned really “extract” temporal characteristics (as written in the title of the section), or if they just manage them:
- Semantic pingback as it is described in the text mostly focuses on cases where a dataset is being re-used in other sources. In principle the pinging doesn’t change the content of the original dataset, and thus doesn’t facilitate consistency and timeliness per se. It’s possible that the publishers would integrate changes based on the pings, but it’s not essential to the general pinging approach.
- Memento is a mechanism to serve different time versions of data. It represents data that can be used to compute temporal characteristics (for example number of versions) but it doesn’t extract them by itself.

In sections 4.2 and 4.3, tools are presented that compute statistics or make assessments based on them. But these tools don’t really motivate the need for statistics to be already expressed in a profile. On the contrary, they compute these statistics or extract features themselves, by querying the data and/or running inferences. It’s not very difficult to extract, say, the number of properties used in a dataset. And it’s more reliable than using already published profile data, which could be outdated - for data assessment tools this is crucial.
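
To illustrate how little is needed to compute such a statistic directly from the data, here is a minimal sketch. It assumes the dataset is already available in memory as (subject, predicate, object) tuples; the sample triples and the function name `property_usage` are hypothetical and do not come from the paper or from any of the tools it surveys.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:ds1", "rdf:type", "dcat:Dataset"),
    ("ex:ds1", "dct:title", "Example dataset"),
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]


def property_usage(triples):
    """Return a Counter mapping each predicate to its number of uses."""
    return Counter(predicate for _, predicate, _ in triples)


usage = property_usage(TRIPLES)
# The number of distinct properties is the number of distinct predicates.
distinct_properties = len(usage)
```

Because the count is derived from the triples at assessment time, it cannot be out of date in the way a previously published profile value could be.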

Then, the stats on vocabulary usage in section 5 are very promising, but they don’t look reliable. The data is from early 2015, and has probably changed one year later. There are some findings that are very surprising, such as Creative Commons being used for only 12 datasets. As the authors themselves write later in the paper, there must be more data out there with CC licenses.
More importantly, it’s uncertain whether table 2 really gives the info the authors claim it gives in the paragraph on p17-18. The text says that LOD2 is used to find info on how many times a vocabulary is used in datasets. But this doesn’t mean that the vocabularies are used in these datasets for dataset profiling (i.e. to describe datasets, e.g. instances of dcat:Dataset). For example, Dublin Core and FOAF can be used to describe many types of resources that are not datasets. If the authors have indeed filtered the LOD2 data for statements that are about datasets, this should be explained in more detail. Without these details, one will infer that the data is not about datasets, and thus that it’s not very informative for section 5 in general.
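
The distinction drawn here can be sketched as follows. This is only an illustration under stated assumptions: triples are given as in-memory tuples with prefixed names, a resource counts as a dataset when it is typed `dcat:Dataset` or `void:Dataset`, and the sample data and function names are hypothetical. It is not the procedure used by the authors.

```python
from collections import Counter

# Assumption: these two classes mark a resource as a dataset description.
DATASET_CLASSES = {"dcat:Dataset", "void:Dataset"}

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:ds1", "rdf:type", "dcat:Dataset"),
    ("ex:ds1", "dct:title", "Example dataset"),
    ("ex:ds1", "dct:license", "ex:cc-by"),
    ("ex:doc1", "dct:title", "A report"),
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
]


def dataset_subjects(triples):
    """Return the subjects explicitly typed as a dataset."""
    return {s for s, p, o in triples
            if p == "rdf:type" and o in DATASET_CLASSES}


def vocabulary_usage(triples, only_datasets=False):
    """Count predicate uses per vocabulary prefix.

    With only_datasets=True, only statements whose subject is a
    dataset are counted.
    """
    subjects = dataset_subjects(triples)
    counts = Counter()
    for s, p, _ in triples:
        if only_datasets and s not in subjects:
            continue
        counts[p.split(":", 1)[0]] += 1
    return counts


all_usage = vocabulary_usage(TRIPLES)
profiling_usage = vocabulary_usage(TRIPLES, only_datasets=True)
```

In this toy example the `dct` prefix is used three times overall but only twice in statements about a dataset, and `foaf` is not used for dataset description at all, which is the kind of gap the unfiltered counts would hide.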

Finally, the paper completely falls short on presenting the “RDF vocabulary for unambiguously identifying dataset features” that was promised in the abstract. A link is given to, but the elements of this vocabulary are not listed and documented. And there’s no instruction/example of how to use it. Shall it be combined with existing vocabularies? Is it an alternative to all of them, combined?

Some more minor comments (only selected bits, as there are just too many to report here):

- p1: “As the Web of Data is constantly evolving, manual assessment of dataset features is neither feasible nor sustainable.” This statement is debatable. Sure, it won’t be possible to profile all datasets manually, but one could argue that it would be a feature of good providers that they provide at least some profile metadata.

- p5: the difference between stability of URIs and stability of links is quite unclear, as the only definition given to characterize the stability of links refers to stability of URIs.

- p5: what does “explore the space of a given source, i.e., search and discover data sources of interest.” mean?

- p5: the relation between the references in section 3 and the criteria in section 2 is unclear at times. For “electing the smallest set of relevant predicates representing the dataset in the instance comparison task”, do the predicates correspond to a specific criterion in section 2? (Are they RDF predicates? And why does the paragraph say “we review” when it just lists the various aspects of the keys discovery approach?)

- p5: footnote 9 is not finished.

- p12: [83] reads more like state-of-the-art for section 2.2 than a vocabulary for 5.2. Same comment applies to the bullet list in the second column of this page.

- p12: why not mention EDOAL as a reference for alignment vocabulary, next to (or instead of) VoL? Why not mention that VoID also has a part for linksets? Why not mention daQ for data quality? SPIN on the other hand is not for expressing data quality features. Representing rules is quite different from representing the results of applying rules.

- p15: Dublin Core is also used by DCAT and many others for licensing, so it should be in 5.6 (and maybe also other sub-sections as it’s a very general vocabulary)

Review #3
Anonymous submitted on 19/Jun/2016
Major Revision
Review Comment:

The paper concerns dataset profiling and aims to be a guide for features, methods, applications and vocabularies.

The main comment on the current version of this paper is that it does not make a good and well-motivated review: it does list many items that are relevant but the way in which this is all put into a bigger picture is not well-motivated. That limits the value of the review. Authors should in a new version pay much more attention to the justification of passing on all these links.

Moreover, if the aim is to make this paper the go-to paper for this subject, kicking off a new piece of research, then the paper should contain more of a definition of the subject and a tangible contribution for other papers to cite and build upon. Currently, besides the paper’s subject, there is no concrete thing that other authors would start massively linking to and citing. If such a problem definition were added, the paper would have a much better chance of being cited and thus much more value.

All in all, the paper carries lots of interesting links and items but lacks in a clear justification and aggregation, to be a pivotal review paper.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. 50 (but see comments for how to make this a better review paper)
(2) How comprehensive and how balanced is the presentation and coverage. 30 (but see the comments for how to add justification and objectives etc.)
(3) Readability and clarity of the presentation. 50
(4) Importance of the covered material to the broader Semantic Web community. 40

Detailed comments:

The abstract mentions that the paper considers ‘works’ but does not address that, in the context of this subject, it would be interesting to learn from both scientific and industrial works or, framed differently, from people working towards profiling as researchers as well as people working where uptake takes place or could take place.
In the introduction it is suggested that the solution is coming from the side of researchers that unite and join a common path. That is a fair ambition, but requires two things: 1) that this choice is explicitly made and thus it is clear what the target audience is of the paper, 2) that this does not lead to yet another researcher proposal that does not get uptake. Notwithstanding this ambition, it is nice to see that for outsiders and beginners an overview is created with lots of useful pointers.
One thing that could be stressed more in the opening parts of the paper is why this is special and specific for Linked Data (as opposed to any dataset type).

In section 2, it would be nicer if the opening would explain both the objective of identifying the characteristics (including the separation into semantic etc.) and the justification of the way that this was done. After all, for a review like this it is important to justify the review approach, to indicate completeness etc.
When representing features, it would be good to perhaps include examples and concrete values to illustrate what they really are. Otherwise the true meaning of mentioning a feature is left too vague.
The other main comment regarding the presentation of these features is whether this is supposed to tell what the literature says (knowing that not all scientific papers have large impact) or whether this is aiming to make a ‘chosen’ summary of that (the authors making a weighted account of what the literature has reported).
The current text of the section looks like it could equally well have been given as a simple table or list, but of course the section should do more than just present a list; it should also make the reader understand how the list was composed. So, assuming that the content of the list is fair, it is recommended that the justification gets more attention.

In section 3 the authors use the word ‘review’. In addition to mentioning all the tools and approaches, it would be interesting to see what the review entails. After all, a review more or less by definition implies that the tools and approaches are considered from a chosen perspective, and it would be good to clarify the perspective and its motivation.
Similarly, at the end of the section, or at the end of the subsections, it would be nice to see some concluding and summarizing remarks. After having learned what the different items discussed do individually, it would be good to see what the authors see as the bigger picture for that aspect.

For the next sections I could repeat the same line of comments. A number of aspects are given, leading to subsections, but it is unclear whether these aspects pop up because this is what the literature offers or whether these were the aspects the authors carefully chose to use for their analysis of the state of the art. The value of the lists of items discussed depends strongly on the way the aspects are chosen and applied.

Section 5 appears to follow a different presentation style; it is unclear why. It is also not easy to understand how the different elements in this section go together.
In line with the unclear ambition of this section, it is odd to see that the authors start making recommendations in this reviewing section. I would strongly recommend separating the observations from the aggregate conclusions and from any subsequent recommendations.