Meta-Data for a lot of LOD

Tracking #: 1153-2365

Laurens Rietveld
Wouter Beek
Stefan Schlobach
Rinke Hoekstra

Responsible editor: 
Aidan Hogan

Submission type: 
Dataset Description

This paper introduces the LOD Laundromat Meta-Dataset, a continuously updated meta-dataset of the LOD Laundromat, tightly connected to the corresponding (re)published datasets that are crawled and cleaned by the LOD Laundromat. The Meta-Dataset contains structural information for over 38 billion triples (and growing). Whereas dataset meta-data is traditionally often missing, incomplete, or generated in incomparable ways, the LOD Laundromat Meta-Dataset provides a wide variety of structural dataset properties using standardized vocabularies. This makes it a particularly useful dataset for data comparison and analytics, as well as for the global study of the Web of Data.

Minor Revision

Solicited Reviews:
Review #1
By Sebastian Hellmann submitted on 02/Nov/2015
Review Comment:

This review was written together with Ciro Baron.

The lack of accurate and high quality metadata describing datasets is an evident problem with available (Linked Data) datasets. The authors extracted metadata from over 650,000 datasets, making the current paper quite unique in terms of size and effort of publishing dataset descriptions. This work is relevant and I strongly believe the proposed dataset has significant value, if done properly. That said, I think that the paper still has significant room for improvement as described below.

Our overall recommendation is to reject and resubmit, because of these three issues:
1. Reading the reviews of the first submission again, we find many points from us and the other reviewers that still have not been addressed. So the two-strike rule is definitely applicable.
2. Given that over half of the paper is about the vocabulary and not about the data, we suggest splitting it into an ontology paper and a linked dataset paper, to give more focus to the individual parts.
3. Using the five-star vocabulary rating, we started out with 0 stars, as the ontology was not available. Now it barely earns one star: there is (2nd ★) no usage of RDFS or OWL, (3rd ★) no links to other ontologies, and (4th ★) no vocabulary meta-data.


1. Improper reuse of vocabulary
We looked at the provided example and found a lot of issues that are still quite problematic. The biggest problem we see is the improper and insufficient reuse of existing vocabularies, especially DCAT and PROV-O. The cleaned dataset resource should be a dcat:Distribution that is also typed as prov:Entity. Then you can use PROV-O to link to the Laundromat dataset using prov:wasDerivedFrom. This derivation wasGeneratedBy some prov:Activity. The use of PROV-O is somewhat too minimal, especially since the dataset is the result of quite a few derivation processes.
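A minimal Turtle sketch of this modeling (all resource names and the download URL are illustrative, not taken from the actual dataset):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# The cleaned data file: a DCAT distribution that is also a PROV entity.
ex:cleanedDistribution
    a dcat:Distribution , prov:Entity ;
    dcat:downloadURL    <http://example.org/clean.nt.gz> ;
    prov:wasDerivedFrom ex:originalDistribution ;
    prov:wasGeneratedBy ex:cleaningActivity .

# The cleaning run as a PROV activity.
ex:cleaningActivity
    a prov:Activity ;
    prov:used ex:originalDistribution .
```

With this shape, any DCAT- or PROV-aware client can follow the derivation chain without knowing the Laundromat's own ontology.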
Most importantly, we find the design choice to simply recreate existing properties in their own ontology quite strange. Many ontologies, such as DCAT, are standards, so using a custom property "url" instead of dcat:downloadURL is in fact a proprietary format. We advise the authors to read the DCAT and PROV-O specifications again and use these properties directly, so that any DCAT or PROV-O implementation can easily parse the dataset you created. Otherwise you introduce unnecessary interoperability problems and force clients to implement an extra parser for your ontology.
In cases where you deviate from standards like DCAT, we strongly advise you to add rdfs:subPropertyOf or rdfs:subClassOf statements to the ontology, so that your properties can be automatically recognized by implementations via reasoning (be aware that relying on reasoning is annoying for clients). If you are unable to use subPropertyOf/subClassOf, you should include a justification for the deviation in the ontology itself.
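For instance, assuming the ontology uses an llo: prefix (an illustration, not necessarily the actual namespace), the deviation could be declared and justified as:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix llo:  <http://lodlaundromat.org/ontology/> .

# Link the custom property to its DCAT counterpart, so reasoners can
# map llo:url statements to dcat:downloadURL automatically.
llo:url
    rdfs:subPropertyOf dcat:downloadURL ;
    rdfs:comment "Location the original data document was downloaded from. Kept as a separate property because <justification goes here>." .
```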

2. Lack of description
- A major point of criticism is that the actual dataset is not sufficiently described. The statistics in Section 4.5 are for the LOD Laundromat itself, which is another dataset. Figure 2 is one of the few contributions that gives insight into the actual data contained in the meta-data dataset. The remainder of the paper discusses design choices for the ontology. It seems easy enough to write some SPARQL SUM and AVG queries to get an idea of both the datasets described by the meta-data and the meta-data dataset itself. Class and property usage statistics would also be interesting.
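As an illustration, a query along the following lines (the llo:triples property IRI is an assumption about the ontology) would already yield useful aggregate figures:

```sparql
PREFIX llo: <http://lodlaundromat.org/ontology/>

# Number of described documents, plus the total and average
# number of triples over all of them.
SELECT (COUNT(?doc)   AS ?documents)
       (SUM(?triples) AS ?totalTriples)
       (AVG(?triples) AS ?avgTriples)
WHERE {
  ?doc llo:triples ?triples .
}
```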
- Another issue is that it is easy to find nodes containing statistical data that are very hard to understand as such; neither the paper nor the ontology itself offers any explanation for them.

3. Unclear role of the described ontology
- Table 1 is quite empty. The right part can be squeezed a bit, and two columns with usage statistics as well as datatype/range could be added. The top contains a huge amount of empty space. In our opinion, it is not sufficient to just add a reference to online statistics or other papers, such as footnote 26. The description of how datasets are converted to RDF HDT is also not included; the reader is referred to [18]. While some of the properties are motivated, others are not clearly useful.
- Several statistical properties, such as mean, standard deviation, and median, are not explained in the paper although they are part of the ontology. Even from the ontology description it is not possible to tell what they are used for and why they are there; they are poorly described. A better explanation of these properties would considerably improve the paper.
- We strongly recommend adding the license property. The authors claim that only 0.5% of datasets contain any kind of license data, which still amounts to a considerable number of more than 3,000 datasets.
- How is void-ext:language used? A better explanation of it would be very welcome.
- The description of in-degree and out-degree in the ontology seems wrong: the out-degree of a node counts the triples in which it occurs as subject, and the in-degree the triples in which it occurs as object.

4. Regarding the terminology:
- What exactly is a Meta-Dataset? Do you mean Meta-Data Dataset? You could even call it the LOD Laundromat-Meta dataset.
- What is a Linked Data Document? Why do you capitalize "Document"?
- I found data description, where dataset description was intended.
- I am unsure whether "Degree, in-degree, out-degree" should be capitalized. They are common measures in my opinion unlike F-Measure.

Paper specific issues and details:
Section 3.1:
- For completeness, DataID might be mentioned, but we are not insisting, as the work originated in our group and it is a rather minor aspect:
DataID: Towards Semantically Rich Metadata for Complex Datasets, by Martin Brümmer, Ciro Baron, Ivan Ermilov, Markus Freudenberg, Dimitris Kontokostas, and Sebastian Hellmann. In Proceedings of the 10th International Conference on Semantic Systems.
- We suggest merging Sections 2 and 3 into a common section, "Domain analysis and requirements". Section 3 talks a lot about LODStats and other approaches mentioned in Section 2, which is an indicator that they belong together.

- The requirements do not reflect some of the findings in the analysis, e.g.:
-- Req 1 is very broad. Could you specify what you mean by "wide scope"?
-- Req 2: must use standard vocabularies in a non-ambiguous way, as discussed for "void:entities". We understand that you are patching current standards in order to cope; this should be mentioned. We basically had similar problems in DataID with the too-generic DCAT and the insufficient VoID.
The recommendation is not prescriptive: standards "should" be used if applicable, and otherwise not. You write this yourself in Section 4.1.
-- Req 3: the part of the meta-data that can be automatically generated must be automatically generated.
-- Req 4 starts with scalability, but then justifies Req 1 in the second sentence; heterogeneity is Req 1. Req 4 should rather state the requirement that only those properties may be used whose values can be collected in a single pass via streaming. In fact, this requirement is mentioned in Section 4.1.
-- Quality generally requires consistently complete properties, but this is not mentioned as a requirement, i.e. properties that (1) cover the necessary information properly and (2) are consistently extracted without missing values.
- Req 7 reads like one of the criteria of the SWJ to get a linked data set accepted. Does anybody besides SWJ formulate such a requirement?

Section 4.1 motivates the properties used in the dataset.
- The use of provenance could receive its own subsection. The actual connection to DCAT is unclear to us and not elaborated. Is it true that DCAT is not used at all because it cannot be generated automatically?
- It is hard to understand why the dataset description might be larger than the dataset itself. Does the meta-data dataset store all the distinct subject resources, or only their count? In that context, it is also not clear what a dataset "partition" means.
- Does Figure 1 describe the datasets' degree? The figure caption mentions the word "indegree". Also, the term "combined" degree is confusing and a bit weird here.
Section 4.5
- This section contains statistical data that should be replaced by a relevant statistical analysis generated from llm:descriptiveStatistics. A large number of datasets were crawled, yet the paper lacks even a basic statistical analysis of them.
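For example, a query of roughly the following shape (the exact property IRIs hanging off the statistics node, such as llm:mean, llm:median, and llm:std, are assumptions about the ontology) could drive such an analysis:

```sparql
PREFIX llm: <http://lodlaundromat.org/metrics/ontology/>

# Per-document descriptive statistics of the out-degree,
# ordered by the mean.
SELECT ?doc ?mean ?median ?std
WHERE {
  ?doc   llm:outDegree ?stats .
  ?stats llm:mean      ?mean ;
         llm:median    ?median ;
         llm:std       ?std .
}
ORDER BY DESC(?mean)
```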

- Figure 2 shows the serialization formats. There should be more charts like this: the authors still have half a page left to show statistical data, considering they are describing more than 650,000 datasets. For example: the most-used classes and properties, or the top-N datasets with the highest and lowest degrees.

- spelling:
-- Please use "meta-data", "meta data", or "metadata" consistently; see the first sentence of the introduction, where, contrary to the title, "metadata" is used.
-- Variations of the same word (e.g. "In Degree" and "indegree") occur.