Meta-Data for a lot of LOD

Tracking #: 1282-2494

Laurens Rietveld
Wouter Beek
Rinke Hoekstra
Stefan Schlobach

Responsible editor: 
Aidan Hogan

Submission type: 
Dataset Description
This paper introduces the LOD Laundromat meta-dataset, a continuously updated RDF meta-dataset describing documents that are crawled, cleaned and (re)published by the LOD Laundromat. This meta-dataset of over 110 million triples contains structural information for more than 650,000 documents (and growing). While traditionally dataset meta-data is often not provided, incomplete, or incomparable in the way it was generated, the LOD Laundromat meta-dataset provides a wide variety of structural dataset properties, including the number of triples in LOD Laundromat documents, the average degree in documents, and the distinct number of Blank Nodes, Literals and IRIs. This makes it a particularly useful dataset for data comparison and analytics, as well as for the global study of the Web of Data.

Minor Revision

Solicited Reviews:
Review #1
By Juergen Umbrich submitted on 04/Apr/2016
Minor Revision
Review Comment:

I would like to thank the authors for their effort in improving the submission.
Especially, the added visualisation of the dependency graph (Fig.2), the metadata description (Fig.3) and provenance model (Fig.4) help to better understand how the dataset can be explored and used.

In addition, the example queries provided in the use case section are nice and give some ideas for further use of the dataset.

However, some previous comments were not directly addressed:

*) I think it would be nice to get some idea of how people are currently using the dataset, by describing the types of the 20,606,194 SPARQL queries. Maybe the authors could inspect the WHERE clause of the queries; e.g., how many queries use filters, how many triple patterns they contain, etc.
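As a rough illustration of the kind of breakdown meant here, the following Python sketch tallies query forms and FILTER usage over a list of query strings. The sample queries and the function name `summarize` are hypothetical, and the keyword matching is only a heuristic (a keyword inside a literal or IRI would be miscounted); counting triple patterns reliably would need a proper SPARQL parser.

```python
import re
from collections import Counter

# Hypothetical sample; a real analysis would read the endpoint's query log.
SAMPLE_QUERIES = [
    "SELECT ?d WHERE { ?d ?p ?o . FILTER(?o > 1000) }",
    "ASK { ?s ?p ?o }",
    "SELECT ?s WHERE { ?s a ?c . ?s ?p ?o }",
]

# Heuristic keyword matchers (case-insensitive, whole words only).
FILTER_RE = re.compile(r"\bFILTER\b", re.IGNORECASE)
FORM_RE = re.compile(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def summarize(queries):
    """Count queries per query form and how many mention FILTER."""
    forms = Counter()
    with_filter = 0
    for query in queries:
        match = FORM_RE.search(query)
        forms[match.group(1).upper() if match else "UNKNOWN"] += 1
        if FILTER_RE.search(query):
            with_filter += 1
    return {
        "total": len(queries),
        "with_filter": with_filter,
        "forms": dict(forms),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_QUERIES))
```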

*) In agreement with the review of Sebastian Hellmann, Section 4.5 is still not really about the statistics of the dataset itself, but about its usage.
It would be nice to have statistics about the dataset itself: number of distinct properties, number of classes, etc.
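To make the request concrete, a minimal sketch of such statistics in Python follows. It assumes the triples are available as (subject, predicate, object) string tuples and that "classes" means the distinct objects of rdf:type statements; the sample IRIs under example.org and the function name `schema_stats` are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical sample triples as (subject, predicate, object) strings.
SAMPLE_TRIPLES = [
    ("http://example.org/doc1", RDF_TYPE, "http://example.org/Document"),
    ("http://example.org/doc1", "http://example.org/triples", "120"),
    ("http://example.org/doc2", RDF_TYPE, "http://example.org/Document"),
    ("http://example.org/doc2", "http://example.org/triples", "75"),
    ("http://example.org/m1", RDF_TYPE, "http://example.org/Metric"),
]


def schema_stats(triples):
    """Count triples, distinct properties, and distinct classes.

    Assumption: a class is any object of an rdf:type statement.
    """
    properties = set()
    classes = set()
    total = 0
    for _subject, predicate, obj in triples:
        total += 1
        properties.add(predicate)
        if predicate == RDF_TYPE:
            classes.add(obj)
    return {
        "triples": total,
        "distinct_properties": len(properties),
        "distinct_classes": len(classes),
    }


if __name__ == "__main__":
    print(schema_stats(SAMPLE_TRIPLES))
```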

*) The dissemination process could also be further improved by providing insights into how the statistics are updated.
How well does the LOD Laundromat scale for the nightly builds? Is it possible to rerun the extraction of the statistics in less than 12 hours, and if not, how long does it take?
This would be crucial information for someone who uses the dataset and relies on up-to-date statistics.

*) Table 1: I would still add the URIs for the meta-data properties.

Considering the SWJ evaluation criteria:

(1) Quality and stability of the dataset: the data is available and can be considered as stable.
The authors detail to some extent the process by which the data is generated.
One minor issue regarding the quality is the lack of details about how up-to-date the metadata is.
The authors claim to recompute the statistics every night, but it is not clear how long the process takes and whether the metadata is therefore actually up-to-date.

(2) Usefulness of the dataset:
There is little doubt that the dataset is useful. The authors provide a good motivation and show use cases in which the meta-dataset can be used (e.g. verifying claims in papers, finding datasets with specific features, etc.).

(3) Clarity and completeness of the descriptions:
The description of the dataset was significantly improved by the added figures showing the vocabularies used, the schema, etc.
As such, it should be easy to explore and navigate the datasets based on the paper and the example queries.

Overall, I think the authors meet the requirements of SWJ with respect to a dataset paper.