LOD in a Box: The C-LOD Meta-Dataset

Tracking #: 868-2078

Authors: 
Laurens Rietveld
Wouter Beek
Stefan Schlobach

Responsible editor: 
Aidan Hogan

Submission type: 
Dataset Description
Abstract: 
This paper introduces the C-LOD (Clean & Linked Open Data) Meta-Dataset, a continuously updated Meta-Dataset of the LOD cloud, tightly connected to the (re)published corresponding datasets, which are crawled and cleaned by the LOD Laundromat. The C-LOD Meta-Dataset contains meta-data for over 12 billion triples (and growing). Whereas dataset meta-data is traditionally often missing, incomplete, or generated in incomparable ways, the C-LOD Meta-Dataset provides a wide variety of such properties using standardized vocabularies. This makes it a particularly useful dataset for data comparison and analytics, as well as for the global study of the Web of Data.
Full PDF Version: 
Revised Version:
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Sebastian Hellmann submitted on 23/Mar/2015
Suggestion:
Major Revision
Review Comment:

General aspects of the paper.
This paper describes, in concise and clear terms, the C-LOD Meta-Dataset, which describes meta-data of datasets recrawled from the LOD Cloud. The crawling is performed by the LOD Laundromat tool. In the interpretation of the authors, re-using standardized vocabularies means creating yet another ontology and linking it to existing vocabularies. The aim of the work is to provide useful data for statistical analysis and comparison, as well as for the global study of the Web of Data. The authors point out problems with the current vocabularies for dataset description and illustrate these problems with examples (e.g. the void:properties property).

Specific aspects:
(1) Quality and stability of the dataset:
The dataset relies on republished datasets crawled and cleaned using the LOD Laundromat, a tool published at ISWC 2014. However, the quality of the dataset is still not sufficiently demonstrated for the following reasons:
Provenance is one requirement stated at the beginning of the paper. In the examples we inspected, such as http://lodlaundromat.org/resource/f63aebbd3b867726c89d151d343dab0d, we were unable to find a link to the original source.
Overall, the ontology adds only 5 properties missing from the VoID-ext vocabulary. Since reusability is one of the stated requirements of the paper, we are unsure whether this design choice can be justified. Doesn't this end up with everybody reinventing vocabularies for their application? The authors need at least to rephrase their description: it is clearly not reuse, but the creation of a special-purpose vocabulary that is linked to existing vocabularies. The archived data should be better described.
We had problems finding the ontology. Properties like http://lodlaundromat.org/ontology/unpackedSize were dead links, and http://lodlaundromat.org/ontology/ does not return anything.
Although the paper describes license properties as part of the model, no evidence of license descriptions was found in the dataset.
The dataset provided at http://download.lodlaundromat.org/dump.nt.gz does not exhibit the quality that the paper describes.

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses - more evidence must be provided:
As third-party usage, the paper cites only the PrefLabel service (http://preflabel.org/). Although it is possible to verify the usage of the dataset by inspecting the PrefLabel code on GitHub, its maintainer and the authors are from the same organisation. I recommend that the authors cite more services that use the C-LOD Meta-Dataset.
It is evident that the C-LOD Meta-Dataset is useful for comparing and overviewing the current state of the structure of different datasets. This is well described in the first two use cases of the paper; however, the authors should consider that the lack of properties describing licenses might be an issue when trying to find out whether data is public or not. Especially in the second use case, algorithms should know whether a dataset is publicly available for data extraction. The third and last use cases present good examples of the dataset's usefulness.

(3) Clarity and completeness of the descriptions.
The paper is well written, and the authors provide information about how meta-data is fetched and the dataset created. A list of requirements is also provided, and each item is described in detail. Although this does not compromise the general quality of the paper, a poor-quality figure is provided for the dataset evaluation; I would strongly suggest that the authors remake the figure. Table 1 compares the used vocabulary with other relevant vocabularies, such as Bio2RDF, VoID and VoID-ext, but it does not demonstrate reuse of standard vocabularies.
The importance of tracking provenance is emphasized multiple times, since different computational procedures can calculate different values for the same property of the same dataset. The authors cite PROV-O as part of the work, but again, no evidence of provenance was found in the dataset (or in the ontology).
Finally, the authors provide some use cases, which make the dataset's usefulness even clearer.

Further strengths of the paper:

* Centralized dataset descriptions, accessible via a SPARQL endpoint.
* The C-LOD Meta-Dataset is updated daily, and its meta-data is generated through a streaming process.
* The general problem description is consistent and justifies the importance of having a collection of dataset descriptions.
* 4 of 5 stars have been reached. The vocabulary is recent; to work toward the fifth star, it was submitted to Linked Open Vocabularies, allowing better discovery and reuse.

It's also important to point some more weaknesses:
* Missing general dataset properties, like license, format, provenance, language, etc.
* The dataset generation process does not appear to be extensible to sources other than the LOD Laundromat.
* The authors could compare numbers side-by-side with different approaches (e.g. LODStats) and evaluate the results. No evaluation of the correctness of the data has been given (although one could assume that the implementation of meta-data collection is not particularly difficult or error-prone).
* Figure 1 is very hard to understand, since it is of poor quality and there are no values on the X axis.

In conclusion, the paper should be accepted, but major revisions are required based on our comments. (This review was written jointly by Sebastian Hellmann and Ciro Baron.)

Review #2
By Aidan Hogan submitted on 14/Apr/2015
Suggestion:
Major Revision
Review Comment:

The authors describe a meta-dataset that describes 12 billion triples of Linked Data sourced from various locations. In general, the dataset appears very useful in my eyes as a sort of empirical version of the datahub.io catalogue: rather than relying on publisher-submitted, potentially biased meta-data, it generates dataset descriptions using a consistent framework under more "controlled" conditions. Thus the work appears to be a useful contribution to the community.

However, in my mind, the paper needs quite a lot of work. In this review, I just wanted to add some clarifications that the authors should make and improvements to the writing that are needed. (The overall meta-review including the comments of other reviewers will follow with the decision letter.)

# The abstract does not immediately make it clear what the purpose or the scope of the dataset is. It states that C-LOD is a continuously updated "Meta-Dataset of the LOD cloud", but it was not clear at all to me what that means, first because I don't know what a meta-dataset is, and second because the LOD cloud is a notoriously nebulous concept. I'd like more concrete details here, something like how many datasets are indexed, how many sources are tapped, what the size of the resulting "meta-dataset" is, etc.

# Relatedly, a lot of the phrasing makes it immediately unclear if the C-LOD dataset contains 12 billion triples or if it contains the meta-data for 12 billion triples. In fact in almost all cases where this figure was mentioned the wording was ambiguous. Please clarify this throughout.

# What is a "Linked Data Document"? I presume the intention here is to refer to dumps, but that is not immediately clear. In fact, 26,000 would be a tiny corpus considering datasets like the BTC regularly contain millions of documents. Could you say (at least even intuitively) what the 26,000 refers to?

# What are "Big Data research scenarios" mentioned in the introductory paragraph specifically? Can you name examples? Otherwise to me it honestly sounds vaguely hand-wavy.

# The related work section seems a little light. For example, surely you should cite the original VoID paper to give proper credit:

Keith Alexander, Richard Cyganiak, Michael Hausenblas, Jun Zhao:
Describing Linked Datasets. LDOW 2009

Likewise there are other works that seem to be directly relevant and that should be discussed:

Olaf Hartig, Jun Zhao:
Publishing and Consuming Provenance Metadata on the Web of Linked Data. IPAW 2010: 78-90

Tope Omitola, Landong Zuo, Christopher Gutteridge, Ian Millard, Hugh Glaser, Nicholas Gibbins, Nigel Shadbolt:
Tracing the provenance of linked data using voiD. WIMS 2011: 17

Potentially there are more works that should be discussed. In general, a careful treatment of related work is important irrespective of the track.

# You talk about computing metrics only via streaming, which avoids loading large parts of the dataset into memory. Yet your metrics include things like distinct IRIs, which require some sort of uniquing procedure over the names in the dataset. Is this not contradictory? If not, how are metrics like distinct IRIs counted via streaming without loading a significant part of the data?
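For context on this question: one common way to count distinct elements in a stream without keeping all names in memory is probabilistic counting, e.g. a HyperLogLog-style sketch that trades a small, tunable error for constant memory. Whether the paper's implementation does this is exactly what the reviewer asks; the following is only an illustrative sketch, not the authors' method:

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counter: memory is O(2^p) registers,
    independent of how many items the stream contains."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Hash the item to a 160-bit integer (SHA-1 used for determinism).
        h = int(hashlib.sha1(item.encode("utf-8")).hexdigest(), 16)
        idx = h & (self.m - 1)       # low p bits select a register
        w = h >> self.p              # remaining bits feed the rank
        rank = 1                     # rank = 1 + number of trailing zero bits
        while w & 1 == 0 and rank <= 160 - self.p:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        est = alpha * self.m * self.m * z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=12 the sketch uses 4096 registers and has a relative standard error of roughly 1.6%, regardless of whether the stream contains thousands or billions of IRIs.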

# A summary of metrics about the dataset would be useful, including the number of triples processed, the number of sources processed, the number of triples computed, the number of datasets mentioned in the output, historical data, etc.

# The expression of third-party use is really quite weak. Can you point to any other (possibly soft) evidence that your dataset is in use; for example, can you point to SPARQL query logs?

# The main use-case appears to be the first one mentioned in a short paragraph. I would recommend to expand on this. The second and third use-cases are a bit vague but okay. The fourth use-case feels like too far of a stretch and its presence, for me, weakens the whole section and gives the impression that the authors were struggling to think up use-cases. I would recommend (i) to extend the first use-case and maybe talk more about exploring and finding datasets; talk about the potential benefits of the dataset over something like datahub; (ii) provide some working SPARQL queries over your endpoint to show some concrete examples; (iii) merge the second and third use-cases into one, (iv) drop the use-case about PageRank.
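As an illustration of recommendation (ii), a concrete query against such an endpoint might look as follows. The llo: namespace matches the ontology IRI mentioned in Review #1, but the property names (llo:triples) are assumptions based on the paper's description, not verified against the live endpoint; the query is rendered as a Python string for readability:

```python
# Hypothetical SPARQL query: the ten largest cleaned documents by triple count.
# The llo:triples property name is an assumption, not verified against the
# live LOD Laundromat endpoint.
QUERY = """
PREFIX llo: <http://lodlaundromat.org/ontology/>
SELECT ?doc ?triples
WHERE {
  ?doc llo:triples ?triples .
}
ORDER BY DESC(?triples)
LIMIT 10
"""

if __name__ == "__main__":
    print(QUERY.strip())
```

A couple of worked queries of this kind, shown to actually run against the endpoint, would make the first use-case far more concrete than the current prose description.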

MINOR COMMENTS AND TYPOS:

* Starting section 2 with a numbered reference looks quite ugly.
* Fix the bad box at "IRIs of the form ..."
* that generate{s} meta-data
* "format. (one example is LOD-Stats)" Put the full-stop after the parentheses.
* A max in-degree of 7 million is not all that strange. First, you did not exclude triples with the predicate "rdf:type", where it would not be surprising for a class to have 7 million instances. Second, entities like categories in DBpedia are likely to have such an in-degree. I'm not really sure what point you are making here.


Comments

The paper “LOD in a Box” introduces the Clean and Linked Open Data (C-LOD) Meta-Dataset, which makes it possible to publish comparable dataset descriptions. This novel vocabulary is disseminated as LOD and offers algorithmically generated information for currently about 13.5 billion triples in 26,000 documents – all crawled by the LOD Laundromat.

The authors point out that existing datasets, such as LODStats and Sindice, show a lack of comparable meta-data, leave a lot of room for interpretation, and/or ignore information on the meta-data creation process. A requirements analysis yields seven key aspects that lead to the new LOD Meta-Dataset; these aspects are subsequently prioritized in the paper. Key characteristics and novel meta-data properties are introduced, explained in detail, and comparatively discussed, delimiting the introduced dataset from other specifications such as VoID, Bio2RDF and VoID-ext. The vocabulary, generation code and daily updated data dumps (currently 10 GB of raw data) are available via the given access points. The paper concludes with a summary and two possible directions of future work regarding more efficient creation of the C-LOD Meta-Dataset.

This contribution is relevant to the community and has a high (potential) usefulness, as it offers a novel way of accessing recent descriptive meta-data on publicly available datasets that can be used for comparison and analysis purposes. New statistical properties, such as standard deviation and min. and max. values, as well as the introduced provenance trail, are important for an adequate interpretation of dataset quality. However, the current version 1.0.0 could be extended with even more (statistical) values, e.g. mean deviation, variance and other parameters of dispersion. Certainly, additional requirements will be defined by the community.

The paper is well written, lucid and comprehensible. The terminology used is technically correct and the specification seems to be valid. The paper's structure is well organized and the purpose of this Linked Dataset becomes clear. Nevertheless, the paper could benefit from some introductions and more references. For instance, “SW algorithms” are mentioned in the use-case section but neither explained nor cited; even though the abbreviation stands for Semantic Web, it could also mean the Smith–Waterman algorithm. And VoID, for instance, is referenced in section 3 but already used in section 2.

The paper provides all required information, such as name, URL, versioning, licensing and availability, which in particular is stated in the used vocabulary. The introduced vocabulary in turn re-uses other established vocabularies, e.g. PROV-O, HTTP and the Error Ontology. Since dumps are available on a regular basis and application developers can use SPARQL for querying the data, the lack of concrete applications using this vocabulary is of no consequence. However, the paper would benefit from some more evaluation results, similar to the maximum in-degree distribution figure, in order to underline the importance of the introduced vocabulary. In terms of the Five Stars of Linked Data Vocabulary Use, the authors are right to classify their vocabulary as 4/5 stars, since this vocabulary is not yet linked to by other vocabularies.

Paper strengths:
• Novel LOD Meta Dataset representation for comparison purposes based on a sophisticated requirements analysis
• Clear specification including new properties for statistical evaluations
• Available sources: vocabulary (Creative Commons 3), code on GitHub, and daily updated data dumps

Paper weaknesses:
• Little evaluation of mined and compared meta-data
• Lack of some introductions and references
• Not yet re-used by other vocabularies (5th star of Linked Data Vocabulary Use)

I would encourage requiring some minor revisions and would ask the authors to update the paper on the above-mentioned points.