Review Comment:
The authors present a meta-dataset describing 12 billion triples of Linked Data sourced from various locations. In general, the dataset appears very useful in my eyes as a sort of empirical version of the datahub.io catalogue: rather than relying on publisher-submitted, potentially biased meta-data, it generates dataset descriptions using a consistent framework under more "controlled" conditions. Thus the work appears to be a useful contribution to the community.
However, in my mind, the paper needs quite a lot of work. In this review, I list some clarifications the authors should make and some needed improvements to the writing. (The overall meta-review, including the comments of other reviewers, will follow with the decision letter.)
# The abstract does not immediately make it clear what the purpose or the scope of the dataset is. It states that C-LOD is a continuously updated "Meta-Dataset of the LOD cloud", but it was not clear at all to me what that means: first because I don't know what a meta-dataset is, and second because the LOD cloud is a notoriously nebulous concept. I'd like more concrete details here, such as how many datasets are indexed, how many sources are tapped, what the size of the resulting "meta-dataset" is, etc.
# Relatedly, a lot of the phrasing leaves it unclear whether the C-LOD dataset contains 12 billion triples or whether it contains the meta-data for 12 billion triples. In fact, in almost all cases where this figure was mentioned, the wording was ambiguous. Please clarify this throughout.
# What is a "Linked Data Document"? I presume the intention here is to refer to dumps, but that is not immediately clear. In fact, 26,000 would be a tiny corpus considering that datasets like the BTC regularly contain millions of documents. Could you say (at least intuitively) what the 26,000 refers to?
# What are the "Big Data research scenarios" mentioned in the introductory paragraph, specifically? Can you name examples? Otherwise, to me, it honestly sounds hand-wavy.
# The related work section seems a little light. For example, surely you should cite the original VoID paper to give proper credit:
Keith Alexander, Richard Cyganiak, Michael Hausenblas, Jun Zhao:
Describing Linked Datasets. LDOW 2009
Likewise there are other works that seem to be directly relevant and that should be discussed:
Olaf Hartig, Jun Zhao:
Publishing and Consuming Provenance Metadata on the Web of Linked Data. IPAW 2010: 78-90
Tope Omitola, Landong Zuo, Christopher Gutteridge, Ian Millard, Hugh Glaser, Nicholas Gibbins, Nigel Shadbolt:
Tracing the provenance of linked data using voiD. WIMS 2011: 17
Potentially there are more works that should be discussed. In general, a careful treatment of related work is important irrespective of the track.
# You state that metrics are computed via streaming so as to avoid loading large parts of a dataset into memory. Yet your metrics include things like the number of distinct IRIs, which requires some sort of uniquing procedure over the names in the dataset. Is this not contradictory? If not, how are metrics like distinct IRIs counted via streaming without loading a significant part of the data?
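To make my question concrete: one standard way to reconcile the two would be an approximate-counting sketch, which estimates the number of distinct items in a single pass with memory independent of the stream size. The following minimal k-minimum-values (KMV) estimator in Python is purely illustrative on my part; I am not suggesting it is the authors' implementation, and the function name is my own invention:

```python
import hashlib
import heapq

def kmv_estimate(stream, k=256):
    """Estimate the number of distinct items in a stream using the
    k-minimum-values sketch: keep only the k smallest normalized hash
    values ever seen. Memory is O(k), independent of stream size."""
    heap = []      # max-heap (via negation) of the k smallest hashes
    in_heap = set()  # hash values currently held, to skip duplicates
    for item in stream:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        x = h / 2**64  # normalize the hash to [0, 1)
        if x in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -x)
            in_heap.add(x)
        elif x < -heap[0]:
            # evict the current k-th smallest, insert the new value
            evicted = -heapq.heappushpop(heap, -x)
            in_heap.discard(evicted)
            in_heap.add(x)
    if len(heap) < k:
        return len(heap)            # fewer than k distinct items: exact
    return int((k - 1) / -heap[0])  # k-th smallest hash is roughly k / n
```

If the paper used something along these lines (KMV, HyperLogLog, or similar), saying so explicitly would resolve the apparent contradiction; if the counts are exact, the memory claim needs qualification.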
# A summary of metrics about the dataset would be useful, including the number of triples processed, the number of sources processed, the number of meta-data triples generated, the number of datasets mentioned in the output, historical data, etc.
# The expression of third-party use is really quite weak. Can you point to any other (possibly soft) evidence that your dataset is in use; for example, can you point to SPARQL query logs?
# The main use-case appears to be the first one, mentioned in a short paragraph. I would recommend expanding on this. The second and third use-cases are a bit vague but okay. The fourth use-case feels like too much of a stretch, and its presence, for me, weakens the whole section and gives the impression that the authors were struggling to think up use-cases. I would recommend (i) extending the first use-case, perhaps talking more about exploring and finding datasets and about the potential benefits of the dataset over something like datahub; (ii) providing some working SPARQL queries over your endpoint to show concrete examples; (iii) merging the second and third use-cases into one; (iv) dropping the use-case about PageRank.
MINOR COMMENTS AND TYPOS:
* Starting section 2 with a numbered reference looks quite ugly.
* Fix the bad box at "IRIs of the form ..."
* that generate{s} meta-data
* "format. (one example is LOD-Stats)" Put the full-stop after the parentheses.
* A max in-degree of 7 million is not all that strange. First, you did not exclude triples with the predicate rdf:type, where it would not be surprising for a class to have 7 million instances. Second, entities like categories in DBpedia are likely to have such an in-degree. I'm not sure what point you are making here.
Comments
Review - Suggestion: Minor revision
The paper “LOD in a Box” introduces the Clean and Linked Open Data (C-LOD) Meta-Dataset, which allows comparable dataset descriptions to be published. This novel vocabulary is disseminated as LOD and offers algorithmically generated information for currently about 13.5 billion triples in 26,000 documents, all crawled by the LOD Laundromat.
The authors point out that existing datasets, such as LODStats and Sindice, lack comparable meta-data, leave a lot of room for interpretation, and/or ignore information on the meta-data creation process. A requirements analysis yields seven key aspects that lead to the new LOD Meta-Dataset; these seven aspects are then prioritized in the paper. Key characteristics and novel meta-data properties are introduced, explained in detail, and discussed comparatively by delimiting the introduced dataset from other specifications, such as VoID, Bio2RDF, and VoID-ext. The vocabulary, generation code, and daily updated data dumps (currently 10 GB of raw data) are available via the given access points. The paper concludes with a summary and two directions for future work regarding more efficient creation of the C-LOD Meta-Dataset.
This contribution is relevant to the community and has high (potential) usefulness, as it offers a novel way of accessing recent descriptive meta-data on publicly available datasets that can be used for comparison and analysis purposes. New statistical properties, such as standard deviation and min. and max. values, as well as the introduced provenance trail, are important for an adequate interpretation of dataset quality. However, the current version 1.0.0 could be extended with even more statistical values, e.g. mean deviation, variance, and other measures of dispersion. Certainly, additional requirements will be defined by the community.
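In support of this suggestion: the dispersion measures mentioned above would fit the existing single-pass processing model, since mean and variance can be computed in one streaming pass with constant memory via Welford's algorithm. A minimal sketch in Python, purely illustrative on my part and not taken from the paper's code:

```python
def streaming_mean_variance(values):
    """Welford's one-pass algorithm: compute the mean and the
    population variance of a stream in a single pass with O(1)
    memory, without storing the values."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    variance = m2 / n if n else 0.0
    return mean, variance
```

Standard deviation and mean deviation can be derived similarly, so adding these properties should not complicate the generation pipeline.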
The paper is well written, lucid, and comprehensible. The terminology used is technically correct and the specification seems to be valid. The paper is well organized and the purpose of this Linked Dataset becomes clear. Nevertheless, the paper could benefit from some introductions and more references. For instance, “SW algorithms” are mentioned in the use-case section but neither explained nor cited; even though SW presumably stands for Semantic Web, it could also mean the Smith–Waterman algorithm. Likewise, VoID is referenced in Section 3 but already used in Section 2.
The paper provides all required information, such as name, URL, versioning, licensing, and availability, which in particular is stated in the vocabulary itself. The introduced vocabulary in turn re-uses other established vocabularies, e.g. PROV-O, HTTP, and the Error Ontology. Since dumps are available on a regular basis and application developers can query the data via SPARQL, the lack of concrete applications using this vocabulary is of no consequence. However, the paper would benefit from some more evaluation results, similar to the maximum in-degree distribution figure, in order to underline the importance of the introduced vocabulary. In terms of the Five Stars of Linked Data Vocabulary Use, the authors are right to classify their vocabulary as 4/5 stars, since it is not yet linked by other vocabularies.
Paper strengths:
• Novel LOD Meta Dataset representation for comparison purposes based on a sophisticated requirements analysis
• Clear specification including new properties for statistical evaluations
• Available sources: Vocabulary (Creative Commons 3), code on GitHub and data dumps on a daily updated basis
Paper weaknesses:
• Little evaluation of mined and compared meta-data
• Lack of some introductions and references
• Not yet re-used by other vocabularies (5th star of Linked Data Vocabulary Use)
I would recommend minor revisions and ask the authors to update the paper on the points mentioned above.