Meta-Data for a lot of LOD

Tracking #: 1115-2327

Authors: 
Laurens Rietveld
Wouter Beek
Stefan Schlobach

Responsible editor: 
Aidan Hogan

Submission type: 
Dataset Description
Abstract: 
This paper introduces the LOD Laundromat Meta-Dataset, a continuously updated Meta-Dataset of the LOD Laundromat, tightly connected to the corresponding (re)published datasets that are crawled and cleaned by the LOD Laundromat [5]. The Meta-Dataset contains structural information for over 38 billion triples (and growing). While dataset meta-data is traditionally often missing, incomplete, or incomparable in the way it was generated, the LOD Laundromat Meta-Dataset provides a wide variety of structural dataset properties using standardized vocabularies. This makes it a particularly useful dataset for data comparison and analytics, as well as for the global study of the Web of Data.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Christopher Krauss submitted on 11/Aug/2015
Suggestion:
Accept
Review Comment:

Updated Review: The paper introduces the LOD Laundromat Meta-Dataset (formerly the Clean and Linked Open Data, C-LOD, Meta-Dataset), which allows comparable dataset descriptions to be published. This novel vocabulary is disseminated as LOD and offers algorithmically generated information for currently about 38 billion triples in 650,000 documents – all crawled by the LOD Laundromat.
The authors point out that existing datasets, such as LODStats and Sindice, show a lack of comparable meta-data, leave a lot of room for interpretation, and/or ignore information on the meta-data creation process. Therefore, a requirements analysis results in 7 key aspects for the definition of the new LOD Meta-Dataset, which are prioritized afterwards. Key characteristics and novel meta-data properties are introduced, explained in detail, and comparatively discussed by delimiting the introduced dataset from other specifications, such as VoID, Bio2RDF, and VoID-ext. The paper concludes with a summary and two possible directions for future work regarding a more efficient creation of the Meta-Dataset.
This contribution is relevant to the community and has high potential usefulness, as it offers a novel way of accessing recent descriptive meta-data on publicly available datasets that can be used for comparison and analysis purposes. The paper is well written, lucid, and comprehensible. The terminology used is technically correct and the specification seems to be valid. The paper structure is well organized and the purpose of this Linked Dataset becomes clear. In contrast to the last version, the paper benefits from better introductions and more evaluation results.
All required information (name, URL, versioning, licensing, and availability) is given in the paper and especially stated in the vocabulary, which in turn re-uses other established vocabularies – e.g. PROV-O, HTTP, and the Error Ontology. In terms of the Five Stars of Linked Data Vocabulary Use, the authors classified their vocabulary as 4/5 stars, since this vocabulary is not linked by other vocabularies yet.
Paper strengths:
- Novel LOD Meta Dataset representation for comparison purposes based on a sophisticated requirements analysis
- Clear specification including new properties for statistical evaluations
- Available sources: Vocabulary (Creative Commons 3), code on GitHub, and data dumps updated daily
Paper weaknesses:
- Not re-used by other vocabularies so far (5th star of Linked Data Vocabulary Use)
I would encourage accepting the paper.

Review #2
By Juergen Umbrich submitted on 20/Aug/2015
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description paper to provide details about the used vocabularies; ideally using the 5 star rating provided here.

Review #3
By Aidan Hogan submitted on 20/Aug/2015
Suggestion:
Minor Revision
Review Comment:

Compared to the previous version "LOD in a Box: The C-LOD Meta-Dataset", the authors have made significant improvements. The clarity of the writing has improved a lot and the readability of the paper has improved likewise. Various details of the system have been clarified, and its relation to the LOD Laundromat is now much clearer. References are provided for all the works reused in the paper and seem to be much more complete. For this reason, I recommend an accept with the following (very) minor revisions.

MINOR COMMENTS:

* "For example, Bio2RDF ... to from each predicate." How Bio2RDF extends/amends VoID was not clear to me from the text. Please try to explain this better. Also is the information about the number of entities linked to/from a given predicate not already included in the property partitions of VoID in combination with distinct-subject, distinct-object?

* "Firstly, several definitions of meta-data property are ambiguous. As an example we take the VoiD property "void:properties" which ought to denote the number of distinct RDF properties that occur in a dataset. ..." The original authors of VoID clearly define what void:properties refers to in both official documents [1,2]: "number of properties – The total number of distinct properties in a void:Dataset. In other words, the number of distinct resources that occur in the predicate position of triples in the dataset." (A similar operational definition is given for "void:classes".) The ambiguity that the authors claim in the paper is thus simply not there in the VoID specification. The answer for the example they give is unambiguously 4. The authors should remove any claims that void:properties is ambiguous or suggesting that it is unclear whether or not properties not appearing as a predicate should be counted (they clearly should not). Instead, you could perhaps argue that it is interpreted incorrectly in some datasets or tools (if you can point to examples). An easier solution would be to point to the fact that "void:entities" is ambiguous for the reasons you mention (which I understand was left that way by design), and that something like "void:uriLookupEndpoint" cannot be determined from the data and typically needs to be specified manually.
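As a toy illustration of this unambiguous reading (the graph below is made up for illustration, not the example from the paper under review): void:properties counts the distinct resources in the predicate position, irrespective of how many terms are merely declared to be properties.

```python
# Made-up toy graph as (subject, predicate, object) triples; IRIs abbreviated.
triples = [
    ("ex:a", "rdf:type",     "rdf:Property"),
    ("ex:b", "rdf:type",     "rdf:Property"),
    ("ex:a", "rdfs:label",   '"a"'),
    ("ex:a", "rdfs:comment", '"some property"'),
    ("ex:b", "owl:sameAs",   "ex:a"),
]

# Per the VoID spec, void:properties = the number of distinct resources
# occurring in the predicate position of triples in the dataset.
void_properties = len({p for (_, p, _) in triples})

# ex:a and ex:b are *declared* as properties but never used as predicates,
# so they do not count: the result here is 4
# (rdf:type, rdfs:label, rdfs:comment, owl:sameAs).
print(void_properties)
```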

* It would be useful to add links for some example Linked Data IRIs to the paper with some description of the naming scheme, e.g.:
http://lodlaundromat.org/resource/dcefd287bd0c7d4efb543c7260b6dcf8
http://lodlaundromat.org/resource/dcefd287bd0c7d4efb543c7260b6dcf8/metrics
http://lodlaundromat.org/resource/dcefd287bd0c7d4efb543c7260b6dcf8/metri...
etc.
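The shape of these IRIs (a 32-character document hash under /resource/, with /metrics appended for the metrics description) could be sketched as below; note that deriving the hash via MD5 of a seed string is an assumption made only for illustration, not the documented LOD Laundromat minting scheme.

```python
import hashlib

BASE = "http://lodlaundromat.org/resource/"

def resource_iri(seed: str) -> str:
    # Hypothetical helper: mint a document IRI from a 32-character hex
    # digest, mirroring the shape of the example IRIs above.
    return BASE + hashlib.md5(seed.encode("utf-8")).hexdigest()

def metrics_iri(doc_iri: str) -> str:
    # The per-document metrics description hangs off the document IRI.
    return doc_iri + "/metrics"

doc = resource_iri("http://example.org/data.ttl")
```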

TYPOS, etc.:

Throughout:
* "VoiD" -> "VoID" seems to be how the authors stylise it

Abstract:
* "38 billion billion"

Section 3.1:
* "higher level. I.e., it includes" -> "higher level; i.e., it includes"
* "ambigue" -> "ambiguous"

Section 4.1:
* "Considering [that] only" (For readability rather than grammar)
* I hope you don't mind me saying this but as far as large grey rectangles go, I think Figure 1 might be one of the ugliest I've ever seen. In addition, it doesn't contain ticks on the x-axis, the origin isn't clear, it probably doesn't need to span two columns, etc. Could you clean it up just a little bit and make it somehow less depressing to look at?
* Table 1, caption should go above the table I think? Check the journal style. You should also add to the caption something to indicate that only properties computable from an RDF dataset without manual intervention are considered, since VoID contains a bunch of other properties not included (e.g., open search description, link sets, etc.). Also watch the capitalisation in the meta-data property names.

Section 4.4:
"over 650,000 containing" Clarify over 650,000 _what_.

References:
The style of references needs to be cleaned up, especially since this is a journal paper. For example, #1, "LDOW" is not enough details for a workshop. Words such as "rdf" appear in lower-case. Sometimes the publisher (e.g., Springer) is given, sometimes it's not. Sometimes pages are given, sometimes not. On reference #11, the publisher is given as Citeseer. #3 uses "et al." for six authors while #14 lists all ten authors. Also remember to update reference #18 accordingly. In general, clean-up and thoroughly revise the reference style/details.

[1] http://vocab.deri.ie/void
[2] http://www.w3.org/TR/void/


Comments

These were the full comments of Review 2, which were accidentally posted in the Comments for Editor field but intended to be made public:
--------------------------------------------------------------------------------

This paper presents a dataset containing typical meta-data about a large number of RDF datasets from the LOD Laundromat project.
The purpose and value of this dataset is to provide uniform statistics about a possible subset of the Web of Data by applying the same set of algorithms over cleaned input data. It thus provides comparable and objective statistics which can be used in other projects to filter and select datasets (e.g. by the number of triples, the occurrence of certain properties/classes, or other features such as in/outdegree) or to compare datasets.
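Such filtering could, for example, look like the following sketch, where the per-document records and property names are invented for illustration rather than taken from the actual vocabulary:

```python
# Hypothetical per-document metadata records, shaped roughly as the
# Meta-Dataset might expose them (field names are illustrative only).
datasets = [
    {"doc": "doc1", "triples": 1_200_000, "distinct_properties": 54},
    {"doc": "doc2", "triples": 300,       "distinct_properties": 3},
    {"doc": "doc3", "triples": 9_800_000, "distinct_properties": 210},
]

# Select only documents with at least one million triples.
large = [d["doc"] for d in datasets if d["triples"] >= 1_000_000]
```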

SWJ dataset evaluation criteria:
(1) Quality and stability of the dataset -
I succeeded in downloading the dataset on several occasions, with the correct date for the day of download.
I judge the quality as good considering that the dataset is created by the same algorithms.

(2) Usefulness of the dataset,
I think the authors provide a good motivation for the usefulness of one homogenized meta dataset of a large amount of the RDF/LOD data on the Web.

>which should be shown by corresponding third-party uses - evidence must be provided.

The paper mentions only use of the dataset by the LOD Laundromat group and one third-party use case (preflabel).

Maybe the dataset is very new and as such there is not much use by third parties yet. Another reason might be that this dataset is also not well advertised (I could not find the link to the meta dataset dump easily ;) )

(3) Clarity and completeness of the descriptions.

I think the paper does provide a good level of detail about the dataset but could benefit from more information about the structure and internal modeling.

This would help to further promote and motivate the use of the MetaDataset by third-parties.

For instance, a visualisation of the general structure of the meta-data description would help, e.g. as a "pseudo" UML or graph diagram.
In fact, only after browsing the dataset was I able to extract the structure and connections between metrics, dataset, and used software (a diagram or even example queries would have been very useful).

I assume that the dataset is very well structured, and it would be good to provide some statistics about the meta-dataset as well, such as
the average number of metric values per dataset if this is not equal for all datasets, etc. (maybe a high-level meta-meta-dataset description ;) )

Overall, I think the paper addressed 2 of the 3 criteria in a sufficient way but could be improved in clarity and completeness.
Some of the information is easy to understand for SW experts who also deal with meta-data descriptions such as VoID, DCAT, etc.
However, people less familiar with the domain might have more difficulty understanding the need for such a dataset and how to exploit the resulting corpus. For instance, one thing that was not entirely clear to me is the discussion about dataset partitions, and also that this should not be published as SPARQL.

Another point which could be elaborated more is the creation and update process of the dataset. Maybe providing some statistics about how long it takes to process a dataset (e.g. triples per second) and how a new version is generated would help to understand how accurate a version of the meta dataset is.

Some more comments:

++ ABSTRACT ++

The abstract could be clearer and could contain more details about the paper. Maybe highlight one or two structural properties this dataset provides in comparison to the original datasets (e.g. number of statements, etc.).
I would also mention that the dataset is available in RDF, even if it is obvious to most of the readers.

++ INTRODUCTION

"make innovative use of meta-data values in which algorithms" -> can you provide references for this claim?
While I was reading the introduction, I was wondering what kind of meta-data properties you might refer to as being incorrect, missing, or outdated. I would suggest providing more details about the meta-data properties referred to in the text.

++ META-DATA REQUIREMENTS

+Section 3.1

The example might be clear to people familiar with RDF and RDFS. However, I would again suggest providing more details about the example and also maybe highlighting the 4 predicate terms (e.g. in bold) and the 9 rdfs:Properties (e.g. underlined).
The same applies for the owl:sameAs example. Maybe add a sentence that these statements would reduce the actual number of distinct rdfs:Properties to 8 due to the semantics of owl:sameAs.

++ Section 4

Table 1: Maybe provide prefixes and namespaces for the properties

+ Model

I think one crucial piece of information which should be mentioned in the paper is the time dimension of the meta-data for a dataset.
The authors already mentioned the point that meta-data descriptions are sometimes outdated, and that the LOD Laundromat fixes this issue by harvesting one dump, adding a timestamp, and computing the meta-data for that dataset and timestamp.

Maybe this crucial point should be made more prominent in this section as part of the provenance discussion.

+ Dissemination

It would be great to provide more details about the generation and dissemination process.

The stream approach should be linear in the size of a dataset, and maybe you can provide some performance statistics about the average processing time per statement.

Also, some insights about the update strategy would be very interesting. How up-to-date are the dump and the SPARQL content? I assume that the dump is just an extraction of the endpoint content.

+Dataset statistics:

When exactly was the MetaDataset released?

Do the numbers refer to accesses of the provided data dump at http://download.lodlaundromat.org/dump.nt.gz, or to dumps in general?

Could you maybe provide more details about who accessed the servers (e.g. by country), or whether the accesses are stable, increasing, or decreasing over time?

Also, a quick outlook on the diversity of the queries could be interesting (e.g. by briefly comparing the WHERE clauses).
"... crawled and republished over 650.000 (???)" containing over 38 000 000 triples -> missing word, and I think it was 38 billion triples?

Just for clarification and to avoid confusion: the authors write that the meta dataset is published in the same format as the other LOD Laundromat datasets.

The meta data dump is published as an nt.gz file, but my understanding is that the LOD Laundromat also provides datasets as HDT files?

++ Future work

My understanding is that the meta dataset is computed once a day; the authors could also provide previous versions of the meta-data snapshots.

==== Minor improvements

Abstract:
I would suggest rephrasing the first sentence to reduce the number of times LOD Laundromat is mentioned.

Introduction:
"very many" -> this expression seems correct (considering my limited knowledge of the English language)