Review Comment:
Compared to the previous version "LOD in a Box: The C-LOD Meta-Dataset", the authors have made significant improvements. The clarity and readability of the writing have improved considerably. Various details of the system have been clarified, and its relation to the LOD Laundromat is now much clearer. References are provided for all the works reused in the paper and seem much more complete. For this reason, I recommend an accept with the following (very) minor revisions.
MINOR COMMENTS:
* "For example, Bio2RDF ... to from each predicate." How Bio2RDF extends/amends VoID was not clear to me from the text. Please try to explain this better. Also, is the information about the number of entities linked to/from a given predicate not already included in the property partitions of VoID, in combination with distinct-subject and distinct-object counts?
* "Firstly, several definitions of meta-data property are ambiguous. As an example we take the VoiD property "void:properties" which ought to denote the number of distinct RDF properties that occur in a dataset. ..." The original authors of VoID clearly define what void:properties refers to in both official documents [1,2]: "number of properties – The total number of distinct properties in a void:Dataset. In other words, the number of distinct resources that occur in the predicate position of triples in the dataset." (A similar operational definition is given for "void:classes".) The ambiguity that the authors claim in the paper is thus simply not there in the VoID specification. The answer for the example they give is unambiguously 4. The authors should remove any claims that void:properties is ambiguous or suggestions that it is unclear whether properties not appearing as a predicate should be counted (they clearly should not). Instead, you could perhaps argue that it is interpreted incorrectly in some datasets or tools (if you can point to examples). An easier solution would be to point to the fact that "void:entities" is ambiguous for the reasons you mention (which I understand was left that way by design), and that something like "void:uriLookupEndpoint" cannot be determined from the data and typically needs to be specified manually.
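To make this point concrete, here is a minimal sketch (in plain Python; the toy triples are mine, not the example from the paper under review) of the operational definition of void:properties from the VoID specification:

```python
# Toy RDF dataset as (subject, predicate, object) triples.
# Per the VoID spec, void:properties counts distinct resources
# occurring in the *predicate* position only.
RDF_TYPE = "rdf:type"
triples = [
    ("ex:a", "ex:p1", "ex:b"),
    ("ex:a", "ex:p2", '"x"'),
    ("ex:b", "ex:p3", "ex:c"),
    ("ex:c", "ex:p4", "ex:d"),
    # ex:p5 is *declared* as a property but never used as a predicate,
    # so it does not count towards void:properties.
    ("ex:p5", RDF_TYPE, "rdf:Property"),
]

void_properties = len({p for _, p, _ in triples})
print(void_properties)  # ex:p1..ex:p4 plus rdf:type -> 5
```

Note that the declared-but-unused ex:p5 is excluded, while rdf:type is included because it occurs as a predicate; this is exactly the unambiguous reading the specification prescribes.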
* It would be useful to add links for some example Linked Data IRIs to the paper with some description of the naming scheme, e.g.:
http://lodlaundromat.org/resource/dcefd287bd0c7d4efb543c7260b6dcf8
http://lodlaundromat.org/resource/dcefd287bd0c7d4efb543c7260b6dcf8/metrics
http://lodlaundromat.org/resource/dcefd287bd0c7d4efb543c7260b6dcf8/metri...
etc.
TYPOS, etc.:
Throughout:
* "VoiD" -> "VoID" seems to be how the authors stylise it
Abstract:
* "38 billion billion"
Section 3.1:
* "higher level. I.e., it includes" -> "higher level; i.e., it includes"
* "ambigue" -> "ambiguous"
Section 4.1:
* "Considering [that] only" (For readability rather than grammar)
* I hope you don't mind me saying this but as far as large grey rectangles go, I think Figure 1 might be one of the ugliest I've ever seen. In addition, it doesn't contain ticks on the x-axis, the origin isn't clear, it probably doesn't need to span two columns, etc. Could you clean it up just a little bit and make it somehow less depressing to look at?
* Table 1: the caption should go above the table, I think? Check the journal style. You should also add something to the caption to indicate that only properties computable from an RDF dataset without manual intervention are considered, since VoID contains a bunch of other properties not included (e.g., open search description, link sets, etc.). Also watch the capitalisation in the meta-data property names.
Section 4.4:
"over 650,000 containing" Clarify over 650,000 _what_.
References:
The style of references needs to be cleaned up, especially since this is a journal paper. For example, in #1, "LDOW" is not enough detail for a workshop. Words such as "rdf" appear in lower case. Sometimes the publisher (e.g., Springer) is given, sometimes it is not. Sometimes pages are given, sometimes not. In reference #11, the publisher is given as Citeseer. #3 uses "et al." for six authors while #14 lists all ten authors. Also remember to update reference #18 accordingly. In general, clean up and thoroughly revise the reference style/details.
[1] http://vocab.deri.ie/void
[2] http://www.w3.org/TR/void/
Comments
Full Review 2
These were the full comments of Review 2, which were accidentally posted in the Comments for Editor field but intended to be made public:
--------------------------------------------------------------------------------
This paper presents a dataset containing typical meta-data about a large number of RDF datasets from the LOD Laundromat project.
The purpose and value of this dataset is to provide uniform statistics over a large subset of the Web of Data by applying the same set of algorithms to cleaned input data, thus providing comparable and objective statistics that can be used in other projects to filter and select datasets (e.g. by the number of triples, the occurrence of certain properties/classes, or other features such as in-/outdegree) or to compare datasets.
SWJ dataset evaluation criteria:
(1) Quality and stability of the dataset -
I succeeded in downloading the dataset on several occasions, with the correct date for the day of download.
I judge the quality as good, considering that the dataset is created by the same set of algorithms.
(2) Usefulness of the dataset,
I think the authors provide a good motivation for the usefulness of one homogenized meta dataset of a large amount of the RDF/LOD data on the Web.
>which should be shown by corresponding third-party uses - evidence must be provided.
The paper mentions only uses of the dataset by the LOD Laundromat group itself and one third-party use case (preflabel).
Maybe the dataset is simply very new and as such there has not been much third-party use yet. Another reason might be that the dataset is not well advertised (I could not find the link to the meta dataset dump easily ;) )
(3) Clarity and completeness of the descriptions.
I think the paper does provide a good level of detail about the dataset, but it could benefit from more information about the structure and internal modelling.
This would help to further promote and motivate the use of the MetaDataset by third parties.
For instance, a visualisation of the general structure of the meta-data description would help, e.g. as a "pseudo" UML or graph diagram.
In fact, only after browsing the dataset was I able to extract the structure and connections between metrics, dataset, and used software (a diagram or even example queries would have been very useful).
I assume that the dataset is very well structured, and it would be good to provide some statistics about the metadataset as well, such as the average number of metric values per dataset, if this is not equal for all datasets, etc. (maybe a high-level meta-metadataset description ;) )
Overall, I think the paper addressed 2 of the 3 criteria in a sufficient way but could be improved in the clarity and completeness.
Some of the information is easy to understand for SW experts who also deal with meta-data descriptions such as VoID, DCAT, etc.
However, people less familiar with the domain might have more difficulty understanding the need for such a dataset and how to exploit the resulting corpus. For instance, one thing that was not entirely clear to me is the discussion about dataset partitions, and also why these should not be published via SPARQL.
Another point that could be elaborated further is the creation and update process of the dataset. Maybe providing some statistics about how long it takes to process a dataset (e.g. triples per second) and how a new version is generated would help readers understand how accurate a version of the meta dataset is.
Some more comments:
++ ABSTRACT ++
The abstract could be clearer and could contain more details about the paper. Maybe highlight one or two structural properties this dataset provides in comparison to the original datasets (e.g. number of statements, etc.).
I would also mention that the dataset is available in RDF, even if this is obvious to most readers.
++ INTRODUCTION
"make innovative use of meta-data values in which algorithms" -> can you provide references for this claim?
While I was reading the introduction, I was wondering what kind of meta-data properties you might refer to as being incorrect, missing, or outdated. I would suggest providing more details about the meta-data properties referred to in the text.
++ META-DATA REQUIREMENTS
+Section 3.1
The example might be clear to people familiar with RDF and RDFS. However, I would again suggest providing more details about the example, and maybe highlighting the 4 predicate terms (e.g. bold) and the 9 rdfs:Properties (e.g. underline).
The same applies to the owl:sameAs example. Maybe add a sentence noting that these statements would reduce the actual number of distinct rdfs:Properties to 8, due to the semantics of owl:sameAs.
++ Section 4
Table 1: Maybe provide prefixes and namespaces for the properties
+ Model
I think one crucial piece of information that should be mentioned in the paper is the time dimension of the meta-data for a dataset.
The authors already mentioned that meta-data descriptions are sometimes outdated, and that the LOD Laundromat fixes this issue by harvesting one dump, adding a timestamp, and computing the meta-data for that dataset and timestamp.
Maybe this crucial point should be made more prominent in this section, as part of the provenance discussion.
+ Dissemination
It would be great to provide more details about the generation and dissemination process.
The streaming approach should scale linearly with the size of a dataset, and maybe you can provide some performance statistics about the average processing time per statement.
Also, some insights about the update strategy would be very interesting. How up-to-date are the dump and the SPARQL content? I assume that the dump is just an extraction of the endpoint content.
+Dataset statistics:
When exactly was the MetaDataset released?
Do the numbers refer to accesses of the provided data dump at http://download.lodlaundromat.org/dump.nt.gz, or to dumps in general?
Could you maybe provide more details about who accessed the servers (e.g. by country), or whether the accesses are stable, increasing, or decreasing over time?
Also, a quick outlook on the diversity of the queries could be interesting (e.g. by briefly comparing the WHERE clauses).
"... crawled and republished over 650.000 (???) containing over 38 000 000 triples" -> missing word, and I think it was 38 billion triples?
Just for clarification and to avoid confusion: the authors write that the meta dataset is published in the same format as the other LOD Laundromat datasets.
The meta-data dump is published as an nt.gz file, but my understanding is that the LOD Laundromat also provides datasets as HDT files?
++ Future work
My understanding is that the meta dataset is computed once a day; the authors could then also provide previous versions of the meta-data snapshots.
==== Minor improvements
Abstract:
I would suggest rephrasing the first sentence to reduce the number of times LOD Laundromat is mentioned.
Introduction:
"very many" -> this expression seems correct (considering my limited knowledge of the English language)