Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.
In my opinion, the paper addresses the interesting challenge of representing Wikidata constraints in SHACL Core or SPARQL. My assessment along the three dimensions is as follows:
1.- Originality. As far as I know, the work presented is original and has only been partially presented by the authors at a Wikidata workshop.
2.- Significance of the results. The results are significant for the Semantic Web community, as Wikidata is nowadays a very important community-driven knowledge graph with some peculiarities that are worth studying and analyzing.
3.- Quality of writing. The authors have put considerable work into making the paper readable by people interested in both the Wikidata data model and the way it can be validated using SHACL or SPARQL.
The resources are also available on GitHub.
The authors have addressed most of my previous comments, and I think the paper is close to acceptance, pending the following minor revisions:
- The authors repeatedly use the word "proprietary" to refer to the Wikidata reification data model. I am not sure this is a good choice: at least to me, "proprietary" in the context of software reads as the opposite of open, as in "proprietary versus open source software", and as far as I know most things in Wikidata follow an open source tradition. Perhaps the authors mean that the reification model is specific to Wikidata? If so, replacing "proprietary" with "specific", "custom", or "Wikidata's own reification model" would be better. This is only a suggestion, as I am not a native English speaker, so the authors may be right that "proprietary" is the best word.
- In section 2.1, "Data modeling and Wikidata", the authors present a tutorial on the Wikidata data model, which I appreciate. However, the tutorial should be revised, because some concepts are mentioned without explanation or out of sequence. For example, the authors start by introducing RDF using a graph G based on figure 1, which is hard to justify to a reader because it is quite a complex representation, with all those long wds:... URIs. I would suggest introducing the Wikidata data model first, and only then briefly describing RDF and the RDF serialization used by Wikidata.
- Something similar happens on page 8, where the authors explain the preferred and normal ranks using SPARQL and the Wikidata Query Service, neither of which has been introduced at that point. If the authors first explain the Wikidata data model, and then the RDF serialization and SPARQL, the content would be easier to follow.
- Following the previous suggestion, the authors could also briefly mention that the RDF serializations of Wikidata are exposed in three ways: through the Wikidata Query Service, in the RDF dumps (mentioned on page 9, line 47), and when retrieving information about an item in Turtle via content negotiation, e.g. the result of running
curl -L -H "Accept:text/turtle" https://www.wikidata.org/entity/Q615
As far as I know there are some slight differences between those RDF serializations:
- https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
- https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_da...
The authors present a finding in section 2.4.4 as a surprise, but it may be a design choice of Wikidata that could be explained by the custom RDF serialization mechanism employed.
- Page 6, line 30, "URIs represent hashes for "anonymous" reified statement and quantity value nodes..."
- Page 17: "although subproperty of (P1647) is considered the pendant of subclass of (P279) when it comes to the hierarchy of properties". I was surprised by the word "pendant" here. Again, it may be correct, but I would ask the authors to double-check that it is the right word.
- Page 39, line 37, "were proposed before SHACL became a standard" is wrong as SHACL is not a standard but a W3C recommendation. It should be: "were proposed before SHACL became a W3C recommendation".