Managing linked open data in Wikidata using the W3C Generating RDF from Tabular Data on the Web Recommendation

Tracking #: 2659-3873

Authors: 
Steve Baskauf
Jessica K. Baskauf

Responsible editor: 
Guest Editors KG Validation and Quality

Submission type: 
Full Paper
Abstract: 
The W3C Generating RDF from Tabular Data on the Web Recommendation provides a mechanism for mapping CSV-formatted data to any RDF graph model. Since the Wikibase data model on which Wikidata is built can be expressed as RDF, this Recommendation can be used to document tabular snapshots of parts of the Wikidata knowledge graph in a simple form that is easy for humans and applications to read. Those snapshots can be used to document how subgraphs of Wikidata have changed over time and can be compared with the current state of Wikidata using its Query Service to detect vandalism and value added through community contributions.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jakub Klimek submitted on 11/Feb/2021
Suggestion:
Reject
Review Comment:

The paper describes an approach to representing snapshots of small subgraphs of the RDF representation of Wikidata in CSV files. The files are accompanied by a standard descriptor using the W3C Generating RDF from Tabular Data on the Web Recommendation (CSVW), ensuring that the subgraph can be automatically reconstructed from the CSV files. The targeted use cases are archival of the snapshots and version comparison, either for detecting and reverting vandalism or for detecting data added by the community. The presented approach is motivated by the simplicity, readability, and editability of CSV files in common spreadsheet editors and by their versioning in common version control systems.
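To make the mechanism concrete, a minimal sketch of such a CSVW descriptor is shown below, written as a Python dict and serialized to JSON. The file name and the two columns (a QID column and an English-label column) are hypothetical illustrations, not taken from the paper.

    import json

    # Minimal sketch (not the authors' actual descriptor): a CSVW metadata
    # description that maps a hypothetical two-column CSV file -- columns
    # "qid" and "label_en" -- to Wikidata item IRIs and rdfs:label triples.
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "snapshot.csv",                     # hypothetical CSV file name
        "tableSchema": {
            "columns": [
                {
                    "name": "qid",
                    "titles": "qid",
                    # Subject IRI for every triple generated from this row.
                    "aboutUrl": "http://www.wikidata.org/entity/{qid}",
                    "suppressOutput": True         # this column emits no triple itself
                },
                {
                    "name": "label_en",
                    "titles": "label_en",
                    "aboutUrl": "http://www.wikidata.org/entity/{qid}",
                    "propertyUrl": "rdfs:label",
                    "lang": "en"                   # literal language is fixed per column
                }
            ]
        }
    }

    with open("snapshot.csv-metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)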

While the paper is easy to read and follow, the contribution itself seems weak, loosely specified, and unconvincing for a full research paper to be published in the SWJ.

My main concerns are regarding the originality and significance of the results.

The CSVW representation of the monitored Wikidata RDF can be viewed like any other RDF serialization, being most similar to JSON-LD with a custom context, which likewise maps a regular JSON structure to RDF. The only difference here is the base format, which is CSV. Yet more than half of the paper deals with introducing the Wikidata RDF data model and with a fairly straightforward account of what a subset of that model looks like in CSVW. Moreover, Section 3 describes different ways the CSVW serialization can be used like any other RDF serialization, raising even more questions about the advantages of storing the data in CSVW.

Finally, the use cases in Section 4 could be served by any other RDF serialization as well, and possibly even more conveniently, as most of them start with deserialization of the CSV files into an RDF graph anyway, e.g. to be loaded into a triplestore and queried using SPARQL.

The only advantage of storing the data in CSV would be warranted by a strong CSV-oriented use case, but none can be found in the paper. On the contrary, from the CSV examples in the appendix it is clear that, even though such CSV files can definitely be loaded in a spreadsheet editor, they would still be very hard for human users to read, not to mention to write.

Another aspect discussed in the paper is the file size of the CSVW serialization compared to Turtle. The authors claim that, when compressed, the CSVW serialization is half the size of the Turtle one. However, the approach itself is limited to small subgraphs anyway, so size should not matter much. If it does matter, the comparison would need to include at least HDT, a binary RDF serialization that greatly reduces RDF dump file sizes while maintaining basic searchability.

Regarding the quality of writing, the paper is completely missing a related work section in which it would be compared to other existing approaches that could serve the mentioned use cases. For instance, regarding version control, there are approaches to versioning RDF directly (see https://github.com/AKSW/QuitStore or https://github.com/rdfostrich/ostrich), which could be used instead of CSVW.

Also missing is any kind of evaluation or user feedback. For example, was the approach used in practice? Did the users appreciate the data being in CSV?

Overall, the paper reads like something between a vision paper and a demo paper rather than a full research paper to be published in a journal.

More major issues:
1. A GUI tool is mentioned in section 3; however, no screenshot is supplied, so it is hard to imagine how usable the tool might be.
2. There is a performance evaluation on page 8 stating that the conversion with rdf-tabulator was done under macOS 11.0.1 using a 2.3 GHz quad-core processor with 8 GB of memory. If this evaluation is to be interpreted as reproducible, this system specification is insufficient. The exact type of the processor is unspecified; while the clock frequencies of processors have remained roughly the same for more than 10 years now, there are significant differences among their generations and manufacturers. The type of the storage unit is also unspecified, which is again significant information when working with files on disk. On the other hand, with more and more machines running in the cloud, it is often both impossible and unnecessary to specify the exact configuration. I would suggest that the authors present the time difference in percent rather than exact time in seconds, partially avoiding these issues.
3. When writing about the compression of dumps, it would be necessary to provide exact algorithm settings. “Compressed as .zip” is insufficient.

Minor issues:
1. In the abstract, “is built can be expressed” needs rephrasing.
2. I would suggest syntax-highlighting the JSON parts of the paper.
3. It is rdf-tabular, not rdf-tabulator.

Review #2
By John Samuel submitted on 15/Feb/2021
Suggestion:
Major Revision
Review Comment:

In the article “Managing linked open data in Wikidata using the W3C Generating RDF from Tabular Data on the Web Recommendation”, the authors propose tabular snapshots of subgraphs of Wikidata to ensure human readability and to document possible changes over time. This evolution can later be used to comprehend community contributions over time and to detect vandalism.

The article, composed of five sections and an appendix, presents the motivation, the proposed method based on the W3C Recommendation, its application to Wikidata subgraphs, possible applications, and the conclusion. The appendix gives some detailed examples, which may help readers further understand the proposed method.

The use of tabular data as an input to software has become a topic of research in both academia and industry. It is particularly interesting in the case of Wikidata, considering contributors from different domains who may not have expertise in linked data or formats like RDF. Some of the possible users include GLAM contributors (as also stated by the authors). Hence, this article is of interest to researchers and the semantic web community.

However, the authors have missed discussing other related works in this domain. This is a major shortcoming of the article. Researchers and Wikidata contributors have developed several tools for the detection of vandalism, and some of this research has also been published. Wikidata also supports property constraints [1]. Wikidata has recently integrated ShEx (Shape Expressions) [2,3] for RDF validation, which can also be used to monitor possible changes, in particular in subgraphs. None of these works and possible approaches is mentioned in section 1 or section 4.

The authors briefly discuss the QuickStatements format [4] in section 4. QuickStatements is one of the tools commonly used by Wikidata contributors and makes use of a tabular format. Based on my understanding of the paper, I feel that the proposed approach could be extended to support RDF generation from the QuickStatements format. Another tool, OpenRefine [5] (used by GLAM institutions), also supports tabular formats and can be used to create and update Wikidata entities. A discussion of some of the existing tools based on tabular formats may help readers compare and understand the limitations of the existing approaches.
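For readers unfamiliar with it, QuickStatements V1 commands are themselves tab-separated rows of subject, property, and value, which is why tabular sources map onto the tool so directly. A tiny sketch (using the Wikidata sandbox item and made-up values, not data from the paper):

    # Illustrative only: emitting QuickStatements V1 commands from tabular rows.
    # Q4115189 is the Wikidata sandbox item; the statements are made up.
    rows = [
        ("Q4115189", "P31", "Q5"),                    # add: instance of (P31) human (Q5)
        ("Q4115189", "P1476", 'en:"A sample title"'), # add: title (P1476), monolingual text
    ]

    for subject, prop, value in rows:
        print("\t".join((subject, prop, value)))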

There are some minor remarks related to the article's formatting. The SPARQL queries need to be labeled, and a brief description of each should be added.

The article also lacks an architecture diagram presenting an overall view of the proposed approach and its development (section 3).

[1]: https://www.wikidata.org/wiki/Help:Property_constraints_portal
[2]: https://shex.io/shex-primer/
[3]: https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas
[4]: https://www.wikidata.org/wiki/Help:QuickStatements
[5]: https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine

Review #3
By Tom Baker submitted on 15/Feb/2021
Suggestion:
Accept
Review Comment:

This elegantly and carefully written paper describes a method for capturing snapshots of user-defined subgraphs of Wikidata and explains how these snapshots can be used both to detect changes (including vandalism) and to prepare updates to Wikidata.

The work applies the W3C Recommendation "Generating RDF from Tabular Data on the Web" (CSV2RDF), which describes how to annotate the columns of a CSV file in a separate JSON metadata description file. The JSON metadata is used for converting tabular data into RDF triples.

Section 1 provides a good summary of the Wikibase data model, and Section 2 clarifies that the CSV model quite reasonably considers less commonly used features of the Wikibase data model, such as ranks, lexemes, and normalized values, to be out of scope.

The reader appreciates how this paper fills in some details about how Wikidata works -- for example, about how unique identifiers for references and value nodes are generated by hashing their property-value pairs. (I'm curious whether "the same property" P31 really is the same if represented with two predicates in two namespaces.)

Section 3 explains how CSV tables can be used to manage data in Wikidata. Identifiers and hashes can be used to identify where data has remained unchanged and should therefore not be written back to Wikidata. CSV is attractive as an archival format because it is compact yet transformable into RDF. Archived snapshots can be compared among themselves, or with Wikidata, using the SPARQL keyword MINUS.
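As a rough sketch of that comparison step (file names are hypothetical, and this is not the authors' own code), two snapshot graphs reconstructed from the CSV files can also be diffed locally with rdflib; the set difference computed here plays the same role as the SPARQL MINUS comparison described in the paper.

    # Hedged sketch: diff two RDF graphs reconstructed from archived CSV snapshots.
    # The file names are hypothetical; rdflib's graph subtraction corresponds to
    # the SPARQL MINUS comparison described in the paper.
    from rdflib import Graph

    old_snapshot = Graph().parse("snapshot-2020.ttl", format="turtle")
    new_snapshot = Graph().parse("snapshot-2021.ttl", format="turtle")

    removed = old_snapshot - new_snapshot   # triples that disappeared (possible vandalism)
    added = new_snapshot - old_snapshot     # triples contributed since the older snapshot

    for s, p, o in removed:
        print("removed:", s, p, o)
    for s, p, o in added:
        print("added:  ", s, p, o)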

It is hard to find fault with this tightly argued paper or with the approach presented. To help the reader, "CSV2RDF metadata description JSON" and "metadata description JSON" (both awkward) and "CSV2RDF metadata description" could perhaps be referred to consistently as the "JSON metadata description file" (or even just "JSON metadata"). Some constraints only emerge in the context of examples -- that the language of literal values in a column cannot be expressed with variables or that one table column cannot be used to generate more than one triple (necessitating post-processing with SPARQL CONSTRUCT). The paper might state, up-front, how the CSV2RDF specification and the flat, two-dimensional model of a CSV table both enable and constrain the design presented here. What, in a few bullets, does this approach get from the CSV2RDF spec? What, in a short paragraph, are the trade-offs of designing for a "simple CSV"?
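To make the CONSTRUCT post-processing concrete, a query of roughly the following shape (my illustration of the general idea, not necessarily the authors' exact query) can derive the "truthy" wdt: triple that the one-column-one-triple limit prevents the CSV mapping from emitting alongside the full statement path.

    # Hedged illustration: derive truthy wdt:P31 edges from the p:/ps: statement
    # path in a reconstructed snapshot graph (hypothetical file name). This is a
    # guess at the shape of such post-processing, not the authors' own query.
    from rdflib import Graph

    graph = Graph().parse("snapshot-2021.ttl", format="turtle")

    CONSTRUCT_TRUTHY = """
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX p:   <http://www.wikidata.org/prop/>
    PREFIX ps:  <http://www.wikidata.org/prop/statement/>
    CONSTRUCT { ?item wdt:P31 ?value }
    WHERE     { ?item p:P31 ?statement . ?statement ps:P31 ?value }
    """

    derived = list(graph.query(CONSTRUCT_TRUTHY))   # CONSTRUCT results iterate as triples
    for triple in derived:
        graph.add(triple)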

Because tabular data represents values in a single flat table, the approach described here is best suited for handling relatively small subgraphs with a manageable number of properties, references, and qualifiers; that are smaller in terms of rows and columns; and that translate into less than one million triples. The same sorts of constraints apply to the Wikibase data model supported -- a usefully simplified subset of a model which, in its full complexity, would be hard to shoehorn into a two-dimensional table.

The bibliographic references look good, and the authors acknowledge input from an editor of the CSV2RDF specification, so the paper looks both solid and original -- a sensible approach that builds on existing work more than on surprising new insights (which is fine), and an important contribution to a growing literature about bridging the gap between Linked Data technologists and "spreadsheet-enabled" users.

Review #4
By Andra Waagmeester submitted on 15/Feb/2021
Suggestion:
Major Revision
Review Comment:

The authors describe an interesting approach for documenting and driving data contributions to Wikidata, using the data frame as an input to the semantic web. Other initiatives that come to mind are:
1. https://github.com/dcmi/dctap/blob/main/TAPprimer.md
2. https://github.com/johnsamuelwrites/ShExStatements
3. https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing
4. https://www.wikidata.org/wiki/Help:QuickStatements

Each of these examples shares the aim of facilitating data input/donation to Wikidata or RDF. I would suggest that the authors compare their work against these other data-frame-based frameworks: where do they align, and where do they differ in their respective approaches?

There are three points regarding this manuscript.
1. The gist of the paper is about data entry in Wikidata. Wikidata indeed supports RDF/SPARQL, but it lacks support for SPARQL INSERT or UPDATE queries, meaning that there is no direct way to ingest RDF into Wikidata or Wikibase. All data input goes through the API, which means that the RDF generated by the workflow described in this paper would need to be transformed into the JSON required to submit data to Wikidata's action API (see the first sketch below this list of points for the shape of such an API call). How would the authors model this transformation? It seems to be out of the scope of CSV2RDF, does it not?

2. In the introduction the authors describe the "truthy" statement as a possibility to describe an item. The overall structure is accurately depicted in fig 1, except that some nuance is missing. The core Wikibase data model is rendered in RDF with some redundancy, and the truthy statements are derived using a set of heuristics. For example, according to the truthy-based query https://w.wiki/zcp, Russia is not a country, whereas the same query rewritten using the provenance subset of the RDF model (the p: prefix) says that it is. The reason is that the "truthy" statements ignore statements with the rank "deprecated", and where a preferred-rank statement exists, the normal-rank statements are ignored as well. Russia has the statement that it is an instance of a country, but that statement has normal rank; its preferred-rank statement describes it as an instance of a "sovereign state". The authors do acknowledge the existence of ranks in the paper but argue that they are out of scope. Given the role ranks play in determining which statements become truthy, I would recommend describing those ranks together with the truthy statements.
Personally, I consider the truthy statements valuable for prototyping, i.e. they allow quick and dirty querying; however, the provenance-based statements should always follow in a data pipeline. I would recommend that the authors add the heuristics that differentiate between p: and wdt: edges in the graph and also explain how that is reflected in their pipeline (the second sketch below this list of points illustrates the difference on the Russia example).

3. With respect to validation, how does this work relate to the EntitySchema extension that was introduced to Wikidata in 2019?
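Regarding point 1: as a hedged sketch (not the authors' pipeline), the statement content carried by the RDF/CSV would have to be re-expressed as the JSON expected by the action API, for example via the wbcreateclaim module; authentication and CSRF token handling are assumed to have happened already.

    # Hedged sketch for point 1: pushing one item-valued statement to Wikidata's
    # action API. Assumes an authenticated requests.Session and a CSRF token
    # obtained beforehand via action=query&meta=tokens.
    import json
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def create_item_claim(session: requests.Session, csrf_token: str,
                          subject_qid: str, property_id: str, value_qid: str) -> dict:
        # The API takes the datavalue as a JSON string, not as RDF.
        datavalue = json.dumps({"entity-type": "item",
                                "numeric-id": int(value_qid.lstrip("Q"))})
        response = session.post(API, data={
            "action": "wbcreateclaim",
            "entity": subject_qid,
            "property": property_id,
            "snaktype": "value",
            "value": datavalue,
            "token": csrf_token,
            "format": "json",
        })
        return response.json()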
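Regarding point 2: the Russia example can be made concrete with two ASK queries against the Wikidata Query Service (a sketch only; the answers depend on the ranks in Wikidata at query time).

    # Sketch for point 2: the truthy (wdt:) edge reflects only best-rank statements,
    # while the full statement path (p:/ps:) still exposes the normal-rank claim.
    # Russia = Q159, instance of = P31, country = Q6256; WDQS predefines the prefixes.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                             agent="truthy-rank-example/0.1")
    endpoint.setReturnFormat(JSON)

    TRUTHY_ONLY = "ASK { wd:Q159 wdt:P31 wd:Q6256 }"       # expected: false
    FULL_MODEL  = "ASK { wd:Q159 p:P31/ps:P31 wd:Q6256 }"  # expected: true

    for query in (TRUTHY_ONLY, FULL_MODEL):
        endpoint.setQuery(query)
        print(query, "->", endpoint.query().convert()["boolean"])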

Review #5
By Dimitris Kontokostas submitted on 14/Mar/2021
Suggestion:
Major Revision
Review Comment:

The authors present an approach for maintaining external datasets in sync with Wikidata in a loosely coupled way. Instead of adding the data directly to Wikidata, the authors suggest maintaining the datasets in CSV files and using the CSV2RDF specification to map the data to the Wikidata RDF format. This approach allows the dataset editors to keep control of their data and to use a very simple (spreadsheet-like) form for editing it. The CSV2RDF export enables mapping the dataset to the Wikidata RDF schema and comparing it against data from the Wikidata SPARQL endpoint. This allows the detection of new values that are missing from the external dataset, of values that can be added to Wikidata, as well as of changes in values that are present in both datasets. In the latter case, the external dataset can also be used for detecting vandalism.

The paper provides a detailed overview of the Wikibase RDF export model as well as of the limitations of the CSV2RDF mappings when mapping a simple CSV file to that model. The authors argue for the added value of keeping the datasets out of Wikidata and using open standards to keep them in sync, as well as for possible new use cases. They also provide detailed examples of the mappings, SPARQL queries, and scripts they used for this work.

Personally, I like the idea of allowing dataset authors to maintain control over their data and of using references to Wikidata items for reconciling data against the Wikidata knowledge graph. Creating a Wikidata community recommendation for maintaining external datasets in sync with Wikidata would provide a lot of added value, especially if open standards can be used for that approach. However, I have the following comments on this paper:

1. The title, "Managing linked open data in Wikidata using [...]", indicates that the external data are managed directly in Wikidata, while this is not the case. Choosing a different wording could help make the intent clearer.

2. The authors deal with a mapping between two models: a very complex one from Wikibase and a very simple one from CSV. They describe how the simple model can be mapped to the complex one. This process is lossy, as the rich model always has more details.
Besides the data that might be missing from the CSV file, the authors also describe limitations of the CSV2RDF mappings that do not allow a complete mapping to the Wikidata statements. They solve this problem with additional SPARQL CONSTRUCT queries that add the missing triples. This is not an easy process, as Wikidata statements get unique URIs, which also need to be stored in the CSV to obtain a one-to-one mapping.
I would be interested to see the authors discuss mapping in the opposite direction. For example, for every CSV dataset, create a SPARQL query that returns data from the Wikidata Query Service in a tabular form that matches the structure of the original CSV file. In that case, instead of comparing the data in RDF, we would be comparing CSV rows and cells (a rough sketch of this direction is given at the end of this review). Of course, I am not suggesting that this is a better approach, but some argumentation about its benefits or drawbacks is missing from the paper.

3. It would be valuable to list some related work on syncing external datasets with Wikidata. If there are no related papers, some examples of external datasets that are integrated with Wikidata would help us see how the community currently handles such cases.
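As a rough sketch of the "opposite direction" raised in point 2 (the query, column names, and items are hypothetical, not the authors'), a SELECT query can be shaped to return one row per CSV row so that the comparison happens on cells rather than on triples.

    # Sketch of the reverse direction: query the Wikidata Query Service for a table
    # shaped like a hypothetical two-column local CSV (qid, label) and compare rows.
    import csv
    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    SELECT ?qid ?label WHERE {
      VALUES ?item { wd:Q42 wd:Q159 }                  # hypothetical items of interest
      BIND(STRAFTER(STR(?item), "entity/") AS ?qid)
      ?item rdfs:label ?label .
      FILTER(LANG(?label) = "en")
    }
    """

    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                             agent="csv-compare-example/0.1")
    endpoint.setQuery(QUERY)
    endpoint.setReturnFormat(JSON)
    bindings = endpoint.query().convert()["results"]["bindings"]
    live_rows = {(b["qid"]["value"], b["label"]["value"]) for b in bindings}

    with open("snapshot.csv", newline="") as f:        # hypothetical local snapshot
        reader = csv.reader(f)
        next(reader)                                   # skip the header row
        local_rows = {tuple(row) for row in reader}

    print("rows only in the local snapshot:", local_rows - live_rows)
    print("rows only in live Wikidata:     ", live_rows - local_rows)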