Review Comment:
The paper describes an approach to representing snapshots of small subgraphs of the RDF representation of Wikidata in CSV files. The files are accompanied by a standard descriptor using the W3C Generating RDF from Tabular Data on the Web Recommendation (CSVW), ensuring that the subgraph can be automatically reconstructed from the CSV files. The targeted use cases are archival of the snapshots and version comparison, either for detecting and reverting vandalism or for detecting data added by the community. The presented approach is motivated by the simplicity, readability, and editability of CSV files in common spreadsheet editors and by their versioning in common version control systems.
While the paper is easy to read and follow, the contribution itself seems weak, loosely specified, and unconvincing for a full research paper to be published in the SWJ.
My main concerns are regarding the originality and significance of the results.
The CSVW representation of the monitored Wikidata RDF can be viewed as just another RDF serialization. It is most similar to JSON-LD with a custom context, which likewise maps a regular JSON structure to RDF. The only difference here is the base format, which is CSV. Yet more than half of the paper deals with an introduction to the Wikidata RDF data model and a fairly straightforward description of what a subset of the data model looks like in CSVW. Moreover, Section 3 describes different ways the CSVW serialization can be used like any other RDF serialization, raising even more questions about the advantages of storing the data in CSVW.
Finally, the use cases in Section 4 could be served by any other RDF serialization as well, and possibly even more conveniently, as most of them start with deserialization of the CSV files into an RDF graph anyway, e.g. to be loaded into a triplestore and queried using SPARQL.
Storing the data in CSV could only be warranted by a strong CSV-oriented use case, but none can be found in the paper. On the contrary, the CSV examples in the appendix make it clear that, even though such CSV files can definitely be loaded into a spreadsheet editor, they would still be very hard for human users to read, let alone write.
Another aspect discussed in the paper is the file size of the CSVW serialization compared to Turtle. The authors claim that, when compressed, the CSVW serialization is half the size of the Turtle one. However, the approach itself is limited to small subgraphs anyway, so size should not matter much. In addition, if size did matter, the CSVW serialization would need to be compared at least to HDT, a binary RDF serialization that greatly reduces RDF dump file sizes while maintaining basic searchability.
Regarding the quality of writing, the paper is completely missing a related work section where it would be compared to other existing approaches that could be used for the mentioned use case. For instance, regarding version control, there are approaches to version RDF directly, see https://github.com/AKSW/QuitStore or https://github.com/rdfostrich/ostrich which could be used instead of CSVW.
Also missing is any kind of evaluation or user feedback. For example, was the approach used in practice? Did the users appreciate the data being in CSV?
Overall, the paper reads like something between a vision paper and a demo paper rather than a full research paper to be published in a journal.
More major issues:
1. A GUI tool is mentioned in Section 3; however, no screenshot is supplied, so it is hard to imagine how usable the tool might be.
2. There is a performance evaluation on page 8 stating that the conversion with rdf-tabulator was done under macOS 11.0.1 using a 2.3 GHz quad-core processor with 8 GB memory. If this evaluation is meant to be reproducible, this system specification is insufficient. The exact type of the processor is unspecified. While processor frequencies have remained roughly the same for more than 10 years, there are significant differences among processor generations and manufacturers. The type of the storage unit is also unspecified, which is again significant information when working with files on disk. On the other hand, with more and more machines running in the cloud, it is often both impossible and unnecessary to specify the exact configuration. I would suggest the authors present the time difference in percent rather than the exact time in seconds, partially avoiding these issues.
3. When writing about the compression of dumps, it would be necessary to provide exact algorithm settings. “Compressed as .zip” is insufficient.
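To illustrate points 2 and 3, the following is a minimal sketch of the kind of reporting I have in mind, using only the Python standard library. The payloads, file names, and helper functions (`zip_size`, `percent_difference`) are hypothetical placeholders of mine, not the authors' data or tooling. The point is only that the compression method and level are stated explicitly and that the comparison is given as a relative difference.

```python
import io
import zipfile


def zip_size(payload: bytes, name: str, level: int = 9) -> int:
    """Size in bytes of a ZIP archive holding one DEFLATE-compressed member."""
    buffer = io.BytesIO()
    # Method and level are set explicitly so that they can be reported.
    with zipfile.ZipFile(
        buffer, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=level
    ) as archive:
        archive.writestr(name, payload)
    return len(buffer.getvalue())


def percent_difference(value: float, baseline: float) -> float:
    """Relative difference of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Hypothetical stand-ins for a Turtle dump and its CSV counterpart.
turtle_like = b"wd:Q42 wdt:P31 wd:Q5 .\n" * 1000
csv_like = b"Q42,Q5\n" * 1000

turtle_zip = zip_size(turtle_like, "snapshot.ttl")
csv_zip = zip_size(csv_like, "snapshot.csv")

print(
    "ZIP, method=DEFLATE, level=9: CSV archive size is "
    f"{percent_difference(csv_zip, turtle_zip):+.1f}% relative to Turtle"
)
```

The same `percent_difference` helper could be applied to conversion times, which would make the reported numbers less dependent on the unspecified hardware.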
Minor issues:
1. In the abstract - “is built can be expressed” - needs rephrasing
2. I would suggest syntax highlighting JSON parts of the paper
3. It is rdf-tabular, not rdf-tabulator