Review Comment:
Subsetting of Wikidata is a worthwhile topic of inquiry and I feel like this paper does a good job of covering the options available to practitioners.
The paper’s results seem novel, though there appears to be overlap with some previously published work from overlapping sets of authors, e.g. https://dblp.org/rec/conf/esws/BeghaeiraveriGM21 or https://doi.org/10.37044/osf.io/wu9et. This is the first work I can find that discusses multiple subsetting technologies, and I left the paper feeling better equipped to decide which technologies to use when subsetting Wikidata. The writing is generally good, with some small grammatical errors.
My general view is that the paper could be more precise in a few areas, but that once those are cleaned up it provides a valuable overview of Wikidata subsetting technologies.
I’m breaking my detailed notes into three sections: major notes, minor notes and nitpicks.
Apologies for the great length of this feedback.
Major Notes:
The authors assert that the definition of “truthy” is “values whose statements are not deprecated are called Truthy.” This, I believe, is incorrect, and I worry this misunderstanding has seeped into the results of the paper. The definition of truthy is given at https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Truthy_... and is: “if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements for P2 are considered truthy.”
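To make the distinction concrete, here is a minimal sketch (my own, not taken from the paper) of how truthy statements for a property would be selected from a Wikidata entity’s JSON under the MediaWiki definition:

    # Minimal sketch of truthy selection per the MediaWiki definition.
    # `claims` is the "claims" object of a Wikidata entity JSON document;
    # statement ranks are "preferred", "normal", or "deprecated".
    def truthy_statements(claims: dict, prop: str) -> list:
        statements = claims.get(prop, [])
        preferred = [s for s in statements if s.get("rank") == "preferred"]
        if preferred:
            # If any statement for the property is preferred,
            # only the preferred statements are truthy.
            return preferred
        # Otherwise all normal-rank statements are truthy;
        # deprecated statements are never truthy.
        return [s for s in statements if s.get("rank") == "normal"]

Note that under the paper’s definition, normal-rank statements would incorrectly be kept even when a preferred statement exists.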
This leads to my second concern, about the mismatches in results between the various subsetting algorithms. I am a little concerned that more effort wasn’t put into determining the origin of these mismatches; I would like to see a few examples of mismatches identified and discussed. The paper uses language like “There might be lots of data instances in the input dump that still have a deprecated rank” when it would be easy enough just to compare the outputs. 95% accuracy is good, but a task like this really seems to justify 100% accuracy. Is it possible that these aren’t errors in subsetting, but instead just different understandings between the tools of what the subsetting configuration means?
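Comparing the outputs directly would be cheap. A sketch of what I have in mind, assuming both tools’ outputs are (or can be converted to) N-Triples, with file names hypothetical:

    # Sketch: diff two subset dumps serialized as N-Triples (one triple
    # per line) to surface concrete mismatches for discussion.
    def load_triples(path: str) -> set:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f
                    if line.strip() and not line.startswith("#")}

    a = load_triples("subset_kgtk.nt")    # hypothetical file names
    b = load_triples("subset_wdsub.nt")
    for triple in sorted(a - b)[:20]:
        print("only in first output: ", triple)
    for triple in sorted(b - a)[:20]:
        print("only in second output:", triple)

(Modulo blank-node and whitespace canonicalization, this would immediately yield the examples I am asking for.)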
You write “These items have been counted in the C program but are not extracted by any tools (Changing the regex in the C program to distinguish the deprecated rank has been unsuccessful as it requires more than 30 days of processing)”, which makes me believe the program currently ignores statement rank entirely. That seems like an oversight. In fact, looking at the C script, it appears to misidentify instances of, for example, disease. Look at https://www.wikidata.org/wiki/Special:EntityData/Q488700.json and note that it matches the regex used in the C script for disease, because the .* regex term is overly broad. Looking at the Wikidata SPARQL service today, I see fewer than 6,000 instances of disease (P31 Q12136), but the paper puts the number at 12,309 (page 11, line 8). This seems unlikely to be due to statement rank or change over time (assuming I interpreted the table in the paper correctly).
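I do not know the exact regex in the C script, but the failure mode can be illustrated with a hypothetical pattern of the same shape:

    import re

    # Hypothetical pattern of the same shape as the C script's: find a
    # P31 claim with value Q12136 (disease), with .* in between.
    pattern = re.compile(r'"P31".*"Q12136"')

    # An entity that is an instance of something else, but whose JSON
    # mentions Q12136 later (in another property, a qualifier, or a
    # reference), still matches. The values here are illustrative.
    line = '{"id":"Q000","claims":{"P31":[{"value":"Q5"}],"P921":[{"value":"Q12136"}]}}'
    print(bool(pattern.search(line)))  # True -> false positive for "instance of disease"

Over-matching of this kind alone could account for the inflated count.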
Looking at the Python script for the performance experiments, it seems that a number of relevant parameters were selected for KGTK’s operation (see https://github.com/kg-subsetting/paper-wikidata-subsetting-2023/blob/mas...). As far as I can tell, these parameters would have a major impact on performance, yet they are not mentioned in the paper. I would like to know how they affect the results of the performance experiments, or at least how you arrived at the values chosen. In particular, you selected what appears to be 6 cores of parallelism, but your machine supports 32 threads. Why?
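A harness along the following lines (the invocation is a placeholder; substitute the actual KGTK command from the experiment script) would make the parameter’s impact visible:

    import subprocess
    import time

    # Hypothetical sweep: re-run the KGTK extraction at several
    # parallelism levels and report wall-clock time for each.
    for procs in (1, 2, 4, 6, 8, 16, 32):
        start = time.monotonic()
        subprocess.run(["./run_kgtk_subset.sh", str(procs)], check=True)  # placeholder script
        print(f"procs={procs}: {time.monotonic() - start:.1f}s")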
The analysis done for table 8 confuses me. First, it is missing data, and I am not sure a paper with missing data can be accepted; in particular, the data for 2020/2021 seem to represent a critical region. More importantly, I am not sure why the analysis was done going back to 2015. The particular numbers being reported seem arbitrary (whether a statement has a P854 reference isn’t particularly meaningful), and I am not sure what conclusions can be drawn. Or do the authors disagree and think this analysis is meaningful? In that case, the paper should state the conclusions that can be drawn from the table (and that could not be drawn from just a single column or row).
More generally, I do not understand section 5.3. You describe schema 2 as “not referenced instances”, but the actual definition seems to be “no matter whether the instances of (P31) fact has been referenced or not” (i.e., either referenced or not). Query 2 has the same issue.
Minor Notes:
The most straightforward method of subsetting would seem to me to be to stand up your own instance of Blazegraph and use SPARQL CONSTRUCT queries, as mentioned in the paper. This would sidestep the timeout problem you mention with the WMF-run query service. I’m not clear on why this approach wasn’t included amongst the “practical solutions”. I believe (though am not sure) that there are even some community-run Blazegraph servers that do not have timeouts. The paper notes that “Moreover, recursive data models are not supported in standard SPARQL implementations”, but I’m not sure what practical restrictions that places on subsetting (perhaps this is a severe restriction, which would explain why the approach is not practical).
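For concreteness, this is roughly what I have in mind: a sketch using SPARQLWrapper against a self-hosted Blazegraph endpoint loaded with the Wikidata dump (the endpoint URL and the query are illustrative):

    from SPARQLWrapper import SPARQLWrapper, TURTLE

    # Sketch: subset via a CONSTRUCT query against a local Blazegraph,
    # which has no WMF-imposed timeout.
    sparql = SPARQLWrapper("http://localhost:9999/blazegraph/namespace/wdq/sparql")
    sparql.setQuery("""
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wd:  <http://www.wikidata.org/entity/>
    CONSTRUCT { ?item ?p ?o }
    WHERE {
      ?item wdt:P31 wd:Q12136 .   # instances of disease
      ?item ?p ?o .
    }
    """)
    sparql.setReturnFormat(TURTLE)
    with open("disease_subset.ttl", "wb") as f:
        f.write(sparql.query().convert())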
The paper notes: “The higher missed items and statements in KGTK output might be due to multiple indexing and format conversion steps.” Is this suggesting that KGTK has bugs that creep in between the format conversion steps? If true, that seems like a serious problem for KGTK.
On the same subject, the paper suggests that the inconsistency could be due to “inconsistencies and syntax errors in the input dumps”, but is that so? I would want to see an example of a syntax error in the dumps, because if such errors exist, that is something the WMF needs to fix; if they do not, we need another explanation for the inconsistencies.
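This claim is easy to check. A sketch that scans the Wikidata JSON dump (one entity per line, wrapped in “[” and “]”) and reports lines that fail to parse; the file name is illustrative:

    import json

    # Sketch: report syntactically invalid lines in the Wikidata JSON dump.
    with open("wikidata-all.json", encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            entity = line.strip().rstrip(",")   # entity lines end with ","
            if entity in ("[", "]", ""):
                continue
            try:
                json.loads(entity)
            except json.JSONDecodeError as e:
                print(f"line {lineno}: {e}")

If this (or the equivalent check for the RDF dumps) turns up nothing, the explanation does not hold.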
The paper notes that “About half of the extraction time in KGTK is spent converting Wikidata into TSV files”. To me this suggests that this conversion could/should be performed once, centrally (perhaps by the WMF or the community), after which all subsequent subsetting operations could be done with KGTK at speeds faster than WDF. I would like to see the time for this preprocessing broken out separately.
The paper focuses mainly on the performance of subsetting approaches, but the technologies it evaluates support very different kinds of subsetting. Perhaps this is outside the scope of the paper, but I would like to see some discussion of whether each tool supports common subsetting workflows. Can we say something (even qualitatively) about how many common subsetting tasks each of the evaluated tools supports? Are the analyses in 5.1, 5.2, and 5.3 indicative examples of common workflows, or are they merely notional/synthetic?
Is table 6 needed for the paper? If I understand correctly, this analysis was performed only with WDSub. I would rather have seen the analysis repeated with different tools than carried out from 2015 to the present. As table 6 is currently presented, I do not know what I am to conclude from it in the context of the paper.
The extracted subsets of data were not available for me to evaluate as they have not yet been uploaded to Zenodo.
Nit Picks:
The SWJ manual of style says to use sentence case for the title of the paper.
The mention of Blazegraph on page 4, line 33 could reference the original paper: https://www.taylorfrancis.com/chapters/edit/10.1201/b16859-17/bigdata%C2...
The SWJ manual of style says that “The use of first persons (i.e., ‘I’, ‘we’, ‘their’, possessives, etc.) should be avoided, and can preferably be expressed by the passive voice or other ways”.
On page 8, a word (“for”, “as”, or “in”) is missing from the table caption.
On page 11, “Comparing amongst tools, we can see that WDF extracted slightly higher items and statements”: do you mean “slightly more items”?
On page 4, the line between Q183 and the 100 km/h node is missing (only the arrowhead is present)
Page 15, line 37: I don’t understand the phrasing used here.
There are general grammatical errors throughout (e.g., I can’t parse the sentence “By WDF and WDumper, one can choose whether to ignore labels and other textual metadata along with the selected item” on page 14, line 43).
In the Abstract you write “%95” when I think you mean “95%”?