Wikidata Subsetting: Approaches, Tools, and Evaluation

Tracking #: 3372-4586

Authors: 
Seyed Amir Hosseini Beghaeiraveri
Jose Emilio Labra-Gayo
Andra Waagmeester
Ammar Ammar
Carolina Gonzalez
Denise Slenter
Sabah Ul-Hasan
Egon L. Willighagen
Fiona McNeill
Alasdair J G Gray

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract: 
Wikidata is a massive Knowledge Graph (KG) including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting -- WDSub, KGTK, WDumper, and WDF -- in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. The results show that all four tools have close and appropriate accuracy more than %95. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, we defined multiple subset use cases and analyzed the extracted subsets, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Daniel Erenrich submitted on 21/Feb/2023
Suggestion:
Major Revision
Review Comment:

Subsetting of Wikidata is a worthwhile topic of inquiry and I feel like this paper does a good job of covering the options available to practitioners.

The paper's results seem novel, though there appears to be overlap with some previously published similar work from similar sets of authors, e.g. https://dblp.org/rec/conf/esws/BeghaeiraveriGM21 or https://doi.org/10.37044/osf.io/wu9et. This is the first work I can find that discusses multiple subsetting technologies. I left the paper feeling better equipped to decide what technologies to use when subsetting Wikidata. The writing is generally good with some small grammatical errors.

My general view is that the paper could be more precise in a few areas but that if that is cleaned up the paper provides a valuable overview of Wikidata subsetting technologies.

I’m breaking my detailed notes into three sections: major notes, minor notes and nitpicks.

Apologies for the great length of this feedback.

Major Notes:
The authors assert that the definition of "truthy" in Wikidata is "values whose statements are not deprecated are called Truthy." This, I believe, is incorrect, and I worry this misunderstanding has seeped its way into the results of the paper. The definition of truthy is given at https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Truthy_... and is "if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements for P2 are considered truthy."
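To make that rank logic concrete, here is a minimal sketch of truthy selection per the quoted definition, assuming the standard Wikidata JSON entity format in which entity["claims"] maps each property ID to a list of statements carrying a "rank" field:

def truthy_statements(claims):
    # Per-property selection: preferred-rank statements win; otherwise fall back
    # to all normal-rank statements. Deprecated statements are never truthy.
    truthy = {}
    for prop, statements in claims.items():
        preferred = [s for s in statements if s.get("rank") == "preferred"]
        truthy[prop] = preferred or [s for s in statements if s.get("rank") == "normal"]
    return truthy

Under this definition a deprecated statement is excluded, but so is a normal-rank statement whenever a preferred-rank statement exists for the same property, which is where the paper's paraphrase diverges.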

This leads to my second concern about the mismatches in results between the various subsetting algorithms. I am a little concerned that more interest wasn’t taken in determining the origin of these mismatches. I would like to see a few examples of mismatches identified and discussed. The paper uses language like “There might be lots of data instances in the input dump that still have a deprecated rank “ when it would be easy enough just to compare the outputs. 95% accuracy is good but it really seems like a task like this could justify 100% accuracy. Is it possible that these aren’t errors in subsetting but instead just misunderstandings of what the subsetting configuration means between the different tools?
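For example, a minimal sketch of such a direct comparison, assuming each tool's subset has first been reduced to a plain list of extracted item IDs (the file names below are hypothetical):

def load_qids(path):
    # One QID per line, e.g. "Q42".
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

wdsub_ids = load_qids("wdsub_subset_qids.txt")
kgtk_ids = load_qids("kgtk_subset_qids.txt")

# A handful of concrete mismatches is usually enough to see whether the cause is
# rank handling, configuration semantics, or an actual extraction bug.
print("in WDSub but not KGTK:", sorted(wdsub_ids - kgtk_ids)[:20])
print("in KGTK but not WDSub:", sorted(kgtk_ids - wdsub_ids)[:20])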

You write "These items have been counted in the C program but are not extracted by any tools (Changing the regex in the C program to distinguish the deprecated rank has been unsuccessful as it requires more than 30 days of processing)", which makes me believe your program is currently totally ignoring statement rank? That seems like an oversight. In fact, looking at the C script, it appears that it misidentifies instances of, for example, disease. Look at https://www.wikidata.org/wiki/Special:EntityData/Q488700.json and note that it matches the regex used in the C script for disease (because the .* regex term is overly broad). Looking at the Wikidata SPARQL service today, I see fewer than 6000 instances of disease (P31 Q12136), but the paper puts the number at 12,309 (page 11 line 8). This seems unlikely to be due to statement rank or change over time (assuming I interpreted the table in the paper correctly).
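To illustrate the failure mode (the pattern below is a hypothetical stand-in, not the actual expression from the authors' C program):

import re

# A broad pattern of this shape matches any entity line that mentions Q12136
# anywhere after its first P31 occurrence, not only lines where Q12136 is a P31 value.
broad = re.compile(r'"P31".*"Q12136"')

# Synthetic one-line JSON entity: a human (P31 Q5) whose medical condition (P1050) is
# disease (Q12136). It is not an instance of disease, yet the pattern still matches.
line = '{"id":"Q999","claims":{"P31":[{"value":"Q5"}],"P1050":[{"value":"Q12136"}]}}'
print(bool(broad.search(line)))  # True: a false positive for "instance of disease"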

In looking at the Python script for the performance experiments, it seems that a number of relevant parameters were selected for KGTK's operation (see https://github.com/kg-subsetting/paper-wikidata-subsetting-2023/blob/mas...). As far as I can tell, these parameters would have a major impact on performance and are not mentioned in the paper. I would like to know how these parameters impact the result of the performance experiments, or at least how you arrived at the parameters that you chose for the experiment. In particular, you selected what appears to be 6 cores of parallelism, but your machine supports 32 threads. Why?

The analysis done for Table 8 confuses me. First, it is missing data, and I'm not sure we can accept a paper with missing data; in particular, the data for 2020/2021 seem to represent a critical region. But more importantly, I'm not sure why the analysis was done going back to 2015. The particular numbers being reported seem arbitrary (whether a statement has a P854 reference isn't particularly meaningful), and I'm not sure what conclusions can be drawn. Or do the authors disagree and think this analysis is meaningful? In that case, I would think the paper should mention conclusions that can be drawn from the table (which couldn't be drawn from just a single column or row).

More generally, I do not understand Section 5.3. You describe schema 2 as "not referenced instances", but the actual definition seems to be "no matter whether the instances of (P31) fact has been referenced or not" (i.e., either referenced or not). Query 2 has the same issue.

Minor Notes:
The most straightforward method of subsetting would seem to me to be to stand up your own instance of Blazegraph and use SPARQL CONSTRUCT queries, as mentioned in the paper. This would sidestep the timeout problem you mentioned with the WMF-run query service. I'm not clear on why this approach wasn't included amongst the "practical solutions". I believe (though am not sure) there are even some community-run Blazegraph servers that do not have timeouts. The paper notes that "Moreover, recursive data models are not supported in standard SPARQL implementations", but I'm not sure what practical restrictions that places on subsetting (maybe this is a severe restriction, which would explain why it is not practical).
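For reference, a minimal sketch of that approach, assuming a self-hosted Blazegraph instance loaded with the Wikidata dump (the endpoint URL and output file name are illustrative, and the endpoint is assumed to honour the Turtle Accept header):

import requests

ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"  # assumed default local namespace

QUERY = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
CONSTRUCT { ?item ?p ?o }
WHERE {
  ?item wdt:P31 wd:Q12136 .   # every instance of disease
  ?item ?p ?o .               # keep all outgoing triples of those items
}
"""

resp = requests.post(ENDPOINT, data={"query": QUERY}, headers={"Accept": "text/turtle"})
resp.raise_for_status()
with open("disease_subset.ttl", "w", encoding="utf-8") as out:
    out.write(resp.text)

Run against a local endpoint, such a query is not subject to the public service's timeout.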

The paper notes “The higher missed items and statements in KGTK output might be due to multiple indexing and format conversion steps.” Is this suggesting that KGTK includes bugs that are creeping in between the format conversion steps? That seems like a serious problem for KGTK if true.

On the same subject, the paper suggests that the inconsistency could be due to “inconsistencies and syntax errors in the input dumps” but is that so? I would want to see an example of syntax errors in the dumps because if it’s true it’s something that needs to be fixed by the WMF. If it’s not true then we need new reasons for the inconsistencies.

The paper notes that “About half of the extraction time in KGTK is spent converting Wikidata into TSV files”. To me this suggests that this operation could/should be performed once centrally (maybe by the WMF or the community) and then all subsequent subsetting operations would be doable using KGTK at speeds faster than WDF. I would like to see the time for the preprocessing broken out separately.

The paper mainly focuses on the performance of subsetting approaches, but the technologies it evaluates support very different kinds of subsetting. Maybe it's outside the scope of this paper, but I would like to see some discussion of whether each of the tools supports common subsetting workflows. Can we say something (even qualitatively) about how many common subsetting tasks each of the evaluated tools supports? Is the analysis done in 5.1, 5.2 and 5.3 an indicative example of common workflows? Or are they merely notional/synthetic?

Is Table 6 needed for the paper? If I understand the paper correctly, this analysis was only performed using WDSub. I would rather have seen the analysis repeated with different tools than carried out from 2015 to the present. I don't know what I am to conclude from Table 6 as it is currently presented in the context of the paper.

The extracted subsets of data were not available for me to evaluate as they have not yet been uploaded to Zenodo.

Nit Picks:
The SWJ manual of style says to use sentence case for the title of the paper.
Mention of Blazegraph on page 4 line 33 could reference the original paper https://www.taylorfrancis.com/chapters/edit/10.1201/b16859-17/bigdata%C2...
Manual of style for SWJ says that "The use of first persons (i.e., 'I', 'we', 'their', possessives, etc.) should be avoided, and can preferably be expressed by the passive voice or other ways."
On page 8, the word “for”/”as”/”in” is missing in the table caption
On page 11, “Comparing amongst tools, we can see that WDF extracted slightly higher items and statements” do you mean “slightly more items”?
On page 4, the line between Q183 and the 100 km/h node is missing (only the arrowhead is present)
Page 15 line 37, I don’t understand the phrasing you used here.
General grammatical errors (e.g. I can’t parse the sentence “By WDF and WDumper, one can choose whether to ignore labels and other textual metadata along with the selected item” on page 14 line 43)
In the Abstract you write “%95” when I think you mean “95%”?

Review #2
Anonymous submitted on 13/Mar/2023
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under "Long-term stable URL for resources". In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Review #3
By Wolfgang Fahl submitted on 30/Mar/2023
Suggestion:
Minor Revision
Review Comment:

The paper "Wikidata Subsetting: Approaches, Tools and Evaluation" reports on subsetting the full Wikidata Knowledge Graph. The motivation, namely that the full graph is too big to be useful and handled well in quite a few use cases, is well described. Comparing the different options available for getting a reasonable subset of the data has not been done systematically in the past. The results of this paper are helpful in making decisions on how to create Wikidata subsets for individual use cases. The chosen example of a biomedical knowledge graph fulfills relevancy criteria, such as the number of entities and relations, the number of instances involved, and the number of institutions and people involved as stakeholders, which positively impact the relevance of this paper. The paper is well written; it outlines the problem clearly, and the four solutions are evaluated well.

The Wikidata metamodel is explained in the terms most commonly used in the field. Table 6 in particular is helpful in showing how a knowledge graph subset in Wikidata is a "moving target", given how many stakeholders are involved in curating it.

Besides https://github.com/kg-subsetting/paper-wikidata-subsetting-2023 it might be worthwhile to point to: https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas/Subsetting

The paper is very tool-oriented and has some claims that might be challenged. I'd recommend revising these claims:

1. "However to the best of our knowledge, there is no precise format definition for subsetting" Subsetting of knowledge graphs IMHO may be described in various ways e.g. using the Gremlin Graph Traversal Language (which IMHO should be mentioned for this). See Marko A. Rodriguez , The Gremlin graph traversal machine and language, ACM 2015, DOI 10.1145/2815072.2815073. E.g. the Full graph is simply g() in that language. g.v() gives you all vertices and g.e() all edges - the subsetting problem is IMHO comparable to the standard graph algorithms that visit nodes in a graph by certain criteria. Harsh Thakker's Work https://dblp.org/pid/43/11366.html is IMHO relevant here and it should be noted that the blazegraph used by wikidata originally had support for gremlin https://github.com/blazegraph/tinkerpop3 which has unfortunately not gotten much attention. Describing a Wikidata subset in terms of RDF queries is also theoretically possible although the merits for real world scenarios are very limited given the property cardinality problems in wikidata that easily make queries very inefficient see Fahl et al, Property cardinality analysis to extract truly tabular query results from Wikidata, Wikidata Workshop 2022, https://ceur-ws.org/Vol-3262/paper7.pdf.

2. "The subsetting discussion were started at the 12 international SWAT4HCLS" IMHO these discussions started earlier as soon as Wikidata was started with a million entries. Therefore the claim should IMHO be softened to something like "The subsetting discussion in the biomedical Wikidata community were started"

3. "We present the average and the standard deviation of the three runs". A statistic on on only three runs without chaning major parameters that might be interesting when trying to replicate the results IMHO does not make much sense. Parameters that might be changed are: - RAM - CPU speed - Type of Disk (rotating/SSD) - Latency and thruput of disk

At least it should be mentioned what these parameters are ("6 TB HDD" does not tell the access time in msec or the throughput, e.g. 50 MByte/sec).

4. The "Timed Out" in Table 6 seems to be caused by using the official Wikidata endpoint. For this table multiple other public endpoints such as the QLever or Virtuoso endpoint are available and using e.g. Stardog it's or Jena it's possible to avoid these timeouts as outlined in Fahl et. al Getting and hosting your own copy of Wikidata, Wikidata Workshop 2022, https://ceur-ws.org/Vol-3262/paper9.pdf. The two values should IMHO be added using such endpoints.

5. "WDF is the fastest tool and it can extract a subset in less than 4 hours"

When using the N-Triples format dump, each line in the dump corresponds to one triple. This allows using standard Unix pipe-and-filter tools such as "grep" and "awk" for creating subsets. We have applied this technique in the past to get simple filter results such as "all scholarly articles with 'Proceedings of' in the title" successfully, and the speed is almost as high as the throughput provided by the storage medium.

A simple example is:

date;wc -l latest-all.nt;date

which counts all triples. On our SSD disk the command takes less than 45 minutes for the 18 billion triples of the dump of March 29th 2023.

Wed Mar 29 08:38:00 AM CEST 2023
18120655263 latest-all.nt
Wed Mar 29 09:22:50 AM CEST 2023

du -sm latest-all.nt
2698216 latest-all.nt

The file is 2.7 TB in size, and at the observed throughput (roughly 1 GB/sec, well above 100 MB/sec) the command takes less than 1 hour.
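Beyond counting, the same line-oriented idea also covers the filtering step itself; a minimal Python sketch of such a line filter (the pattern and output file name are illustrative, keeping only direct instance-of (P31) disease (Q12136) triples):

import re

# Each N-Triples line is one triple, so a line filter is a subset extractor.
pattern = re.compile(
    r'<http://www\.wikidata\.org/prop/direct/P31> '
    r'<http://www\.wikidata\.org/entity/Q12136>'
)

with open("latest-all.nt", encoding="utf-8") as dump, \
        open("disease-p31.nt", "w", encoding="utf-8") as out:
    for line in dump:
        if pattern.search(line):
            out.write(line)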

Here is a list of ideas for topics that might be addressed in this or a follow-up paper:

Separation of model data and instance data

P31 (instance of) and P279 (subclass of) refer to model data and need only a few triples to be extracted; which ones are relevant might be prepared in advance, e.g. with the approach of the wikidata.bitplan.com tool, by checking for the most frequently available properties before starting the subsetting.

Identifier skeleton versus full subset

If a subset's data is mostly available via other public sources and knowledge graphs, an identifier "skeleton" might be enough.

Further problem keywords:
- Relevance and Property Cardinality Problem
- Synchronization Follow-up Problem
- Versioning Follow-up Problem