RQSS: Referencing Quality Scoring System for Wikidata

Tracking #: 3326-4540

Authors: 
Seyed Amir Hosseini Beghaeiraveri
Alasdair J G Gray
Fiona McNeill

Responsible editor: 
Maria Maleshkova

Submission type: 
Full Paper
Abstract: 
Wikidata is a collaborative multi-purpose knowledge graph with the unique feature of adding provenance data to the statements of items as a reference. About 73% of Wikidata statements have provenance metadata, but there are few studies on the referencing quality in this knowledge graph, with existing studies focusing on relevancy and trustworthiness. While there are existing frameworks to assess the quality of Linked Data, there are none focused on reference quality. We define a comprehensive referencing quality assessment framework based on Linked Data quality dimensions. We implement the objective metrics of the assessment framework as the Referencing Quality Scoring System - RQSS. RQSS provides quantified scores by which the referencing quality can be analyzed and compared. RQSS scripts can also be reused to monitor the referencing quality regularly. Due to the scale of Wikidata, we have used well-defined subsets to evaluate the quality of references in Wikidata using RQSS. We evaluate RQSS over three topical subsets: Gene Wiki, Music, and Ships, corresponding to three Wikidata WikiProjects, along with four random subsets of various sizes. The evaluation shows that RQSS is practical and provides valuable information, which can be used by Wikidata contributors and project holders to identify the quality gaps. Based on RQSS, the overall referencing quality in Wikidata subsets is 0.58 out of 1. Random subsets (representative of Wikidata) have higher overall scores than topical subsets by 0.05, with Gene Wiki having the highest scores amongst topical subsets. Regarding referencing quality dimensions, all subsets have high scores in accuracy, availability, security, and understandability, but have weaker scores in completeness, verifiability, objectivity, and versatility. Although RQSS is developed based on the Wikidata RDF model, its referencing quality assessment framework can be applied to knowledge graphs in general.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Jul/2023
Suggestion:
Major Revision
Review Comment:

This submission proposes RQSS: a reference scoring framework for Wikidata. The framework is designed to fill a pretty clear gap in existing work on analyzing the quality of Wikidata and other LD sources, where references have typically been out of scope. The work does this in a comprehensive way, with 40 metrics and 22 aspects, belonging to 6 categories from prior work. The reference framework is evaluated on three topical subsets from Wikidata and four random subsets, showing certain indications of strengths and weaknesses of Wikidata in terms of its reference coverage.

This work is novel, and the paper is carefully written and well-structured. The work is a natural follow-up on prior theoretical and practical work investigating Wikidata references, and it delineates its contribution with respect to prior efforts. Given all of this, I expect that the paper will be pretty impactful and inform future work on Wikidata/linked data quality and provenance. The subset creation is described in enough detail and the code is provided for reproducibility. Finally, the results mirror the metrics, showing the findings for each metric in turn, which is systematic and fits reader expectations.

I have three main concerns with this paper, which are somewhat interrelated.

First, each of the metrics naturally provides a plausible instantiation of a dimension, and for every metric, a higher score means higher quality. So far this is fine. The problem appears when these metrics are aggregated in the results, which is mixing apples and oranges. Averaging two metrics that are computed in entirely different ways is meaningless and just confusing (as an analogy, note that we don't average F1-score and accuracy). In my view, simply presenting and discussing the results for each metric would already be super informative and would still be sound. Averaging metrics within a category, and then all the categories into a single score, is not a sound methodology. Associating those averages with arbitrary weights is further problematic.
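For concreteness, the kind of aggregation I am questioning has the general form below (my own notation, not necessarily the paper's exact formula), where m_1, ..., m_n are the metric scores within a category and w_1, ..., w_n their weights:

    S = \frac{\sum_{i=1}^{n} w_i \, m_i}{\sum_{i=1}^{n} w_i}

Such an S is only interpretable if the m_i are commensurable, i.e., computed on comparable scales with comparable semantics, which is exactly what is in doubt when the underlying metrics are defined in entirely different ways.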

Second, while it is not a problem that a metric is used as a proxy for a quality dimension, it is a little confusing to use the metric and the dimension interchangeably. This is especially problematic when the metric is relatively weak - for instance, the reputation metric merely looks at blacklisted links, which the authors themselves note is a pretty weak proxy. As a side note for this metric, using the URL connectivity degree would at least be an improvement. The authors also say several times that it is not possible to measure certain aspects; I would disagree, as dimensions like availability can certainly at least be approximated.

Third, the paper is just very long, and it reads like a report. I suggest that the authors spend some time making both the metric descriptions more concise (e.g., grouping the metrics within a category a little more) and, similarly, making the results more concise. Compacting both of these will make the paper way more readable and informative. In the current state, the reader will likely get lost in the sea of metrics, assumptions, and numbers, with unclear takeaways in the end.

Minor:
* In Figure 7, aren't the two Wikidata inputs in fact the same source?
* For metric 7, how do you know that your ground truth is complete?
* For some metrics, e.g., 17 and 19, it was unclear to me how you measure them concretely in practice.

Review #2
Anonymous submitted on 21/Sep/2023
Suggestion:
Major Revision
Review Comment:

1. Summary

The paper proposes a framework for judging the quality of knowledge graph references. It is based on existing works and extends existing metrics to cover quality aspects of references. It is concretely applied to Wikidata. Objective metrics that do not need manual human checking were implemented and calculated for some topical and random Wikidata subsets. The results were then analyzed and lessons learned were reported.

2. Originality

The paper includes new aspects of reference quality that were not discussed in previous works, and provides a concrete application to Wikidata.

3. Overall open questions and weaknesses that need to be addressed:

o When what needs to be verified goes beyond the information present in Wikidata itself (e.g., judging external URLs for the verifiability metric, or Freshness of Reference Triples, volatility ...), no solution is provided
o Sometimes it is not clear how to judge the measure (e.g., timeliness, freshness)
o Metrics 22, 23, and 27 are difficult to understand; a concrete example could help
o One additional possible metric for completeness is the number of reference properties available for a specific fact with respect to all possible reference properties existing in Wikidata. You could define when a reference contains all properties (reference URL, retrieved, and others ...); maybe some references have some properties but not all the ones needed
o Section 5.1 (page 29): the information on whether you locally host a SPARQL endpoint by loading the partial dumps into a specific triple store is missing, and if so, which triple store?
o Section 5.1 (page 29): how do you guarantee that those random subsets include references?
o The paragraph on page 30 ("however, ... ones") is difficult to follow and should be reformulated
o An overview per metric (like Table 5) would also be helpful
o Waiting more than 90 days: avoiding SPARQL queries and working directly on the dump would have taken less time. I suppose you wanted the flexibility of having any KG as input, which also means writing new queries, but at least an extension for the Wikidata pipeline could be to work directly on the dump and save time (a minimal sketch of this dump-based approach is given after this list)
o Should the weights sum up to a specific value? A normalization may make sense, e.g., all weights should sum up to 1
o "one possible weighting in the last column of Table 5": you rather present the final averaged value, but give no information about the specific weight for each metric or dimension
o What is really the difference between Sections 5.2 and 5.3? In both you analyze the results. If Section 5.3 is meant to go deeper into the details of the metrics, you should adapt its title; it seems that here you also show exactly how the specific metrics are calculated in Wikidata, so make that clear in the title and the introductory paragraph that follows
o It is really not clear what is evaluated here (Section 5.3, first introductory paragraph); the following subsections read rather as a more detailed analysis of the results. No evaluation of the system is performed
o In the subsections of Section 5.3 you mix method and analysis (how you check the licensing should be part of the previous section where you explain the metrics); if you do this here for every metric, then the title of this section should be adapted, and it should also be mentioned in the first introductory paragraph after 5.3
o Table 5, how are the metrics combined for each dimension?
o Section 5.3.2: since there is a specific number of reference properties (I think), this number would more or less be a statistic over all of Wikidata ... either give an example to explain, or give more arguments for the relevance and significance of this metric
o Is "last modified" a sign of credibility? Do you have any references? (line 51, page 40)
o Under "Schema-based Property Completeness of References" (page 41): a concrete example would help understanding here; as far as I know, in Wikidata there is no distinction between schema level and instance level
o "Property Completeness of References" (page 40): a concrete example would help here
o Fig. 15: why are some values exactly the same for the random subsets?
o "establish the Wikibase docker containers due to the lack of root privileges on the server." -> could you explain this more? Was this an attempt to set up your own Wikidata endpoint? What exactly was the problem? Also add a link to the Docker image
o What were the technical issues with Docker (lessons learned)?
o In the conclusion it is claimed that there is a lack of documentation for creating local copies of Wikidata: such documentation already exists; maybe discuss in the lessons learned the actual problem that you faced trying to do so
o Provide a summary of the metrics with low scores and elaborate more on suggestions for how to improve the quality of those metrics in Wikidata
o What is the effort of applying the framework to other KGs, and what should be adapted? ... -> future work (e.g., the Interlinking dimension uses a property specific to Wikidata; to reuse it with other KGs, this should be replaced)
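As a sketch of the dump-based approach suggested above: the following minimal Python snippet is only an illustration, assuming the standard Wikidata JSON dump layout (a bz2-compressed JSON array with one entity per line); the file name is a placeholder. It computes one example quantity, the share of statements carrying at least one reference, without any SPARQL endpoint; other metrics could be accumulated in the same single pass.

    import bz2
    import json

    DUMP_PATH = "wikidata-latest-all.json.bz2"  # placeholder path to a Wikidata JSON dump

    total_statements = 0
    referenced_statements = 0

    with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip empty lines and the enclosing JSON array brackets
            entity = json.loads(line)
            # "claims" maps each property ID to the list of statements using it
            for statements in entity.get("claims", {}).values():
                for statement in statements:
                    total_statements += 1
                    if statement.get("references"):
                        referenced_statements += 1

    if total_statements:
        print("Share of referenced statements:",
              referenced_statements / total_statements)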

4. Additional comments on the content of the paper that need to be addressed

- [Line 18, Page 1] "about" (in the abstract) vs. "more than" (in the introduction) 73%: two different claims, settle on one
- [Line 46, Page 1] "... one reference (footnote)": the definition of a reference in Wikidata, together with its components, is missing here (a screenshot would help the reader see what references look like in Wikidata from the beginning)
- [Line 1, Page 2] "multi-dimensional": give at least 2 examples of the other dimensions not dealing with provenance
- [Line 2, Page 2] either provide a very short definition of believability and verifiability here or as a footnote, or refer to the section where you define them later
- [Line 3, Page 2] "portion of metadata": give an example of what is meant by reference metadata in the Wikidata context
- [Line 6, Page 2] "relevancy and verifiability" -> use the same terms as in the paper and state that they are synonyms of the ones you use, where this applies. Those papers talk about accessible, relevant, and authoritative; how do they map to your terms believability and verifiability?
- [Line 10, Page 2] "referencing quality in linked data": how can this be generalized to other knowledge graphs? To what extent is it specific to Wikidata?
- [Line 13, Page 2] "statement level": since the term statement is somewhat specific to Wikidata, maybe give a very short intro to the different components of a Wikidata entry just before introducing the references, or together with the same screenshot: https://www.wikidata.org/wiki/Help:Statements
- [Line 19, Page 2] "34 metrics" -> and how are the remaining 6 metrics dealt with? Is manual checking always needed?
- [Line 41, Page 3] "relevant and autho..": the meaning of each of these should be defined, already in the intro (or the intro should point forward to here)
- [Line 10, Page 6] "has a human or machine-readable license" -> how can this be automatically checked, since each reference website may have its own structure and one cannot know where this information is available? Is this really an objective metric?
- [Line 32, Page 6] why is it rarely applied?
- [Line 14, Page 9] "retrieved (P813)" -> but this property does not have a regular expression or format constraint; it has other constraints, e.g., a range constraint. Which ones are considered?
- [Line 36, Page 10] “reference URL (P854) and stated in (P248)” -> is there a list of considered predicates for Wikidata in your case?
- Should the ratio for metrics 12, 17, 18, and 20 be high or low? How should it be interpreted? When is it good or bad?
- [Line 41, Page 13], from where do you get the blacklist?
- [Line 5, Page 15] "detecting .. is challenging" -> what was the solution? Are there any statistics on how many references are internal Wikidata resources and how many are external URLs?
- [Line 30, 39, Page 17] “difference” -> based on your formula, it is the fraction not the difference
- [Line 34, Page 18] where exactly can we find those schemas in Wikidata?
- [Line 1, page 21] "are not normalized between 0 to 1" -> which ones? Since it is always a ratio, it is expected to be between 0 and 1, no?
- [Line 39, Page 21] but at least the "retrieved" or "archive date" property will always be a literal, no?
- Fig.6 could be already used in the introduction
- [Line 20, Page 22] “trained data” -> training data or trained model?
- [Line 36, 37, Page 22] "usage. references" -> was this fraction rather meant to show how often a specific reference property is used? The sentence states a different aim
- [Line 4, Page 26] Last line of what?
- [Line 5, Page 26]: "online availability is not feasible" -> why does that not amount to testing whether the web page is accessible?
- The title of Section 3.3 does not fit; maybe change it to "summary of classification"
- [Line 38, 39, Page 27]: "metric targets" -> what is meant by metric target? And "the quality review is conducted" -> which quality review?
- Table 2: what is meant by source content?
- Fig. 7: add titles to each of the icons in the picture, also add component names to the picture, add the parts showing how subsets are created via Wdumper, and also show the usage of SPARQL queries and HTTP requests
- Fig. 7 caption: "(which is based on the Wikidata data model)." -> Wikidata dump? entity schema -> meaning? historical data -> all data or what exactly? "performs" -> calculates
- [Line 40, Page 28], from where can it be downloaded?
- [Line 43, Page 28], “directly from the Wikidata knowledge base.” the public endpoint?
- [Line 1, Page 29], "weights" -> what are the default weights?
- [Line 18, Page 29] "SPARQL endpoint": if the extractor also needs an endpoint, why is the icon used for the extractor and the metadata extractor not the same? (Note that while describing the tool, one gets the impression that a Wikidata dump in RDF format is used for the extractor ... revise that)
- [Line 20, Page 29] "Wikidata web pages" -> its use is not clear in Figure 7; use another icon
- Table 5: some numbers are exactly the same for different subsets; why? Also change to average ("overall" is not clear) and weighted average
- [Line 38, Page 31] “weighted average” -> which weights were used?
- [Line 40, Page 31] "change freq tags" -> what are those? A reference is missing here
- [Line 44, Page 31] “schema definition” -> what is meant by that?
- [Line 48, Page 32] equivalents -> what does it mean?
- [Line 15, Page 35] “weights” -> should the weights be assigned per metric or per dimension?
- [Line 33, Page 34] equivalent property (P1628) -> equivalent property for what exactly? For the reference property, e.g., "reference URL"? A quick check shows that this reference property does not have any P1628, for example. Is this what you mean?
- [Line 43, Page 36] From where do you get the information about the usage of bots? A reference would be helpful; this is used in many parts of the text while analyzing the results
- [Line 38, Page 37] “13 affected” -> could you give an example
- [Line 42, Page 37] what is meant by Wikidata software?
- [Line 24, Page 39] datasets list -> how was it gathered?
- [Line 26, Page 41] "This package searches the root domain for XML sitemap file" -> it is not clear where one gets those XML sitemaps from and how they are related to Wikidata
- [Line 41, Page 42] fact property -> was this defined before?
- [Line 42, Page 42] "instance level" -> what exactly is meant by instance level? (In Wikidata there is no distinction between schema and instance)
- [Line 39, Page 50] "full dump locally": you mean first hosting it in a local endpoint and then querying it?
- [Line 40, Page 50] “expensive” -> how much?

5. Significance of the results

Results are calculated based on the defined metrics and are significant for the specific subsets of Wikidata. It was also stated that they can more or less be generalized to the whole of Wikidata, because the chosen random subsets have similar topic coverage to Wikidata as a whole.

6. Quality of writing

Good writing quality with the following comments on the text that need to be addressed:
- [Line 19, Page 1]. However, existing studies ...
- [Line 19, Page 1] relevancy and trusworth .. Of what? (not clear)
- [Line 20, Page 1] “assess the quality of linked data” try to be more specific, give an example quality aspect
- [Line 21, Page 1] “dimensions” give an example (e.g., ..)
- [Line 22, Page 1] ". RQSS" avoid redundancy, e.g., use "the latter"
- [Line 27, Page 1] "the overall referencing": overall or average?
- [Line 33,34, Page 1] I do not see the relevance of the selected keywords. Maybe add some other significant keywords like "Provenance" , "Linked Data", "assessment framework" .
- [Line 42, Page 1] “Wikidata has” -> had
- [Line 44, Page 1] “reference footnote” add a normal reference instead at the end of your sentence
- [Line 48, Page 1] are researchers not human users? maybe just use the term end users here
- [Line 51, Page 1] Footnote 2 content: "updating" does not sound correct -> use accessible instead
- [Line 5, Page 2] “[14], ..” -> and was extended ...
- [Line 10, Page 2] “in Linked data” -> “of …”
- [Line 12, Page 2] "some KGs" -> the abbreviation should be defined at the first mention of "knowledge graph" and only the abbreviation should be used from then on; this occurs in a lot of places in the paper, check that, e.g., also line 13 page 2 "knowledge graphs"
- [Line 12, Page 2] typo: DBpedia (also in other places in the paper), and a reference to it should be added
- [Line 12, Page 2] "resource (item level)" give a reference to see how it looks in DBpedia or give an example here
- [Line 34, Page 2] "wikidata-driven": what does driven mean here? Maybe drop it
- [Line 36, 37, 38, Page 2] "this study … [2,14]" this sentence is repeated twice in this paragraph, once with references and once without
- [Line 6,7, Page 3] "in which extent data represented to the data consumers" -> something is missing here
- [Line 13, page 3] Freebase, YAGO, Cyc -> need references
- [Line 22, Page 3] a historical -> an historical
- [Line 23, Page 3] "Wikidata … statements" -> this sentence should be split. Why did they remove the statements, and what is the relation of the dataset to the quality of Wikidata? This is not clear
- [Line 25, Page 3] why is "Random" capitalized?
- Title 2.2 -> respect the title capitalization rule
- [Line 31, Page 3] Typo in author name "Wand"; better to use automatic author referencing to avoid typos
- [Line 31, Page 3] reference after "dimensions"
- [Line 42, Page 3] “englsh”
- [Line 20, Page 4] which is, to the best of our knowledge, the most ...
- [Line 8, Page 5] enables
- [Line 30, Page 5] in the definition of all the metrics, you write the title in italics without a period and directly start with "Consider" -> find a better way; this applies to all metrics, e.g., a new line and "we consider the function ..."
- [Line 32, Page 5] with status code 200
- [Line 42, Page 5] "ref URL … stated in" -> make these links, or add footnotes with links (these should already be defined in the intro, as suggested in a previous comment)
- [Line 33, 37, 36, Page 6] references to SSL, TSL, and man-in-the-middle. Typos “cause the”, “connections to external”
- [Line 16, Page 4] “to equivalence” -> to an equivalent ...
- [Line22, Page 9] Figure 2 -> use \autoref {} instead of manual typing
- [Line 33, Page 9] “Which there is” -> for which ...
- [Line 48, Page 9] triples
- [Line 11, Page 10] “for inference” -> redundancy
- [Line 43, Page 10] “the the datatype”
- Fig.4 caption -> show
- [Line 27, Page 11]  the value of ...
- [Line 40, Page 13] -> by checking if the external ...
- [Line 4, Page 17] a function
- [Line 2, Page 18] long dimension ... -> extensive ?
- [Line 27, Page 20] clarifies
- Fig.6 caption : types
- [Line 16, Page 22] : Judging of
- [Line 48, Page 24]: "the getting use" -> does not sound correct
- [Line 3, Page 26]: wd:Q7094076 -> label would help
- [Line 5, Page 26]: automatic -> automatically
- [Line 24, Page 26]; expects, “occur” instead of existing, and population instead of populating
- [Line 19, Page 27] typo: function
- [Line 50 Page 28] performs -> calculates
- Table 4 caption: joint -> overlapping, also in Line 32 page 30: overlapped -> overlapping
- [Line 38, Page 30]: “of disjoint classes” -> that classes are disjoint
- Table 5: sorting the numbers by asc desc order would be helpful here
- [Line 32, Page 32] E-ids -> reference to meaning
- [Line 15, Page 35] more -> higher
- [Line 35, Page 33] confirms
- Figure 12 in text should be 10
- [Line 49, Page 37] XPath reference, and also SheX-C in page 41
- [Line 34, Page 43] reference completely missing
- [Line 49, Page 44] lesser -> lower
- [Line 39, Page 46] try to be consistent: sometimes labeling and some other times labelling
- Page 48: the paragraph "However, high … Figure 18" is written twice
- [Line 42, Page 50] missed -> missing, Line 44 studies -> study
- [Line 19, Page 51] gives
- [Line 28, Page 51] “evaluated RQSS” -> not evaluated, but used
- [Line 38, Page 51] amount-of-data -> be consistent: sometimes written with capital letters, sometimes not
- [Line 11, Page 52] access day and month should be added, check all website references

7. Data file assessment

- The provided link to resources is a GitHub link that points to a specific release v1.0.1
- It contains a README that explains how to use the framework
- The first remark that should be addressed is that under "input/output" they talk about version 1.0.0, but the release is a higher version
- Screenshots of example plots from the "Presentation layer" should be added, together with example CSV files of the runner output
- In addition “Output files of the RQSS extractor and framework on 3 Topical (Gene Wiki, Music, Ships) subsets and 4 Random Subsets” are provided on Zenodo (https://zenodo.org/record/7336208), together with the actual subsets (https://zenodo.org/record/7332161 )
- The provided resources appear to be complete for replication of the experiments

8. Decision

Overall, the paper makes a good impression. However, there are still a lot of the previously mentioned comments that need to be addressed (both content- and writing-quality-related). I still consider these comments (even if they may seem numerous) feasible changes, since in most cases they require adding information, explaining unclear passages, or supporting some statements with arguments. The decision is: Major revision required.