On assessing weaker logical status claims in Wikidata cultural heritage records

Tracking #: 3569-4783

Authors: 
Alessio Di Pasquale
Valentina Pasqual
Francesca Tomasi
Fabio Vitali

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract: 
This work presents an analysis of the use of different representation methods in Wikidata to encode information with weaker logical status (WLS, e.g. uncertain information, competing hypothesis, temporally evolving information, etc.). The study examines four main approaches: non-asserted statements, ranked statements, non-existing valued objects, and statements qualified with properties P5102:nature of statement, P1480:sourcing circumstances and P2241:reason for deprecated rank. We analyse their prevalence, success, and clarity in Wikidata. The analysis is performed over cultural heritage artefacts stored in Wikidata divided into three subsets (i.e. visual heritage, textual heritage and audio-visual heritage) and compared with astronomical data (stars and galaxies entities). Our findings indicate that (1) the representation of weaker logical status information is limited, with only a small proportion of items reporting such information, (2) the representation of WLS varies significantly between the two datasets, and (3) precise assessment of WLS statements is made complicated by the ambiguities and overlappings between WLS and non-WLS claims allowed by the chosen representations. Finally, we list a few proposals to simplify and standardize the representation of this type of information in Wikidata, with the hope of increasing its accuracy and richness.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Maximilian Marx submitted on 11/Feb/2024
Suggestion:
Minor Revision
Review Comment:

I thank the authors for addressing my comments on the earlier version
of this paper, in particular the refactored code and the separation
of “novalue” and “somevalue”. However, there also lies my last
remaining gripe with the paper:

The authors claim that “the RDF representation of Wikidata uses blank
nodes for both unknown and non-existing values” (p8, l10f). In fact,
the RDF representation does not use blank nodes for non-existing
values. Only unknown (“somevalue”) values are turned into blank nodes;
non-existing (“novalue”) values do not create additional values, but
are represented by making the statement node (and, for asserted
claims, the entity itself) an instance of a class “wdno:P???”, where
the “???” corresponds to the relevant property id, cf. the RDF Dump
Format specification [0]. This needs to be clarified (there are
further references to “novalue blank node[s]” on p9, l16; on p15,
l31f; and on p16, l41). More egregiously, Listing 7 does not actually
show a “novalue” claim, but rather a “somevalue” claim, so another
example is needed (the modelling shown there is also not, as claimed,
incorrect, since it indeed uses an “unknown value”).
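
To make the distinction above concrete, here is a minimal sketch of
the mapping as I read it from the RDF Dump Format specification [0].
The function name, the blank node label, and the identifiers in the
usage lines are placeholders chosen for illustration; they are not
part of any Wikibase library and are not taken from the paper.

```python
# Hypothetical helper, for illustration only (not a Wikibase API).
def snak_to_triples(entity, prop, statement, snaktype, value=None, asserted=True):
    """Sketch the RDF triples for one statement, as prefixed-name strings.

    snaktype is "value", "somevalue" or "novalue", as in the Wikibase JSON.
    asserted=True also emits the direct ("truthy") triple on the entity.
    """
    subj, stmt = f"wd:{entity}", f"wds:{statement}"
    triples = [(subj, f"p:{prop}", stmt)]

    if snaktype == "novalue":
        # No object node is created: the statement node (and, for asserted
        # claims, the entity itself) becomes an instance of wdno:<property>.
        triples.append((stmt, "rdf:type", f"wdno:{prop}"))
        if asserted:
            triples.append((subj, "rdf:type", f"wdno:{prop}"))
        return triples

    if snaktype == "somevalue":
        # Unknown value: a blank node stands in for the object.
        obj = f"_:unknown_{statement}"  # placeholder blank node label
    elif snaktype == "value":
        obj = value
    else:
        raise ValueError(f"unexpected snaktype: {snaktype!r}")

    triples.append((stmt, f"ps:{prop}", obj))
    if asserted:
        triples.append((subj, f"wdt:{prop}", obj))
    return triples


if __name__ == "__main__":
    # Placeholder identifiers; only the shape of the output matters here.
    for kind in ("somevalue", "novalue"):
        for triple in snak_to_triples("Q1", "P123", "Q1-s1", kind):
            print(kind, *triple)
```

Only the “somevalue” branch ever yields a blank node, so a query that
counts blank nodes cannot be counting “novalue” claims.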

Moreover, “Since these two methods should be employed alternatively,
this co-occurrence on the same properties might indicate that
annotators are using these two types of blank nodes imprecisely” (p15,
l31f): For at least some of the properties (such as “publisher”), it
might be legitimate to state that some work does not have a value for
this (i.e., a work that was not published does not have a publisher
and would warrant a “novalue” claim), whereas for other works the
publisher is merely not known (which could warrant a “somevalue”
claim). On the other hand, for, e.g., “creator”, it is hard to imagine
a situation where a “novalue” might be legitimate. Maybe some more
differentiation is required here?

Lastly, “Even though Wikidata focus on established knowledge
(community consensus), rather than conjectural or controversial
information […]” (p17, l25f): ultimately, Wikidata is a secondary
database, not with the goal to encode all the true facts in the world,
but rather to collect and reference the facts claimed elsewhere
[2]. This is, of course, not to say that uncertainty of claims need
not be represented in Wikidata (on the contrary), but it might provide
some limited insight into why Wikidata has comparatively few WLS
claims: I would imagine that finding a reference that some claim is,
e.g., disputed, is rather more difficult than just finding references
for plain facts.

All in all, I am quite happy with the improvements made to the paper
and am confident that the remaining issues can be successfully
addressed in a minor revision.

Minor comments:
- p3, l17: “state of the art (2)” ~> “state of the art (section 2)”
- p4, l3: “have been imported into”: more accurately, have been linked
to the RKD data set; the original description may have been imported
from elsewhere.
- p4, l20: “indicate type” ~> “indicate the type”
- p5, l51: footnote 7 is broken, the link should go to
http://www.wikidata.org/wiki/Help:Statements instead. Several
further footnotes are also affected, see below.
- p7, l20: remove “http://www.wikidata.org/entity/Property_talk:P2241”
- p7, l49: footnote 14 is broken (“/entity/” ~> “/wiki/”)
- p8, l3: around the reference to footnote 17, a closing parenthesis
is missing
- p8, l46: a better footnote 17 might be
https://www.wikidata.org/wiki/Q86719099, or either of the two
values for the “described at URL” property
- p10, l48: footnote 26 is broken (“/entity/” ~> “/wiki/”)
- p17, l48: footnote 51 is broken (“/entity/” ~> “/wiki/”)
- p18, l51: “and are not” ~> “not being”
- p19, l44: “assigned a accepted” ~> “assigned an accepted”
- p20, l10: “represent the unknown value” ~> “represent the
non-existing value”

[0] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Novalue
[1] https://www.wikidata.org/wiki/Q11981626
[2] https://www.wikidata.org/wiki/Wikidata:Verifiability

Review #2
By Daniel Hernandez submitted on 25/Feb/2024
Suggestion:
Major Revision
Review Comment:

This paper addresses four research questions regarding Wikidata statements that the authors call weaker logical status (WLS) claims, which are statements that are not (strongly) true. Questions RQ1 to RQ3 are about the current state of WLS claims in Wikidata, whereas RQ4 is about improving this state.

I appreciate the discussion of the different types of Wikidata statements in Section 3. The examples show some implicit rules in the mapper from the Wikibase data model to the RDF model. However, I cannot recommend the paper in its current state because its scientific contributions are unclear.

* Relevance

Wikidata has defined several ways to add metadata over statements. However, semantics still need to be provided. Hence, advances in this regard are relevant.

* Quality

** Major issues

What is the contribution of this paper? If the paper's contribution is the answer to the research questions, then why is the state-of-the-art section not describing how other works contribute to answering these questions? Instead, the current state-of-the-art section describes how other projects are representing WLS claims (which is relevant to understand the WLS problem but is not state of the art regarding the paper contribution).

I have concerns regarding the research questions.

RQ1 can be divided into two questions. The first question is how widespread each of the approaches to WLS claims that the authors identify is. That is, to answer this question, we need to count the existing WLS claims. The second question is how useful each of these approaches is. This question is not clear because it is not specified when an approach can be considered successful and when it is unsuccessful.

The authors' answer to RQ1 is that "Wikidata seems doing poorly." They say that this is because they expected more WLS claims when comparing Wikidata and the RKD regarding the number of attribution disputes. They found disputes in 0.4% of visual artworks in Wikidata, whereas in RKD 8.5% of artworks show disputes. The question is whether these datasets are comparable. Do they describe the same artworks? To my understanding, these two datasets do not include the same artworks. So, I do not agree with the conclusion of the authors.

RQ2 asks how the different approaches are used in two different fields. However, the term "approach" is inappropriate to classify the different types of WLS claims the authors describe. We can use the term approach to refer to one of several ways to give the same semantics to a statement. On the contrary, the different approaches that are described in this paper give different semantics to the statements. Thus, it is natural that different disciplines use different types of WLS claims.

RQ3 asks how clean and easy it is to differentiate the applications of each approach to an actual weaker logical status versus another of the designed uses of that approach. This question is quite difficult to understand. First, the words clean and easy are ambiguous because the paper does not provide a metric for them. Second, it is implicit in this question that an approach A may have several designed uses B₁, ..., Bₙ. The problem described in the question is then, given a statement S using approach A, to determine which of the uses B₁, ..., Bₙ must be considered. I searched for the term "designed use" in the paper, and it only appears in the description of this question. Hence, I cannot understand what this question means.

RQ4 asks whether there is a way to improve the findings regarding question RQ3. This is not a research question, since the possible answers are "no, there is no way" or "yes, there is a way." Is this question falsifiable? How can we prove that there is no way? After reading the answer given to this question, I can imagine that the question should say "how" instead of "is there". The answer is provided in Section 5. However, the recommendations in Section 5 are not supported by a scientific methodology. While some of these recommendations could be sensible and could be described in a position paper, they are not research findings. Moreover, some of them are vague. For example, "Provide simple-to-use interface widgets." What is simple-to-use?

** Minor issues

Page 1. The abstract finding (1) says, "the representation of weaker logical status claims is limited" but it should say "presence" instead of "representation." The word "representation" can be understood as the way the knowledge is represented. In the abstract findings (2) and (3), the word representation is ambiguous because it may mean how the knowledge is represented or how many statements have weaker logical status.

Page 1, line 40. The sentence "representation methods that allow to encode complex structures much beyond factual descriptive metadata" is difficult to understand. What is meant by "complex structure"? What is non-descriptive metadata, and what is non-factual metadata?

Page 17, line 23. "in each" → "is each."

Review #3
By Michael Piotrowski submitted on 29/Feb/2024
Suggestion:
Minor Revision
Review Comment:

The authors have responded in detail and revised the paper quite thoroughly. I appreciate the detailed responses, and I find the revised version a definite improvement. I think the revised version can now be considered for publication as a journal paper.

Unfortunately, there remain a number of typographical and linguistic issues; most of them are minor, but there are too many for publication.

Throughout:

- The spelling still mixes American and British (e.g., “standardize,” “minimize,” “homogenize,” “summarize” vs. “summarise,” “modelling,” “colouring,” “endeavour”). Please use a spellchecker!
- Sometimes the Oxford comma is used, sometimes not. It should always be used (or never).
- For numbers, a decimal point rather than a comma should be used in English, and the comma as thousands separator.
- There are still places where a hyphen is used instead of an en-dash.
- Typographic quotation marks should be used.
- “i.e.” and “e.g.” are sometimes followed by a comma, sometimes not. They should always be followed by a comma.
- Please do not use reference numbers as nouns (such as “according to [2]”). This forces readers to look up the reference before continuing. I have listed the occurrences that I noticed below, but there may be others. With respect to the authors, I’ve used “et al.” for brevity below, but you may want to use “et al.” only when there are more than, say, three authors.

Individual issues (note that the specific reformulations are suggestions, but these passages need to be reformulated in any case):

- p1l40: “According to [2], …” → “According to Möller et al. [2]”
- p2l51: “research is frequently interpretative, and qualitative and the necessary proof” → “research is frequently interpretative and qualitative, and the necessary proof”
- p3l17: “in the state of the art (2) relevant data sources KGs and data models are presented” → “in section 2 (State of the art), relevant data sources, knowledge graphs (KG), and data models are presented” (the abbreviation “KG” hasn’t been introduced before).
- p3l37: “Despite domain ontologies representing the cultural heritage domain hardly managing to integrate support for interpretation (i.e., hermeneutics) into their models [5], there are some exceptions [4, 14].” → “Although domain ontologies representing the domain of cultural heritage hardly ever integrate support for representing interpretations (i.e., hermeneutics) into their models [5], there are a few exceptions [4, 14].”
- p4l7: The relation of the footnote (and the URL) to the text is unclear.
- p4l12: “their usage e.g., by [22], who compared” → “their usage, for example by Hernández et al. [22], who compared”
- p5l46: “For sure” → “Note”
- p7l15: “Following the example from [38]” → “Following the example from Aljalbout et al. [38]”
- p7l20: The URL in the text is probably intended as a footnote
- p9l13: “As [37] suggests” → “As Patel-Schneider [37] suggests”
- p12l11: Use an en-dash instead of hyphen
- p13l31: “we see a much homogeneous distribution” → “we see a much more homogeneous distribution”
- p14, figure 2: Make figure use the full page width; the labels should be at least the same size as footnotes
- p15l41: “the print "Races: Anteriel" star recently shifted” → “the print "At the Races: Anteriel" recently shifted”
- p20l8: simplifying this sentence would make it easier to understand: “A fourth pattern could be in theory allowable, that of a claim for which the only reported value is wrong, but no acceptable alternatives exist.” → “A fourth pattern could be potentially allowed, namely for claims for which the only reported value is wrong, but no acceptable alternatives exist.”
- p20l20: “First of all, very few statements are expressed using weaker logical status than could have been expected” → “First of all, the number of declarations expressed using a lower logical status is much lower than might have been expected”
- p20l21: “Second, the Wikidata data model, far from being too poor for expressing WLS claims, has been shown to provide, in fact, an overabundance of methods, but there seems to be a large overlapping in uses between themselves and also towards non-WLS applications.” → “Secondly, the Wikidata data model is far from being too poor to express WLS claims; in fact, it offers users an overabundance of methods, but their applications overlap, and they are also used for non-WLS applications.”
- p20l33: “We plan to publish such taxonomy with a proposal for mapping existing data points into such taxonomy to lose no information in the conversion.” → “We plan to publish this taxonomy with a proposal for mapping existing data points to this taxonomy so that no information is lost during conversion.”
- p21l3: Entry [9] is in all caps.