Wikidata through the Eyes of DBpedia

Tracking #: 1389-2601

Authors: 
Ali Ismayilov
Dimitris Kontokostas
Sören Auer
Jens Lehmann
Sebastian Hellmann

Responsible editor: 
Aidan Hogan

Submission type: 
Dataset Description
Abstract: 
DBpedia is one of the earliest and most prominent nodes of the Linked Open Data cloud. DBpedia extracts and provides structured data for various crowd-maintained information sources such as over 100 Wikipedia language editions as well as Wikimedia Commons by employing a mature ontology and a stable and thorough Linked Data publishing lifecycle. Wikidata, on the other hand, has recently emerged as a user curated source for structured information which is included in Wikipedia. In this paper, we present how Wikidata is incorporated in the DBpedia eco-system. Enriching DBpedia with structured information from Wikidata provides added value for a number of usage scenarios. We outline those scenarios and describe the structure and conversion process of the DBpediaWikidata (DBw) dataset.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Thomas Steiner submitted on 24/Jun/2016
Suggestion:
Accept
Review Comment:

In this paper, the authors introduce the DBpediaWikidata dataset (DBw) and describe how it fits in the DBpedia ecosystem. The paper starts with a description of the two structured knowledge bases DBpedia and Wikidata as such, and then highlights their different data modeling choices. In continuation, the authors describe the data conversion process from Wikidata to the new DBw dataset, which they detail and evaluate thereafter. Finally, the authors outline the dataset's access and sustainability perspectives and describe a number of use cases. The paper closes with an outlook on future work and conclusions. It was submitted in the category of "Dataset Description" papers. The biggest question mark for me personally concerns the core objectives of the dataset (see remark below) and the lack of a clear statement on the long-term vision regarding the position of DBpedia: do the authors want to see it merged into Wikidata, co-exist with it "forever", or do they have completely different visions? In conclusion, I would recommend this paper for publication once the remarks below are addressed.

In the following, I provide detailed remarks on content, grammar, typos, and typography & layout.

* Content

- The first paragraph of the Introduction is almost identical to the Abstract. Please rephrase or remove.
- The "Publication" item (page 1, right column) does not mention Wikidata, which breaks the pattern from the other items. Please add a remark on Wikidata.
- Page 2, left column: "[P]eople face difficulties when confronted with the young and still evolving Wikidata schema" [citation needed].
- Footnote 5 is missing (it appears in the caption of Figure 1, but is missing below).
- The objectives of the dataset are unclear, page 3, left column states: “DBpedia has a big community and there has been extensive tool development to explore and exploit DBpedia data through the DBpedia ontology. Thus, we add another bubble in the LOD cloud, which helps the semantic web and DBpedia community for an easier transition to Wikidata data.” This essentially says: DBpedia is well-established. In order to transition to Wikidata, we simply add another bubble to the LOD cloud. Please rephrase this to state the objectives more clearly. A clear statement is also missing on the long-term vision regarding the position of DBpedia: do the authors want to see it merged into Wikidata, co-exist with it "forever", or do they have completely different visions?
- For the conversion process on page 4, right column, regarding the Wikidata ontology, consider citing the very related paper "From Freebase to Wikidata" (http://research.google.com/pubs/pub44818.html, full disclosure: I am one of its coauthors).
- Page 5, right column, the reified statement IRI "dw:Qs_Px_H(Lv,5)" urgently needs an example. Is H defined somewhere? Can you justify the size limit of 5? The IRI-splitting method likewise needs an example; it is unclear without concrete data to look at.
- Page 7, right column, the paragraph "There were more than 10 million requests" starts out of nowhere. Did a heading get lost?
- For the use cases on page 8, left column: the SPARQL query for reified statements is arguably not simpler than the DBw queries (apart from the more speaking labels). Can you outline the advantages more?
- For the conclusions on page 8, right column, a qualitative analysis in addition to the quantitative one is needed to back the statement "According to the web server statistics the daily number DBW visitors range from 300 to 2,700 and we counted almost 30,000 unique IPs since the start of the project, which indicates that this dataset is heavily used." What kind of queries do people use it for? Is there a general theme? Can you try to add some qualitative analysis to the traffic statistics?
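To make the remark about the reified-statement IRI "dw:Qs_Px_H(Lv,5)" concrete, here is the kind of minimal sketch an example could look like. The hash function (MD5) and the 5-character truncation are purely illustrative assumptions on my part, not the paper's documented scheme:

```python
import hashlib

def mint_statement_iri(subject_q, property_p, value, hash_len=5):
    """Mint a reified-statement IRI of the assumed form dw:<Q>_<P>_<hash>.

    The value is hashed so that distinct statements sharing the same
    subject and property still receive distinct IRIs; the hash is
    truncated to `hash_len` characters to keep IRIs short. Both choices
    (MD5, length 5) are illustrative assumptions, not the paper's scheme.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()[:hash_len]
    return f"dw:{subject_q}_{property_p}_{digest}"

# Hypothetical example: Douglas Adams (Q42), educated at (P69) a college (Q691283)
iri = mint_statement_iri("Q42", "P69", "Q691283")
```

An example along these lines in the paper would let readers verify what H is and why 5 characters suffice to avoid collisions in practice.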

* Typos (please use full text search to find the concrete occurrences)

- "Figure 2 depics" => depicts
- "we use OWL punning"(?) => punning, is this the intended word?
- "The wikidata data model" => uppercase "Wikidata" consistently, several occurrences
- "are provided separate" => separately
- "frefuent" => frequent
- "easier" => more easily

* Grammar (please carefully re-read the phrases and reformulate)

- Check "but exact value not known for the property"
- Check "Normalizing datasets to a common ontology is the first step towards data integration and fusion but most companies"
- Check "replaces the placeholder with a space the wiki-title value"
- Check "This lead to a simpler design"
- Check "IRI-splitted triples"
- Check "Following we list SPARQL query examples for simple and reified statements."
- Check "Converting a dataset to a more used and well-known schema, it makes it easier to integrate the data."

* Typography and Layout

- Remove spaces before footnote ("statements ²", "property. ⁶")
- Consistently use footnotes after periods like so: foo.⁶
- Use nice quotes (search for "No value"; it should be rounded quotes)
- No space around dash ("property - value pair")
- Move Listing 1 to the top.
- Figure 2: use correct character casing (SPARQL, JSON,…)
- Use em-dashes "- one for every mapping -"
- Don't split Listing 3
- dbo:WikidataSplitIri overflows
- Footnote 25 appears right after the line break, move it up to the previous line.
- Don't split Listing 5
- Reference 3 is broken (non-ASCII characters)

Full disclosure: this paper was reviewed by Google employee Thomas Steiner (tomac@google.com), involved in the Freebase to Wikidata migration efforts.

Review #2
By Heiko Paulheim submitted on 04/Jul/2016
Suggestion:
Major Revision
Review Comment:

The paper introduces a Linked Data version of Wikidata which is deeply linked to DBpedia. The authors claim that the usability of Wikidata will improve by serving its data closer to the ontology and standards used in DBpedia. While showing an interesting and probably useful dataset, the paper itself requires a careful rework, as it contains some imprecisions and doubtful or unsupported claims. Furthermore, it seems to have been written in some haste, since it contains quite a few unnecessary language mistakes.

In the following, I provide a list of shortcomings of the paper. In sum, there are too many of them to recommend acceptance in the paper's current state.

Doubtful/unsupported claims:
* On p.1, the authors claim that Wikidata is schemaless and comes without an ontology. At a first inspection, this seems doubtful. There are, e.g., symmetric [1] and transitive [2] properties in the Wikidata schema.
* Still on p.1, the authors claim that there is no qualitative and quantitative comparison between Wikidata and DBpedia. While not at detail level, a coarse-grained comparison is presented in [3].
* p.2: "people face difficulties when confronted with the young and still evolving Wikidata schema" - this is an unsupported claim, unless a user study or something similar is quoted. Furthermore, it might be useful to distinguish between ease of use for data consumers and data providers/editors.
* p.5: "The DBpedia extraction framework already takes care of the correctness of the extracted datatypes during extraction" - it should be better explained what exactly is done here (do you refer to datatype or object properties? do you check for ontology conformance?) and how. A reference might help.

Imprecisions:
* on p.1, the publication model of Wikidata should be mentioned along with that of DBpedia
* 4.1.2 the value transformations should be explained using an example
* In 4.2 ("Validation"), I would rather talk about "inconsistent triples" than "errors".
* p.5: "The wikidata model allows the same values as objects of different statements" - as of my understanding, this is in the nature of the RDF graph model anyways. The authors should explain that issue better.
* p.8: "It also part of short-term plan to fuse all DBpedia data into a single namespace" - I guess that this implies some problems w.r.t. identity and conflict resolution. The authors should at least acknowledge that this is not trivial.

Other remarks:
* The proposed equivalence relations introduced in 4.1.1 might break OWL DL conformance. Although this is not a show stopper, I would appreciate a small statement and discussion about this point.
* The same holds for the use of reification described in 4.3. In Listing 4, dbo:startDate has a defined domain dbo:Event, so that the statement in the example can be inferred to be an event at the same time. Some discussion would be much appreciated.
* Table 6 should include labels of Wikidata properties
* p.8: "Although it is early to identify all possible use cases" -> I think this is impossible at any point in time

Missing sources:
* p.1, a source (e.g., spec link) for Named Graphs should be given
* p.3: "DBpedia has a big community" - a number would be more precise, a source for that number would be truly scientific
* p.3: "we add another bubble to the LOD cloud" - a reference to the current state of the LOD cloud would be appropriate
* p.4: add a reference to the DBpedia coordinates dataset

Language/formatting issues:
* overall: Wikidata is sometimes written with a capital "W", sometimes not
* p.1: "quantitavive"
* p.2: "As a result..." - this sentence is somehow twisted. Plus, it reads as if the improved usage is a result of the difficulties of Wikidata.
* p.2: "Wikidata is *a* community-created..."
* p.2: Footnote 5 is missing
* p.3: "The DBpedia Information Extraction Framework observed..." the DIEF is not the observing actor here.
* Fig.2: JSON should be capitalized
* p.5: "DBpedia has already" -> "already has"
* p.5: "When the value is an IRI" -> "*If* the value..."
* p.7: paoperties
* p.7: frefuent
* p.9: reference 3 has an encoding problem

[1] https://www.wikidata.org/wiki/Q18647518
[2] https://www.wikidata.org/wiki/Q18647515
[3] http://www.semantic-web-journal.net/content/knowledge-graph-refinement-s...

Review #3
By Denny Vrandecic submitted on 12/Jul/2016
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies; ideally using the 5 star rating provided here.

Review for
“Wikidata through the Eyes of DBpedia”
Ali Ismayilov, Dimitris Kontokostas, Soeren Auer, Jens Lehmann, Sebastian Hellmann

I pointed out when I was asked to do the review that I might be regarded as (and possibly also actually be) biased in reviewing this paper, due to my relation with Wikidata. I only agreed because my review would be non-anonymous and published, so that a possible conflict of interest would be obvious and the review would be open to public scrutiny. - Denny Vrandecic

The paper describes the work being done in order to add a Wikidata dump to the DBpedia datasets, and thus to provide the whole universe of Wikimedia related projects in a single, downloadable source, using a single ontology.

In general, the paper does not really address how the two datasets are complementary, and I do not have the feeling that the discussion of the advantages and disadvantages of the two datasets with regard to each other is sufficient and fair. I had the feeling the paper was mostly dwelling one-sidedly on the advantages of DBpedia. A few major advantages of Wikidata that I would expect to see mentioned:

A) If you find an error or omission in Wikidata, you can actually go and fix it instantaneously. There is no such mechanism for DBw.

B) For example, the death of David Bowie - more than half a year ago - is still not in the DBw dataset (as per July 12, 2016 as checked on http://wikidata.dbpedia.org/page/Q5383), but the death was updated in Wikidata within minutes. The paper does not make any mention of the fact that there is, through the way it is designed, a possible and inherent large delay in the freshness of the data.

C) Human-readable IRIs are presented solely as an advantage, without discussing issues of anglo-centricity or the point that the relevant W3C documents suggest to use opaque URIs.

D) There is no mention of the licensing of the DBw dataset, which seems more restrictive than the licensing of Wikidata, if I understand the footnote on the page of Bowie correctly (it says the data is published under CC-BY-SA, whereas Wikidata uses CC0. The VOID file at http://wikidata.dbpedia.org/downloads/void.ttl does not state anything about the license).

I recommend a major revision, because I would really like to see a more fair and balanced comparison of the two datasets. I would like to be able to point to this paper when I get asked for a discussion of the relative merits of the two datasets, and currently I would not feel comfortable with doing that.

Most of the following points are details that are easy to fix. I am happy to have a public conversation on any of the points raised - I expect the authors to defend a few of the points I call out. But most of the following points are rather obvious and should simply be fixed.

Section 1:

1) “Wikidata … developed its own data model” - I do not understand what you mean with “its own data model”, as the data is being exported and provided in RDF. What is the sense of the term “data model” so that Wikidata has its own, but DBpedia does not? Isn’t it both just “RDF vocabularies” or “OWL ontologies”?

2) “The multilingual DBpedia ontology, organizes the …” remove comma

3) Structure: You claim that the DBpedia ontology organizes the data, while Wikidata is schemaless. I am not sure I understand the difference. Both DBpedia and Wikidata use an RDF vocabulary / OWL ontology. Why is the one schemaless, whereas the other organizing?

4) Curation: “Wikipedia authors thus unconsciously also curate the DBpedia knowledge base” - I sure don’t hope so ;) - remove “unconsciously”, add “as a side effect”.

5) Curation: don’t split it “Medi-aWiki”, but rather “Media-Wiki” (or not at all)

6) Curation: the point omits the fact that Wikidata data is being widely automatically compared to Wikipedia content, and has partially a very high visibility (by virtue of being directly displayed in Wikipedia), and that DBpedia - in case of an extraction error - does not allow for direct curation.

7) Publication: this is listed as a difference between DBpedia and Wikidata, but to the best of my knowledge this seems pretty equal between those two. Can you elaborate?

8) Coverage: “there is no study yet that performs a qualitative and quantitative comparison”. Take a look at http://www.semantic-web-journal.net/system/files/swj1141.pdf

9) “We argue that the result of this complementarity” - you described a few differences, but I didn’t really see how they are complementing each other at this point. I would like to see that improved to actually focus on the complementary strengths of each other.

10) “Wikidata would be better integrated into the network of Linked Open Datasets” - That’s a claim, but why would that be the case? What is currently missing for integration into LOD?

11) “...and Linked Data aware users had a coherent way …” I am not sure if you are trying to say they “would have a coherent way” or “if they had a coherent way”. Also, isn’t there a coherent way already? HTTP and related Semantic Web standards?

12) “...the right balance between coverage and quality.” The paper does not discuss quality of the two resources, and also how a user of the datasets could actually choose between coverage and quality. I would like to see this argument expanded.

Page 2

13) Figure 1’s description has a footnote 5, which does not exist.

14) “While DBpedia has a … commonly used ontology” -> replace “commonly” with “widely”, and also, where is the ontology used? (i.e. citations needed)

15) “... people face difficulties when confronted with … Wikidata schema.” -> Reference needed that this is indeed the case.

16) “As a result, with the DBpedia Wikidata (DBw) dataset can be queried with the same queries that are used with DBpedia.” rephrase

Section 2

17) After the first Wikidata and before citation [7] there is too much whitespace

18) “Wikidata is community-created knowledge base” Add “a”

19) “... and more than 2.7 million registered users…” add “has”. Also the number 2.7 million is highly inflated (it is the number of registered users over all Wikimedia projects, and includes in particular also spam accounts), and the number of active users (about 6,700) is much more interesting. Source: https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaWIKIDATA.htm

20) “... and an optional reference.” -> “and one or more optional references”, i.e., there can be more than one reference.

21) “No value” marker means…” add “the”

22) “unknown value” marker means…” add “the”

23) “but exact value not known” -> “but the exact value is not known”

24) Footnote 6 seems to be wrong. The text talks about custom value, but the footnote defines a SPARQL prefix.

25) Listing 1, description: “Douglas Adams” -> “Douglas Adams’ “, add apostrophe at end

26) In the second paragraph, after DBpedia and before reference [5] there is a lot of whitespace.

27) “The DIEF is able” -> I would drop the “The”

28) “... and allows the easier integration of different Wikipedia language editions.” -> easier than what?

Section 3

Page 3

29) Section 3 says that it described design decisions. What is the design decision in the second paragraph, titled “Re-publishing minted IRIs as linked data”? What are the alternatives, i.e. what did you decide against?

30) “... most companies … keep the datasets hidden”, and you list Freebase as an example hidden dataset. What is hidden in Freebase?

31) “Datatype support in Wikidata started at the end of 2015” -> Maybe I misunderstand, but datatypes have been introduced to Wikidata in February 2013 IIRC.

Section 4

32) “...to map, in real-time, Wikidata…” - isn’t Wikidata mapped from the dumps? What do you mean with real-time?

Section 4.1

33) “... we define Wikidata property to ontology mappings.” -> “to DBpedia ontology mappings.”

Section 4.1.1

34) “... and at the same time crowd-source the DBpedia ontology.” -> Section 1 states that the DBpedia ontology is “relatively stable”. It sounds to me that crowd-sourcing and the promise of stability are highly contradictory. What am I missing?

Page 4

35) Fig. 2 has a box labeled with “Virtuose”. Should be “Virtuoso”

Section 4.1.2

36) “The value transformation … as functions.” Sentence is ungrammatical.

37) “$2 replaces the placeholder with a space the wiki-title value…” Sentence is ungrammatical

Page 5

38) “If mappings for the current Wikidata property exist…” - and if not?

Section 4.2

39) “If a DBpedia class is found, all super types are assigned…” Does this follow only the DBo or also Wikidata’s P279?

40) “After the redirects are extracted, a transitive redirect closure (excluding cycles) is calculated” -> are there any cycles? There shouldn’t be. Is this reported?

41) “The first step is performed in real-time…” As above - what does real-time mean here?

Section 4.3

42) “We append an additional hash on the IRI” - a hash of what?

Page 6

43) Table 1, row “-Other”, “Wikidata statements DBpedia ontology”, sentence incomplete

44) Table 1, “Mapping Errors” list 2.9M errors. Why so many?

45) “Aliases label and descriptions” - add one or two commas.

Section 6

46) The title of the Section is “Statistics and Evaluation”, and whereas I see plenty of statistics, I didn’t see much of an evaluation.

Page 7

47) Table 6, would be more useful if you also provided the labels for the properties

48) The count in Table 6 uses a comma as a separator instead of a dot (as in Tables 3 and 4).

49) “Wikidata does not have … date paoperties” -> properties

50) “most frefuent at the moment” -> frequent

51) “We generated [854k] redirects - including transitive.” How many of these were transitive? I would not expect many in Wikidata, that is why I am asking.

52) “According to Table 2, a total of 2.9M errors originated from the schema mappings and 42k triples did not pass” - Could you provide a bit more insight into the nature of these millions of errors? Are these problems in Wikidata, the mapping, in DBpedia?

Section 7

53) “The DBpedia publishing workflow guarantees: a) long-term availability through the DBpedia Association” - Are you expecting the DBpedia Association to be able to guarantee a more long-term availability than the Wikimedia Foundation?

54) “b) agility in following best-practices as part of the [DIEF]” - I am trying to understand this sentence, but I fail. What does it mean?

55) “In addition to the regular and stable releases of DBpedia we provide more frequent dataset updates from the project website.\footnote{http://wikidata.dbpedia.org/downloads}” - What is the frequency of the regular releases? I checked the given URL (on July 12, 2016), and the three downloads there were named 20150307, 20150330, and 20160111 - so the last update was more than half a year old, the one before was 10 months earlier. How often is “more frequent”?

Page 8

Section 8

56) “Since DBpedia provides transitive types directly, queries where e.g. someone asks for all ‘places’ in Germany can be formulated easier.” -> The experience with the Wikidata SPARQL endpoint shows us that the materialization of transitivity seems to have hardly an effect on the ‘easiness’ of query formulation, given that the endpoint supports transitivity. I would like to see some support for this claim before seeing it published.

57) “Finally, the DBpedia queries can, in most cases directly or with minor adjustments, run on all DBpedia language endpoints.” - What is the advantage of that? Sure, I understand, you can take a query from the French DBpedia endpoint and run it on the Greek endpoint, but why would you? Wouldn’t a single unified dataset with all the knowledge be more useful for most applications? For what application would this be an advantage?

58) Listing 5: In #DBw, can I not use en:Germany instead of dw:Q183?

59) Listing 5: In #wikidata, you can also use the standard-conforming FILTER (LANG(?label)=’en’) instead of the SERVICE call. But if you insist on using the proprietary SERVICE call, it would make much more sense to use an example with at least three different variables, or else the advantage of the SERVICE call is not visible.

60) Also, in the #wikidata query, you probably would like to use wdt:P31/wdt:P279* as the property - you forgot the *, or else the answers won’t be comparable (I assume that you materialize the whole transitive closure in DBw).
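The distinction in point 60 is between materializing the transitive closure of the class hierarchy up front (so that a single triple pattern suffices) and evaluating the `wdt:P31/wdt:P279*` path at query time. A minimal sketch of such materialization over a toy hierarchy (class names are illustrative, not taken from either dataset):

```python
def transitive_closure(edges):
    """Materialize the transitive closure of a subclass hierarchy.

    `edges` maps each class to its direct superclasses (analogous to
    wdt:P279). The returned dict maps each class to *all* of its
    superclasses, so a single triple lookup can replace the
    P31/P279* property path. Cycles are tolerated via the seen-set.
    """
    closure = {}
    for node in edges:
        seen = set()
        stack = list(edges.get(node, []))
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(edges.get(cur, []))
        closure[node] = seen
    return closure

# Toy hierarchy: city -> human settlement -> place
hierarchy = {"city": ["human settlement"], "human settlement": ["place"], "place": []}
closure = transitive_closure(hierarchy)
```

With `closure` precomputed, asking for "all places" is a direct membership test on each entity's materialized types; without it, the query engine must traverse the hierarchy per query, which is exactly what the `*` in the property path expresses.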

61) Listing 6, #DBw: are you sure the first predicate is rdf:statement, and not rdf:subject?

62) Also, wouldn’t it make sense to simplify the Wikidata query to
?person p:P26/pq:P580 ?marriage_date
instead? Instead of the three lines needed for the triple pattern in DBw, we would have a single line in Wikidata.
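To make the contrast in points 61-62 concrete, here is a pure-Python sketch over toy triples showing both access patterns: the three-pattern reified lookup (rdf:subject plus a qualifier on the statement node, as in the DBw listing) versus the two-hop path lookup (p:P26/pq:P580). All identifiers and the example fact are illustrative, not taken from the actual dataset:

```python
# Toy triple store holding the same fact twice: once as a W3C-style
# reified statement (DBw-style) and once as a Wikidata-style statement
# node with a qualifier. The fact: ex:alice married (P26/spouse),
# with start time (P580/startDate) "2001".
triples = {
    # DBw-style reification: the statement node is located via rdf:subject
    ("ex:stmt1", "rdf:subject", "ex:alice"),
    ("ex:stmt1", "rdf:predicate", "dbo:spouse"),
    ("ex:stmt1", "dbo:startDate", "2001"),
    # Wikidata-style: p:P26 points to a statement node carrying qualifiers
    ("ex:alice", "p:P26", "ex:stmt2"),
    ("ex:stmt2", "pq:P580", "2001"),
}

def objects(s, p):
    """All objects o such that (s, p, o) is in the store."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

# Reified lookup: first find statement nodes whose rdf:subject is alice,
# then read the qualifier off each node (multiple patterns, as noted).
stmts = {s for (s, p, o) in triples
         if p == "rdf:subject" and o == "ex:alice"}
reified_dates = {d for s in stmts for d in objects(s, "dbo:startDate")}

# Path-style lookup (p:P26/pq:P580): two hops chained in sequence,
# which SPARQL can express as a single property-path triple pattern.
path_dates = {d for node in objects("ex:alice", "p:P26")
              for d in objects(node, "pq:P580")}
```

Both lookups return the same date; the difference the review is pointing at is purely one of query surface, which is worth spelling out in the paper's comparison.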

63) “Converting a dataset to a more used and well-known schema, it makes it easier to integrate the data.” remove “, it”. Also, for the claim that DBpedia is still more used and better known than Wikidata, I would like to see some supporting material for that.

64) “The fact that datasets are split according to the information they contain makes data consumption easier when someone needs a specific subset” - I didn’t see this split mentioned anywhere in the paper. Care to elaborate?

65) “... and fill in semi-structured data that are being moved to Wikidata.” Only for the data that is moved to Wikidata? You are not planning to use the data that is originally entered into Wikidata, and has not been moved from Wikipedia? How do you keep track of whether some data has been moved or has been originally entered into Wikidata?

66) “It is also plan of short-term plan ...” - add “a”

67) “...to fuse all DBpedia data into a single namespace…” - What does this mean? Given that the dbo-namespace uses the labels of the properties, like dbo:country, if you merge them in a single namespace, does that not lead to namespace clashes?

Section 9

68) “...the daily number DBw visitors...” - add “of”

69) “Which indicates that this dataset is heavily used” - how do you figure that? I mean, which numbers constitute heavy use, medium use, light use, and how did you decide that?

Page 9

70) Reference 3: unicode problems in the authors names, also rdf is not capitalized.