Micro Analysis of Linked Open Data Quality and Graph Traversals for Cultural Heritage Research

Tracking #: 2296-3509

Go Sugimoto

Responsible editor: 
Special Issue Cultural Heritage 2019

Submission type: Full Paper

Abstract:
In cultural heritage, many projects have generated a large amount of Linked Open Data (LOD) in the hope that it transforms scattered data into connected global graphs, which are supposed to advance our research with machine-assisted intelligent tools. However, the investigation of the aggregation and integration of heterogeneous LOD is rather limited, partly due to data quality issues. To this end, the author examines end-users' "researchability" of LOD, especially in terms of graph connectivity and traversability. Three W3C-recommended properties (owl:sameAs, rdfs:seeAlso, and skos:exactMatch) as well as schema:sameAs are inspected for 80 instances/entities across ten widely known data sources in order to create traversal maps. In addition, data content (literals, rdf:about, rdf:resource, rdf:type, skos:prefLabel, skos:altLabel) is assessed to capture an overview of data quantity and quality. The empirical micro study with network analyses reveals that the major LOD sources provide a relatively low number of outbound links, proprietary RDF properties, and few reciprocal vectors. These quality issues suggests that the LOD may not be fully interconnected and centrally condensed, confirming the outcomes of previous studies. Thus, their homogeneousness casts doubt on the possibility of automatically identifying and accessing unknown datasets, which implies the need for traversal strategies to maximise research potential.
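The traversal-map construction the abstract describes amounts to a breadth-first walk that follows only the four listed linking properties. A minimal sketch in Python over an invented toy link graph (the URIs and triples below are illustrative assumptions, not the paper's data, which would come from dereferencing each URI):

```python
from collections import deque

# The four properties the study follows when traversing (per the abstract).
TRAVERSAL_PROPERTIES = {
    "owl:sameAs", "rdfs:seeAlso", "skos:exactMatch", "schema:sameAs",
}

# Toy link graph: entity URI -> list of (property, target URI).
# These triples are invented for illustration only.
LINKS = {
    "dbpedia:Mozart": [
        ("owl:sameAs", "wikidata:Q254"),
        ("rdfs:seeAlso", "viaf:32197206"),
        ("dbo:birthPlace", "dbpedia:Salzburg"),  # not an equivalence link
    ],
    "wikidata:Q254": [("skos:exactMatch", "getty:500014172")],
    "viaf:32197206": [("schema:sameAs", "wikidata:Q254")],
}

def traversal_map(start: str) -> dict:
    """Return {entity: hop distance} for entities reachable from `start`
    by following only the allowed traversal properties."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for prop, target in LINKS.get(node, []):
            if prop in TRAVERSAL_PROPERTIES and target not in seen:
                seen[target] = seen[node] + 1
                queue.append(target)
    return seen
```

In this toy graph, `dbpedia:Salzburg` is never reached because `dbo:birthPlace` is not one of the traversal properties, which mirrors the study's restriction to explicit equivalence/linking predicates.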


Solicited Reviews:
Review #1
Anonymous submitted on 29/Oct/2019
Major Revision
Review Comment:

This paper presents an interesting micro analysis of Linked Open Data resources, assessing the researchability they enable in the realm of Cultural Heritage. The paper is well written and the analysis is technically correct, with proper justifications for most of the decisions made. However, in many aspects it is reminiscent of similar studies conducted in the past, and the particular attention supposedly given to the Cultural Heritage domain (which would be the main contribution) is imprecise: starting from the selected entities listed in Table 1, the study is clearly about History and Geography, which merely provide the context in which cultural heritage developed. I would ask the author either to shift the focus and argumentation to these areas, or to include in the study entities related, for example, to artworks and their meanings or interpretations, or to artists and their styles, reasons, purposes or inspirations. In any case, it is highly recommended to strengthen the conclusions (some sentences, like the one about the good practice of reusing terms from well-known RDF vocabularies, are rather shallow).

- "These quality issues suggests" --> "These quality issues suggest"
- "DBpeida" --> "DBpedia"

Review #2
Anonymous submitted on 30/Oct/2019
Major Revision
Review Comment:

The paper describes a study on the connectivity and traversability of linked open data published and/or used in the cultural heritage domain. The analysis is carried out by investigating ten prominent data sets, some of them generic (e.g. DBpedia, YAGO, Wikidata) and others more specific to cultural heritage (e.g. VIAF, Getty vocabularies). The objective is to determine whether these datasets can be traversed in an automated manner, i.e. by following only explicitly defined links via owl:sameAs, rdfs:seeAlso, skos:exactMatch, and schema:sameAs, in order to potentially create aggregated data views for instances included in different datasets, one of the premises of the linked data vision.
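As a rough illustration of the per-source tallying such a connectivity study involves (the link records below are invented for illustration, not taken from the paper), outgoing and incoming links can be counted per dataset to spot the "centrally condensed" hubs the study discusses:

```python
from collections import Counter

# Invented sample of (source dataset, property, target dataset) link records;
# the real study derives these from 80 entities across ten data sources.
links = [
    ("DBpedia", "owl:sameAs", "Wikidata"),
    ("DBpedia", "owl:sameAs", "YAGO"),
    ("Wikidata", "schema:sameAs", "VIAF"),
    ("VIAF", "rdfs:seeAlso", "Wikidata"),
    ("Getty", "skos:exactMatch", "Wikidata"),
]

# Outgoing links per source dataset, as in per-source tallies.
outgoing = Counter(src for src, _, _ in links)
# Incoming links per target dataset, useful for spotting hub datasets.
incoming = Counter(dst for _, _, dst in links)
```

Even on this toy sample, a single generic knowledge base (here Wikidata) attracts most incoming links, which is the kind of skew the paper reports.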
Positive points of this paper include:
- The subject of the study is timely and interesting. The linked data initiative and corresponding standards have existed for some time, so it is interesting to examine the level and manner of their adoption in cultural heritage. On the other hand, there are still discussions on best practices for linked data publishing, so it is interesting to take a look at the current state of affairs.
- The methodology is described in an appropriate manner, including assumptions and limitations, in the first sections of the paper.
- The study made efforts to cover the subject from different perspectives: in addition to a statistical analysis of the connectivity of the datasets, a more qualitative manual analysis at instance level is also carried out.

Weak points:
- The English in the sections detailing most of the analysis is poor. I would definitely recommend a thorough proof-reading. I assume that some strange terminology, such as "logical framework" and "reciprocal vector links", also results from this(?)
- The analysis is limited in coverage. Most importantly, it does not include cultural artifacts such as works of art, monuments and/or books. Although the author mentions this as a conscious choice when describing the scope of the paper (in footnote 3), and it is certainly impossible to cover every type of artifact, I still believe that these artifacts are too important for cultural heritage to be ignored completely. In the same manner, the absence of Europeana, a major source of linked data in the domain, is an important limitation, even if the corresponding linked data is only available in JSON or the API is in alpha version, which are mainly technical reasons.

Minor points:
- The first section of the paper is titled "Background - Linked Open Data Quality"; I would replace that with "Introduction".
- Section 2.1 "Related Work" is part of Section 2 "Methodology", which sounds strange. I would make "Related Work" a completely separate section.

Typos, etc. (just a few of them, a really thorough proof reading is required):
***Section 1***
distributed data into connected global graphs -> into a connected global graph
which facilitate -> which would facilitate
logical framework:?? I don't understand this
One of the problems is the quantity and quality of data -> ?? Two of the most important issues are the limited quantity and low quality of data ??

***Section 2.1***
is the key for the users to navigate themselves in the network -> indicates whether users can navigate in the network

***Section 2.2***
This enables the users' automation of LOD traversals:?? I don't understand this

***Section 2.3***
Provide proper citations for Europeana EDM, CIDOC-CRM, FRBR, DCMI

***Section 2.4***
a various type of charts -> various types of charts

***Table 2***
4 dual identify data -> ??? 4 duplicated entities ???
10 dual identify data -> ??? 10 duplicated entities ???

***Section 3.2***
8 other links are found which bound for outside the 10 data sources -> ??? 8 other links are found to data sources that are not included in this study ???
and average (58.8) of the whole entities -> ??? and average (58.8) of the whole set of entities ???
understandable that Getty -> understandable since Getty

***Section 3.3***
Apart from DBpedia -> In addition to DBpedia
are not widely recognised -> are not widely used
no sources links -> no sources link
A little deviation -> A small deviation
Compare to -> Compared to
clearly exposes -> clearly illustrates
Unlike agents, Wikipedia is connected by YAGO -> Unlike for agents, Wikipedia is connected to YAGO
access detail information -> access detailed information
the semantic of rdfs:seeAlso is weak: ?? I don't understand this
A typical case of an entity -> A typical entity

***Section 3.4***
The economy of the creation of date entities may show serious issues: ?? I don't understand

***Section 3.5***
yet even more idiosyncratic than other entities: ?? I don't understand

***Fig. 5***
The amount of outgoing links to 10 data sources found in 20 agent entities -> The amount of outgoing links to the 10 data sources and 20 agent entities examined in this study

***Section 3.7***
The lowest source VIAF still hold over 37.2% -> The lowest source in terms of links to the other datasets, VIAF, still hold...
The statistics indicate the closed and close connections of 10 data sources..: ?? I don't understand
[11] note -> Ding et al. [11] note
Overall percentage is expectedly low -> The overall percentage is, unsurprisingly, low
What is a content property???

***Section 4.1***
for the representative entities for humanities research -> for representative entities of humanities research

***Section 4.2***
the first choice should be given to the standard properties -> standard properties should be preferred
reciprocal links are needed with care -> reciprocal links need to be added with care
reciprocal vector links:?? what is this?
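For readers puzzled by the same term: one plausible reading of "reciprocal links" is a link from A to B that is matched by a link from B back to A. Under that assumption, reciprocity can be checked mechanically; a minimal sketch over invented directed links:

```python
# Invented directed equivalence links: (source URI, target URI).
links = {
    ("dbpedia:Mozart", "wikidata:Q254"),
    ("wikidata:Q254", "dbpedia:Mozart"),   # reciprocated pair
    ("dbpedia:Mozart", "viaf:32197206"),   # one-way only
}

def split_reciprocal(links):
    """Partition directed links into reciprocated links and one-way links."""
    reciprocal = {(a, b) for (a, b) in links if (b, a) in links}
    one_way = links - reciprocal
    return reciprocal, one_way
```

Counting the two partitions over a real link table would quantify the "few reciprocal vectors" finding reported in the paper.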

Review #3
By Efstratios Kontopoulos submitted on 10/Nov/2019
Minor Revision
Review Comment:

The paper presents a study that investigates the degree of “interconnectivity” between LOD sources. Although the specific work is focused on Cultural Heritage (CH), the results are relevant to the majority of domains, since the problems discussed are not specific to a particular domain.

All in all, the paper is well-written and easy to follow, although there are several typos and grammatical errors. Please make sure to fix them in the next version of the paper. Otherwise, I found the paper full of interesting insights and thought-provoking discussion. Although I am not a CH expert, this work looks very relevant for people working in the specific domain. My only concern, most probably because I lack experience in the CH domain, is that I am not certain whether CH researchers indeed follow this "link-hopping" approach during their research. If they do, then this work does make sense for them.

Below are some more elaborate remarks per section:

Section 1: It is stated that 10% of CH material is digitized, which, in my view, is a very interesting stat. The author cites a white paper to support this claim. Are there perhaps any other relevant sources to support that?

Section 2.1: Could you please give a brief explanation about the “problem of co-reference”?

Section 2.2: I particularly liked the first paragraph in this subsection, which gives good pointers indicating the contributions of this work.

Section 2.4:
- Besides the sources you selected, are there any other potential “contenders”? Simply mentioning a few of them would probably be a good reference point for CH practitioners.
- I would suggest replacing "EXCEL file" with "spreadsheet".

Section 3 (and subsections): BabelNet is referred to several times in various places in the paper as "BebelNet" - please fix.

Section 4:
- It is stated in 4.1 that "…high-quality datasets are hugely biased toward a couple of data sources, especially generic knowledge bases." Also, in 4.2 it is stated that "…they are not aware of data sources and their ontologies in their query time." This problem is absolutely true, and it is not limited to CH! A solution could be for content creators to be aware of other, more "low-level" vocabularies that are more relevant to their work. But how could this be achieved? I don't have a concrete answer - it's just some food for thought.
- In 4.2 it is stated that "…it seems necessary for the web community to help major LOD dataset maintainers to identify incoming LOD as much as possible and enrich the datasets…". Unfortunately, this sounds more like wishful thinking and less like concrete pointers for the future. The study in this paper is very useful, and it would be even better if the author could give some more elaborate pointers (a "cookbook", perhaps) with concrete steps that could be taken to mitigate this serious problem.