Link maintenance for integrity in linked open data evolution: Literature survey and open challenges

Tracking #: 2255-3468

Authors: 
Andre Regino
Julio Kiyoshi Rodrigues
Julio Cesar dos Reis
Rodrigo Bonacin
Ahsan Morshed
Timos Sellis

Responsible editor: 
Oscar Corcho

Submission type: 
Survey Article
Abstract: 
RDF data has been extensively deployed describing various types of resources in a structured way. Links between data elements described by RDF models stand for the core of Semantic Web. The rising amount of structured data published in public RDF repositories, also known as Linked Open Data, elucidates the success of the global and unified dataset proposed by the vision of the Semantic Web. Nowadays, semi-automatic algorithms build connections among these datasets by exploring a variety of similarity computation methods. Interconnected open data demands automatic methods and tools to maintain their consistency over time. The update of linked data is considered an key process due to the evolutionary characteristic of these structured datasets. However, data changing operations might influence well-formed links, which turns difficult to maintain the consistencies of connections over time. In this paper, we propose a thorough survey that provides a systematic review of the state-of-the-art in link maintenance in linked open data evolution scenario. We conduct a detailed analysis of the literature for characterising and understanding methods and algorithms responsible for detecting, fixing and updating links between structured data. Our investigation provides a categorisation of existing approaches as well as describes and discusses existing studies. The results reveal an absence of comprehensive solutions suited to fully detect, warn and automatically maintain the consistency of linked data over time.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Mikel Emaldi Manrique submitted on 30/Aug/2019
Suggestion:
Major Revision
Review Comment:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

In this paper, the authors present a survey of methods and algorithms for detecting, fixing and updating links between LOD datasets. They formalize the problem of broken links appropriately, and they present and categorize the works related to the topic following a well-explained methodology. However, the authors should address the following recommendations in order to improve the paper.

(2) How comprehensive and how balanced is the presentation and coverage.

Regarding the coverage of the survey, the authors have analyzed 22 works, which seems sufficient. To find works related to the topic, they entered a set of keywords into the most popular scientific search engines. What is missing is a wider search of the most popular conferences related to the Semantic Web and Linked Data (e.g. ISWC and ESWC), and of their associated workshops as well. Inspecting those venues could surface further works that were not found initially through the search engines. The authors have established a set of exclusion rules for filtering works, but the motivation behind those rules is not explained.

The main issue regarding the coverage of the paper is that the authors include a set of works that, according to the description given, are not focused on detecting and/or fixing broken links between linked datasets (e.g. Auer and Herre (2007), Liu and Li (2011), Roussakis et al. (2015), Galani, Papastefanatos and Stavrakas (2016), Papavasileiou et al. (2009), Pernelle et al. (2016), Kondylakis et al. (2017) and Porzaferani and Nematbakhsh (2013)). Most of those works aim to detect modifications in datasets, but detecting broken links is not among their objectives. Those tools could be used as a basis for knowing that a dataset has been modified, but the task of determining whether this modification has produced a broken link, and how to fix it, is left to the user or to another subsystem. The authors should justify the inclusion of those works more consistently or replace them with tools that specifically address the broken link detection problem.
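
The distinction drawn above can be illustrated with a minimal Python sketch. It assumes that a change-detection tool has already reported which resources were removed between two dataset versions, and it shows the separate step of deciding which links are affected. The triple representation, the function name `find_broken_links` and the sample IRIs are hypothetical and are not taken from any of the surveyed works.

```python
from typing import Iterable, List, Set, Tuple

# A link is modelled as a (subject, predicate, object) tuple of IRI strings.
Triple = Tuple[str, str, str]


def find_broken_links(links: Iterable[Triple], deleted: Set[str]) -> List[Triple]:
    """Return the links whose subject or object was deleted.

    `deleted` stands in for the output of a change-detection tool
    (an assumption here): the IRIs removed between two versions.
    """
    return [t for t in links if t[0] in deleted or t[2] in deleted]


# Hypothetical link set between two datasets.
links = [
    ("ex:a1", "owl:sameAs", "other:b1"),
    ("ex:a2", "owl:sameAs", "other:b2"),
]
# Hypothetical delta: the target dataset removed other:b2.
deleted = {"other:b2"}

broken = find_broken_links(links, deleted)
```

Knowing that `other:b2` was deleted is only the input; the broken link is identified in the second step, and repairing it would be a third step that this sketch does not attempt.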

Regarding the discussion section, a more extensive analysis of each work is expected. In this section, the authors present their results in an unstructured manner, which makes it hard to reach a clear conclusion. They should focus on answering the research questions in more detail, explaining at length the data presented in Table 17. A quantitative evaluation is missing; it would show the accuracy/recall of the different tools when detecting broken links. In addition, the authors should write a more extensive conclusion section.
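
As a hedged illustration of the kind of quantitative evaluation meant here, the following Python sketch computes precision and recall for a tool's reported broken links against a reference set of links known to be broken. The function name `precision_recall` and the link identifiers are hypothetical; no measurements from the surveyed tools are implied.

```python
from typing import Set, Tuple


def precision_recall(reported: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Compute (precision, recall) of reported broken links.

    reported: link identifiers a tool flagged as broken.
    actual:   link identifiers that are broken according to a reference set.
    Both measures are defined as 0.0 when their denominator is empty.
    """
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: the tool flags three links, two of them correctly,
# while four links are broken in total.
reported = {"l1", "l2", "l3"}
actual = {"l2", "l3", "l4", "l5"}
precision, recall = precision_recall(reported, actual)
```

In this made-up example the tool is right about two of its three reports and finds two of the four broken links.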

(3) Readability and clarity of the presentation.

The paper is clearly written; however, some concepts need further explanation or reformulation. In Section 2, the example of a triple should be formulated in a more standardized way, like:

dbr:Abraham_Lincoln dbo:birthDate "1809-02-12"^^xsd:date ;
    dbo:birthPlace dbr:Hodgenville\,_Kentucky .

The nomenclature used for the categories of approaches should match the names of the subsections in Section 4 (e.g. the "Informed by changes" category is named "Change detection" in the title of the corresponding subsection). Regarding the "High Level Modifications" category, the authors should illustrate what they mean by a "high-level change" in each of the analyzed works.

Regarding the bibliography, the authors mix two different citation styles (i.e. author-year, "Popitsch and Haslhofer (2011)", and numerical, "[11]"). This must be fixed, as it makes references hard to find in the bibliography.

(4) Importance of the covered material to the broader Semantic Web community.

This survey tackles a topic of interest to the community that has not yet been fully solved, as there seems to be no tool for detecting and/or fixing broken links in a fully automatic manner.

Review #2
Anonymous submitted on 17/Sep/2019
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

This paper presents a literature review on the problem of link maintenance in linked open data. It focuses on five different link types, namely owl:sameAs, owl:differentFrom, rdfs:seeAlso, skos:exactMatch and skos:closeMatch. The authors applied a systematic review, which led to the selection of 22 relevant papers classified into 7 different categories, such as approaches based on notifying dataset managers, on the use of deltas between datasets, or on the use of ontologies to represent changes. The authors finish the paper with a discussion and by stressing some future challenges in the area.

This paper is the first survey paper focused on the link maintenance problem in RDF datasets. It can be used as a basis for starting work on the burgeoning topic of RDF data evolution and, more specifically, on link maintenance. However, not all the selected papers deal with link evolution; some of them deal with the general problem of RDF data evolution. No particular processing is proposed for dealing with link maintenance. Maybe the authors can provide more details on this specific topic for most of the papers presented. Moreover, the paper lacks references to existing survey papers that deal with other aspects of data and knowledge evolution.

(2) How comprehensive and how balanced is the presentation and coverage.
The set of selected papers covers the main existing works that focus on data evolution, excluding an extensive body of work that is more dedicated to ontology (i.e. conceptual level) evolution, which is not the purpose of this paper. The queries used in the systematic review include the main and most popular terms that refer to the evolution of RDF data and links. However, it is surprising that the authors do not give details on the specific links (among the five) addressed in the presented papers. For example, one may wonder which specific process is applied for owl:sameAs and which one is applied for rdfs:seeAlso, as they do not have the same semantics. Moreover, the paper may be extended by giving the semantics of these five links and explaining why these links were chosen and not others.

(3) Readability and clarity of the presentation.

The paper is well written and easy to follow. However, the paper can be improved by giving some comparative discussion of the papers in each category (the ones presented in Tables 10 to 16). Also, some categories may be grouped, like “Notification” and “Informed by changes”, since they both rely on approaches that are based on change detection.

(4) Importance of the covered material to the broader Semantic Web community.

The paper proposes a first survey on link maintenance and presents the most important works in the area. However, it can be improved by giving some information on the tools: whether they are available online, what their parameters are (if any), the need for human effort, …
Some tools (especially those that fix links) may be worth comparing experimentally with respect to their efficiency and also to the quality of their results (precision and recall).

Review #3
Anonymous submitted on 30/Jan/2020
Suggestion:
Major Revision
Review Comment:

The paper presents a survey that provides a systematic review of the state of the art pertaining to link maintenance problems in linked open data. The authors analysed the related literature and identified approaches and algorithms for detecting, fixing and updating broken/invalid links among and within datasets. They provide a categorisation of existing approaches and define open challenges.

The paper starts by introducing the work and giving an overview of related work. In Section 2, the authors formally define the link maintenance problem. Their systematic literature review methodology is then described in Section 3, where the authors define a set of five research questions. In Section 4, the authors analyse the papers resulting from the systematic survey. In Section 5, the authors answer the introduced research questions and present a set of open challenges related to the link maintenance problem. Finally, the paper is concluded in Section 7. The paper also includes one appendix, which presents more details about the paper collection process followed in the systematic survey.

Suitability: I found the presented categorization and the open challenges to be the strong points of the paper. The open challenges introduced by the paper are a good starting point for PhD students.

Comprehensiveness: The presented work covers most of the related work, except for the last research question; more details follow.

Readability: The paper is written in good English and is well structured, which makes it easy to follow the presented ideas.

Importance: The covered material is important to the Semantic Web community.

RQ-3 and RQ-4 are overlapping to some extent. In my opinion, having one RQ about linked data and the other about other data models will make things clearer.

I think that Section 4.8 misses a lot of related research in the field of link discovery, especially the machine learning algorithms for automatic link finding (see for example [1-3], and many more). The authors should consider such algorithms when answering RQ-5.

Section 4: I think adding one final category about hybrid systems that use more than one technique from the different categories introduced in Section 4 will complete the presented categorisation. See [5] for example. Also, the DSNotify framework introduced in the paper falls into this category.

I think that RQ-5 needs more investigation. I already know of some automatic techniques for link maintenance: some based on instance linking (e.g., [1-5]), some based on ontology matching ([5-7]), and many more.

[1] R. Isele, C. Bizer, Learning linkage rules using genetic programming, Proceedings of the 6th International Conference on Ontology Matching-Volume 814, pp. 13-24, 2011
[2] A.C.N. Ngomo, K. Lyko, Eagle: Efficient active learning of link specifications using genetic programming, Extended Semantic Web Conference, pp. 149-163, 2012
[3] M. A. Sherif, A.-C. Ngonga Ngomo, J. Lehmann, WOMBAT - A Generalization Approach for Automatic Link Discovery, 14th Extended Semantic Web Conference, Portoroz, Slovenia, 28th May - 1st June 2017
[5] F. M. Suchanek, S. Abiteboul, P. Senellart, PARIS: Probabilistic Alignment of Relations, Instances, and Schema, PVLDB, 5(3), 2011
[6] Gal, Avigdor, et al. "Automatic ontology matching using application semantics." AI magazine 26.1 (2005): 21-21.
[7] Bühmann, Lorenz, Jens Lehmann, and Patrick Westphal. "DL-Learner—A framework for inductive learning on the Semantic Web." Journal of Web Semantics 39 (2016): 15-24.

Other comments:
Abstract: “... is considered an key process” → “... as a key process”
Introduction: Add a reference to the definition of RDF triple in section 2, the first time you mention it
Section 2, also in other places: “A RDF triple” → “an RDF triple”
Section 2: “an unique” → “a unique”
Section 2: “SameAs” → “sameAs”, also use \texttt{} for all property names
Section 4.3: “after a number of changes in an certain elapsed time” → “after a number of changes after a certain elapsed time”
Section 4.4, also in other places: “DBPedia” → “DBpedia”
Section 4.5: “According to Galani, Papastefanatos and Stavrakas (2016)” → “According to Galani et al. (2016)”
Section 4.5: “... changed; in addition, ...” → “... changed. In addition, ...”
Section 4.5: “benefits with” → “benefits from”
Section 4.5: “increasing the level of abstraction of changes is proportionally related to the quantity of these types of changes, resulting in a greater level of abstraction.” ???
Section 4.6: “a ontology” → “an ontology”
Section 4.6, also in other places: “an URL” → “a URL”
Section 4.6: “was included” → “has included”/ “includes”
Section 5: “state-of-the-art” → “state of the art”
Section 5: Define the A-box and the Tbox
Section 5: “The benefits of having a RDF dataset with no or very few broken links include the increase of the trust in the consistency of the dataset increases and ...” → remove “increases”
Section 5: “... matter, so if a triple is moved to the end of the file that stores the triples, it cannot be computed.” → ambiguous “it”
Section 5: use the same numbering for tasks in the list and the text


Comments

Unsupervised Link Discovery Through Knowledge Base Repair (http://svn.aksw.org/papers/2014/ESWC_COLIBRI/public.pdf) by Axel-Cyrille Ngonga Ngomo, Mohamed Ahmed Sherif and Klaus Lyko in Extended Semantic Web Conference (ESWC 2014)