Completing RDF Data in Linked Open Data Cloud using Formal Concept Analysis

Tracking #: 758-1968

Authors: 
Mehwish Alam
Aleksey Buzmakov
Victor Codocedo
Amedeo Napoli

Responsible editor: 
Guest Editors EKAW 2014 Schlobach Janowicz

Submission type: 
Conference Style
Abstract: 
In recent years there has been a huge increase in the amount of information published as Linked Open Data (LOD). Its popularization and quick growth have led to challenging aspects regarding quality assessment and data exploration of the RDF triples that shape the LOD cloud. In particular, we are interested in two important aspects of these challenges, namely how to deal with different data schemas and how to process heterogeneous resource descriptions. In this work we propose a novel technique to overcome these issues by implementing a knowledge discovery process based on Formal Concept Analysis which automatically detects incomplete information. We propose a structure for the organization of the RDF triples through a concept lattice of graph patterns, providing a powerful navigation mechanism while allowing for the discovery of implication rules that can be used to complete the LOD cloud. Such an organization also helps users formulate specific SPARQL queries.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
[EKAW] reject

Solicited Reviews:
Review #1
Anonymous submitted on 13/Aug/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation
Select your choice from the options below and write its number below.

== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject
1

Reviewer's confidence
Select your choice from the options below and write its number below.

== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)
3

Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4

Novelty
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4

Technical quality
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4

Evaluation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present
3

Clarity and presentation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4

Review
Please provide your textual review here.

The paper proposes an approach based on Formal Concept Analysis (FCA) and heterogeneous pattern structures to support LOD data completion and exploration. Using a concept lattice, the approach discovers implication rules which can be used for quality assessment and data completion. The same concept lattice also provides a SPARQL query map which allows the user to navigate and discover triples through a query refinement/expansion process.
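To make the idea of rule-based completion concrete, here is a minimal sketch. It is not the authors' algorithm: it works on a plain binary context rather than on pattern structures, and it only considers single-attribute rules. The entity names, attribute labels, and the 0.8 confidence threshold are all hypothetical. A rule that holds for most, but not all, entities points at entities whose description may be incomplete.

```python
from itertools import permutations


def completion_candidates(context, min_conf=0.8):
    """Suggest possibly missing attributes from near-implications.

    context: dict mapping an entity name to its set of attributes
    (a toy formal context; names and threshold are illustrative only).
    Returns tuples (entity, premise, conclusion, confidence).
    """
    attributes = sorted(set().union(*context.values()))
    suggestions = []
    for premise, conclusion in permutations(attributes, 2):
        with_premise = [g for g, attrs in context.items() if premise in attrs]
        if not with_premise:
            continue
        with_both = [g for g in with_premise if conclusion in context[g]]
        confidence = len(with_both) / len(with_premise)
        # A rule that holds for most, but not all, entities hints at a gap.
        if min_conf <= confidence < 1.0:
            for g in with_premise:
                if conclusion not in context[g]:
                    suggestions.append((g, premise, conclusion, confidence))
    return suggestions


# Hypothetical toy context: five entities, one lacks its type information.
toy_context = {
    "car1": {"manufacturer=X", "type=SportsCar"},
    "car2": {"manufacturer=X", "type=SportsCar"},
    "car3": {"manufacturer=X", "type=SportsCar"},
    "car4": {"manufacturer=X", "type=SportsCar"},
    "car5": {"manufacturer=X"},
}

for entity, premise, conclusion, conf in completion_candidates(toy_context):
    print(f"{entity}: {premise} -> {conclusion} (confidence {conf:.2f})")
```

In this toy context the rule from manufacturer to type holds for four of five entities, so the remaining entity is flagged as a completion candidate. The reverse rule holds for every entity and therefore suggests nothing.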

The approach is rather interesting, and some preliminary experimentation results are reported. There are however some issues to be discussed and clarified.
- what I miss is a concrete plan for applying the approach in practice: for instance, how could the approach be used to improve the quality of DBpedia or Freebase? By segmenting the datasets per topic?
- the number of rules produced appears to vary irregularly across the datasets considered in the experimentation, e.g., 50 for 50000 triples vs 47 for 4700. Any clue or insight on this?
- the proposed visualisation mechanism for the concept lattice seems rather infeasible, especially for large lattices: even in the small explanatory example used in the paper (the 7 car models), the graph to navigate with the browser is already pretty wide. I can't imagine navigating DBpedia with this approach.
- an evaluation of the use of the approach for navigating the SPARQL query map is not provided. It would also be interesting to investigate whether the applicability of the approach depends on the size of the dataset to be navigated (e.g., I can imagine that it does not make much sense to use the whole infrastructure to navigate a small dataset, where a couple of SPARQL queries would be quicker, while navigating a large dataset with the proposed approach may be practically infeasible)
- some aspects of the experimentation are missing: for instance, how many users evaluated the implication rules generated by the approach?

Other comments:
- table 2: "D-f" should be "E-f"
- page 12: "All data sets considered in this paper are visualized using this kind of navigation tree (see http://www.loria.fr/~abuzmako/EKAW2014)." Only the 7 cars example was available online.

Review #2
Anonymous submitted on 19/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation
Select your choice from the options below and write its number below.

== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
-1 weak reject
== -2 reject
== -3 strong reject

Reviewer's confidence
Select your choice from the options below and write its number below.

== 5 (expert)
== 4 (high)
3 (medium)
== 2 (low)
== 1 (none)

Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.

== 5 excellent
4 good
== 3 fair
== 2 poor
== 1 very poor

Novelty
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
3 fair
== 2 poor
== 1 very poor

Technical quality
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
3 fair
== 2 poor
== 1 very poor

Evaluation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
3 fair
== 2 poor
== 1 not present

Clarity and presentation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
3 fair
== 2 poor
== 1 very poor

Review

This paper presents an approach that uses FCA to suggest additions to RDF graphs (ontology completion, refinement), and concept lattices as a way to navigate alternative SPARQL queries (each node in the concept lattice corresponds to a SPARQL query). This is interesting work, but it is preliminary and not really novel. The related work section and the evaluation also fall short.

FCA has been used for ontology refinement and completion before (e.g. in the context of description logics), but the authors do not mention this work. Also concept lattices as a means to explore RDF graphs (knowledge discovery) have been applied before [1]. This is not a problem, were it not for the fact that the authors do not refer to earlier work, nor properly compare their approach (e.g. with machine learning approaches for ontology completion, and other methods).

The evaluation conducted is not really sufficient to prove the quality of the approach. If the argument is that FCA and concept lattices help knowledge discovery, then preselecting a very limited number of predicates per category for the evaluation undermines this claim. The lack of 'real' recall numbers is also problematic. The authors could have removed a selection of triples from the original graphs to see how well their method detects the missing triples. Finally, no notion is given of the importance of certain triples (e.g. type triples) compared to others.
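The hold-out evaluation suggested above could be set up roughly as follows. This is a sketch under stated assumptions: `complete` is a hypothetical callable standing in for any completion method (it receives the visible triples and returns a set of predicted triples), and the hidden fraction, the seed, and the toy triples are illustrative.

```python
import random


def holdout_recall(triples, complete, fraction=0.1, seed=0):
    """Hide a fraction of triples, run a completion method, measure recall.

    triples:  set of (subject, predicate, object) tuples of strings.
    complete: hypothetical callable mapping the visible triples to a set
              of predicted triples; it stands in for the method under test.
    """
    rng = random.Random(seed)
    ordered = sorted(triples)            # fixed order for reproducibility
    k = max(1, int(len(ordered) * fraction))
    hidden = set(rng.sample(ordered, k))
    visible = set(ordered) - hidden
    predicted = set(complete(visible))
    # Recall: share of the hidden triples that the method recovered.
    return len(predicted & hidden) / len(hidden)


# Toy data: ten made-up triples and two trivial stand-in "methods".
toy_triples = {(f"e{i}", "p", f"o{i}") for i in range(10)}


def oracle(visible):
    """Recovers everything that was hidden (upper bound)."""
    return toy_triples - visible


def nothing(visible):
    """Predicts no triples at all (lower bound)."""
    return set()


print(holdout_recall(toy_triples, oracle))
print(holdout_recall(toy_triples, nothing))
```

Repeating the run over several seeds and hidden fractions would give a recall estimate that does not depend on human judgments of individual rules.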

It would have been helpful if the authors had explained how they plan to extend the paper if it is accepted for the combined track.

More specific comments:
p1.
* "World Wide Web" -> "The World Wide Web" (... articles are missing in more places)
* The footnote should refer to the RDF specification
p2.
* I do not understand the explanation of why it is necessary to assess the quality of linked data (also given the rest of the paper). This does not seem to be what you are doing. Also, DBpedia is not a very good example, as it is a fairly well curated resource.
* In one sentence you mention quality assessment, and later this becomes "data completion".
* Section 2 is missing from the overview
* Section 2 is incomplete. It only mentions 3 related papers. There should be quite a few more (as a quick Google Scholar search will show). Specifically, I would like to see some discussion of the relation between RDF triple store index optimization and the concept lattice, and of the use of FCA for ontology refinement.
p 3.
* The sentence in which sw technologies are "linked" to other data sources should be rephrased.
* You do not consider blank nodes: why not?
* Fully explaining RDF and FCA is not really necessary for this audience (though FCA might be)
* It would be good to make it very explicit that you are working with a running example. It took me a while to notice the correspondences between the SPARQL query, the results, the FCA examples and the concept lattices. Also, your coding scheme (A-a, A-b) makes the table hard to read.
p 6.
* "sports cars" is the correct term: you can remove the "(sic)"
p 7.
* The type of the object of the dbo:manufacturer predicate is determined by the range of the property, not its domain.
p 9.
* Table 5, shouldn't [1963, 1965] be [1963,1967]?? (cf table 4.)
* "r and object" -> "r an object"
p 11.
* Figure 2. The concept lattice visualization does not allow you to see the actual instances/resource/entities that you are interested in. Doesn't that defeat the whole purpose?
p 12.
* Table 7 mentions the execution time... but the paper does not say anything about the implementation of their framework.
p 13.
* If the data had been sparser, how would that have affected your results?

[1] http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/PostersDemos/swc/s... also Gottron et al. ESWC 2014.

Review #3
Anonymous submitted on 25/Aug/2014
Suggestion:
[EKAW] combined track accept
Review Comment:

Overall evaluation
Select your choice from the options below and write its number below.

== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject
1

Reviewer's confidence
Select your choice from the options below and write its number below.

== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)
4

Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4

Novelty
Select your choice from the options below and write its number below.

== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor

4

Technical quality
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4

Evaluation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present
2

Clarity and presentation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4
Review
Please provide your textual review here.

This paper presents an application of Formal Concept Analysis (FCA) to RDF graphs in the context of Linked Open Data.
The method that is presented generates a graph index that makes it possible to navigate an RDF graph. It makes it possible to discover regularities: entities that share properties, and similar entities that lack some of these properties.
Hence, it makes it possible to complete RDF graphs with relevant property values.

The authors suggest that the graph index may also serve as a guide for generating SPARQL queries according to the index patterns. This should be explained and elaborated.

The method seems very interesting in case of noisy incomplete data.
But the largest example that is presented contains only 50,000 triples.
How does it scale with real-size datasets, e.g. 1 million triples?

The examples in the evaluation are not very convincing; they look like toy examples. A real evaluation campaign with different datasets and different target domains (i.e. not only toy subsets of DBpedia) should be conducted.

DBPedia
->
DBpedia

"The type information plays an important role"
->
rdf:type

"Linked Open Data (LOD) [1] has become the de facto standard for publishing data on-line"
->
It is not really (or not only) a "de facto" standard because it is the result of the work of the W3C which is a standardization organization.

semantic web
->
Semantic Web

3.1 Linked Open Data
In the definition of the RDF labelled graph, there is a confusion between edge and predicate. A predicate is the label of an edge.

"Finally all the assertions present in an RDF graph are given as
follows A in V x V x E."
->
V x E x V

In RDF, blank nodes cannot appear in predicate position, it should be corrected in the definition (two occurrences)

The reference to SPARQL W3C Rec may be updated to SPARQL 1.1 :
http://www.w3.org/TR/sparql11-query/

"we use also use"

Table 1:
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns\#
->
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#

"the remainder of this."
->
the remainder of this paper.

"(A2, B2) a superconcept (A1, B1)."
->
(A2, B2) a superconcept of (A1, B1)

gray cells
->
grey

"labeled as “Islero, 400GT” in Figure 1"
->
In Fig. 1, it is labeled “Islero, 450GT”

In such a case, “rdf:type” and “dbo:manufactured” correspond
to a level of “semantics” to which a query can be understood as a matching
of meanings entailing a deeper level of description.
->
This sentence is vague.

where all cars of the brand “Lamborghini”
->
where all cars are of the brand “Lamborghini”

it provide
->
provides

owl:class
->
owl:Class

"In reality, LOD do not always consists of triples of resources (identified by their Universal Resource Identifiers or URIs) but contains a diversity of datatypes including dates, numbers, lists, strings and others."
->
RDF lists are represented as triples using URIs (rdf:first, rdf:rest, rdf:nil). They are not considered as datatypes.

"For any given relation (object or literal), we can define the pattern structure Kr = (G, (Dr, ^), δr) where (Dr, <=)
is an arbitrary order"
->
Shouldn't it be <= instead of ^ ?

5 Concept lattice as an index for the RDF graph

"a formal concept represents a pattern in the RDF
graph which, in terms of SPARQL, can be expressed as a SPARQL query"
->
This idea is presented intuitively several times, but it is not really formalized and validated.

The comparison between SQL and SPARQL databases is unfair because, using SPARQL, one can query the schema, e.g. to discover properties and classes.

In Fig 2, use: "rdf:type"

"we propose to visualize a concept lattice"
->
Is there a graphical software tool?

The scenario of using DBpedia to buy a sports car is not credible.

In Fig 7, what is the relation between the Exec time and the number of triples? How does it scale?

"when the evaluator provides the last “yes” answer for an implication rule"
->
I do not understand this sentence.

"Based on the concept lattice obtained by heterogeneous patterns structures, a navigation mechanism over the RDF triples is provided which also takes into account the suggestion of SPARQL queries."
->
I do not understand the end of the sentence on SPARQL queries.

References

rdf
->
RDF

dbpedia
->
DBpedia