Characterizing RDF Graphs through Graph-based Measures - Framework and Assessment

Tracking #: 2446-3660

Authors: 
Matthäus Zloch
Maribel Acosta
Daniel Hienert
Stefan Conrad
Stefan Dietze

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
The topological structure of RDF graphs inherently differs from other types of graphs, like social graphs, due to the pervasive existence of hierarchical relations (TBox), which complement transversal relations (ABox). Graph measures capture such particularities through descriptive statistics. Besides the classical set of measures established in the field of network analysis, such as size and volume of the graph or the type of degree distribution of its vertices, there has been some effort to define measures that capture some of the aforementioned particularities RDF graphs adhere to. However, some of them are redundant, computationally expensive, and not meaningful enough to describe RDF graphs. In particular, it is not clear which of them are efficient metrics to capture specific distinguishing characteristics of datasets in different knowledge domains (e.g., Cross Domain vs. Linguistics). In this work, we address the problem of identifying a minimal set of measures that is efficient, essential (non-redundant), and meaningful. Based on 54 measures and a sample of 280 graphs of nine knowledge domains from the Linked Open Data Cloud, we identify an essential set of thirteen measures, having the capacity to describe graphs concisely. These measures have the capacity to present the topological structures and differences of datasets in established knowledge domains.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Gëzim Sejdiu submitted on 01/May/2020
Suggestion:
Minor Revision
Review Comment:

The paper proposes a novel approach to efficiently compute topological graph measures for RDF datasets. The main goal of the paper is to identify a set of meaningful, efficient, and non-redundant measures, in order to describe RDF graph topologies more accurately and to facilitate the development of the aforementioned solutions. It uses conventional statistical tests, such as the analysis of correlation coefficients, results of feature selection, variability, and a supervised classification task, to assess a measure’s efficiency in terms of its capacity to discriminate dataset knowledge domains. A sample of 280 RDF datasets from different knowledge domains was acquired from the LOD Cloud. As a result, it provides a statistical visualization of the measures online.
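As an aside for readers, a minimal sketch of the correlation-based redundancy check summarized above (illustrative only; the DataFrame, column names, and the 0.9 cutoff are assumptions, not the authors' implementation):

    import pandas as pd

    # Illustrative data only: one row per RDF dataset, one column per graph measure.
    df = pd.DataFrame({
        "n_vertices": [10, 200, 3000, 45],
        "n_edges":    [40, 900, 12000, 120],
        "max_degree": [5, 60, 800, 9],
    })

    # Pairwise rank correlation between measures; highly correlated pairs are
    # candidates for redundancy.
    corr = df.corr(method="spearman").abs()
    redundant = [(a, b) for i, a in enumerate(corr.columns)
                 for b in corr.columns[i + 1:]
                 if corr.loc[a, b] > 0.9]
    print(redundant)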
The paper extends an ESWC 2019 publication [2] by the same authors. By providing formal definitions of the proposed graph measures and investigating novel techniques for their evaluation, the paper falls within the scope of the Journal and meets the criteria required for a full paper submission. As a full paper, this review will focus on the dimensions of originality, the significance of results, and the quality of writing.

Positive and negative aspects:
(+) the paper is well written and organized
(+) it gives a comprehensive analysis and evaluation
(+) the motivation is well justified
(-) it lacks a comparison with other related work
(-) the scalability is not well supported. Up to what size can it scale?

The paper tackles an interesting topic that may be of interest to the community. However, in its present form, the paper suffers from a number of issues, which are outlined below.

- There is no performance analysis of the proposed method (statistics), which is important when processing large-scale datasets. Although on Page 10, Line 11 (right) the authors mention that they do employ a different graph partitioning mechanism using map-reduce, I recommend that the authors discuss this aspect more thoroughly.
- Human validation is not present. This seems critical for similarity-based techniques, specifically if there is no large ground truth. Are human-based evaluations going to be part of the future work? How does your system decide which of the measures are more meaningful, efficient (if no runtime is reported), and non-redundant? Which of the features are more important? For which domain? Are these findings tuned towards and specific to the chosen domains? How will the system perform on a completely new domain, e.g., Blockchain? I think this explanation is missing from the paper, and answering these questions would be a good addition.

- A more thorough evaluation of the model would improve the quality of the paper. The authors give some details on how the model has been used, but its applicability to different kinds of datasets or use cases is not investigated in detail.

=== Minor comments ===
Page 2, line 2 (left) “vertices/ edges” → “vertices/edges”
Page 1, line 51 (right) until Page 2, line 5 (left) – I consider this sentence too long. It could be split into two parts, i.e., after “characterize RDF graphs [3],”.
Page 2. Line 15 (left) “(depending of the size of the graph” → “(depending on the size of the graph”
Page 2. Line 16, the claim “Focusing on an efficient set of descriptive measures...” needs a reference (if there is any).
Page 2. Line 7 (right) “thirteen” → “13”? I suggest that you use numerals for numbers above ten and make this consistent throughout the paper, e.g., on page 2, line 31 (right) “fifty-four” → 54.
Page 3. Line 24-15 (left) “SPARQL endpoints” → would “SPARQL engines” or “SPARQL evaluators” be better? These optimizations aren’t at the level of the endpoint but rather in the core engine of the solutions.
Page 3. Line 50 (left) “clusters”[14].” → “clusters” [14].” Add ~\cite{}.
Page 3. Line 1 (right) “for profiling SPARQL engines." – did you mean for profiling Linked Open Data?

=== Closing Remarks ===
Overall, the paper is well-written and structured. The motivation for having a mechanism for deciding which of the measures are important when describing RDF data is clearly stated.
However, there is room to detail the approach further for larger datasets that do not fit on a single machine, and to better discuss the comparative evaluation and the scalability of the proposed approach.

Review #2
By Michael Röder submitted on 05/May/2020
Suggestion:
Minor Revision
Review Comment:

The paper is an extension of [2] and aims to "identify a set of meaningful, efficient, and non-redundant [(RDF) graph] measures, for the goal of describing RDF graph topologies more accurately". The authors further define that an "efficient" measure should be discrete with respect to other measures and should add an additional value in describing an RDF graph (in comparison to other RDF graphs). The authors rely on two types of measures: general graph measures taken from [2] and RDF graph measures from [3]. The authors introduce the different measures before answering three research questions:
- Which measures does the set M' of efficient measures for characterizing RDF graphs comprise?
- Which subset of M' (M''_c) characterizes RDF graphs of a certain domain c?
- Which of the measures in M' show the best performance in classification tasks that aim to discriminate RDF datasets with respect to their domains?
The authors answer these three questions in an empirical way using 280 RDF datasets of the LOD cloud. They find 29 out of 54 evaluated measures to be efficient (i.e., these measures are within M'). 13 of these measures are identified to have an impact on distinguishing datasets of different domains from each other. In addition, the most important features per domain are determined.

=== Positive aspects

+ The research questions the paper focuses on are very important for various research fields related to RDF graphs.
+ The approach the authors apply makes sense to me. Of course, some details might be arguable. However, it is a complex work and this automatically leads to a lot of different possibilities and decisions that have to be made by the authors.
+ The article is an extension of [2]. However, the authors clearly distinguish the two articles from each other and list the extensions made.
+ The insights the authors point out are valuable and can become important for the community.
+ It is very good that the authors exclude the three domains that had a low number of RDF datasets from their per-domain experiments.
+ The authors made the detailed analysis results as well as the framework they used for the analysis available.
+ The paper is well written.

=== Minor issues
The paper has some minor issues that need to be fixed before it can be published.

- In Section 3, the authors give formal definitions of 28 graph measures. However, the chosen variable and function names need to be fixed.
-- Several letters have more than one meaning:
--- Some of them are used as value and as a function at the same time. For example, d is the maximum total degree of vertices within a graph _and_ the function that returns the degree of a single vertex. The same holds for d^+, d^- (see Page 5, 2nd column, lines 6-7) and PR (Equation 11).
--- p is used as a single property and as the fill measure (Def. 3.2 vs. Equation 12).
--- C is used for centrality measures (e.g., C_d) and for the domains of the LOD cloud (Section 3.2.3 vs. 4.1).
--- m is used for the number of edges of a single graph and for a single metric (Equation 2 vs. Equation 20).
-- The ^+ is used with two meanings. For d^+, z^+, C_{d^+} and others, it means that the measure focuses on incoming edges. However, the measure C^+_d uses it without referring to a directed graph, since that measure relies solely on values of the undirected graph representation. I would strongly suggest removing the ^+ and using something else to distinguish between the point centrality and the graph centrality.
-- The authors follow a common usage of variable writings, i.e., capital letters are sets while lowercase letters are used for variables and functions. However, the authors do not seem to stick strictly to that (e.g., C_d denotes a single number instead of a set). I would highly suggest following a clearer convention for variable names.
-- In Equation 21, the authors introduce the usage of a bar on top of val(m,D_c) to express that the mean of all val(m, x \in D_c) is used. However, the bar notation has not been used before. Since the authors also use "std" and "val" as functions within the equation, the usage of "avg" as an average function would be better.
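For illustration, the suggested notation could read as follows (a sketch only, assuming the bar denotes the arithmetic mean over all datasets in D_c): avg(val(m, D_c)) = \frac{1}{|D_c|} \sum_{x \in D_c} val(m, x).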

- The authors select a set of 54 metrics. Of course, there are many more metrics available. Hence, the criteria for why one metric has been selected and another has not are important. Here, the paper needs to be improved, since some of the (implicit) decisions are not clear to me.
-- The authors define z, z^+ and z^- as the mean total-, in- and out-degree of a graph G. Here, I am missing a clearer definition. Is it the arithmetic mean? If yes, does it make sense to have z^+ and z^-, since they should give the same number as z (z = m/n = m^-/n = z^-, with m^- being the number of outgoing edges, which is exactly the number of edges in an RDF graph; a small sketch illustrating this follows after this list)? (This should not be confused with the subject out-degree and the object in-degree of [3], which give different values since they are normalized by the number of subjects and the number of objects, respectively.) If it makes sense to have z^+ and z^-, why are they missing from Figure 1? I can only see avg_degree in the figure.
-- Several measures seem to be missing that should have been there to complete the set of measures. Why? If the authors have a good reason, they should explain it (please excuse me if the reason is obvious; it does not seem obvious to me). If the reason can be found in a different paper, it should still be stated, or the authors should explicitly refer to it.
--- h and h^+ are used but h^- is missing
--- \alpha and \alpha^+ are used but \alpha^- is missing
--- \sigma, \sigma^2 and cv are missing in Table 1 although they have been defined in Equations 17-19.
--- C_{d^+} and C_{d^-} are listed in Table 1 although they have never been defined in the text.
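A minimal sketch of the degree arithmetic mentioned in the z/z^+/z^- comment above (assuming a directed multigraph built with networkx; the example edges are made up). Since every edge contributes exactly one outgoing and one incoming endpoint, the mean out-degree and the mean in-degree both equal m/n:

    import networkx as nx

    # A toy directed multigraph (illustrative): n = 3 vertices, m = 4 edges.
    G = nx.MultiDiGraph()
    G.add_edges_from([("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")])

    n, m = G.number_of_nodes(), G.number_of_edges()
    mean_out = sum(d for _, d in G.out_degree()) / n
    mean_in = sum(d for _, d in G.in_degree()) / n

    # Both equal m / n = 4 / 3, since each edge adds one to the out-degree sum
    # and one to the in-degree sum.
    assert mean_out == mean_in == m / n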

- The authors try to find a set of measures that show the best performance in classification tasks that aim to discriminate RDF datasets with respect to their domains. In Section 5.3, the authors present the results of the classification and focus on the measure importance for the different tasks. However, I am missing the accuracy / F1-measure of the classifier. Yes, I understand that "the primary aim was not to find the best classification model". However, without these values, it remains questionable whether the importance of single features for a classifier is really interesting (if the classifier does not lead to meaningful results). Especially since the authors claim that they achieved an "acceptable prediction accuracy" and that their "experiment showed that, by employing topological measures, the prediction of categories for datasets is possible". From my point of view, the paper does not show that as long as the accuracy values are not reported.
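To illustrate what such reporting could look like, here is a hedged sketch (not the authors' code; the feature matrix, domain labels, and the choice of a random-forest classifier are assumptions made purely for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split

    # Illustrative stand-ins: 280 datasets x 13 measures, 6 domain labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(280, 13))
    y = rng.integers(0, 6, size=280)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)

    # Accuracy and macro-F1 reported alongside feature importances would make
    # the importance analysis easier to judge.
    print("accuracy:", accuracy_score(y_te, y_pred))
    print("macro-F1:", f1_score(y_te, y_pred, average="macro"))
    print("feature importances:", clf.feature_importances_)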

=== Writing Style
The paper is well written.

- Def. 3.1.: "A multigraph G" --> "A directed multigraph G"
- Page 8, left column, line 10: \alpha_{in} --> \alpha^+
- Equation 9: it is confusing that d is defined again although it is defined and used several times before.
- Table 1: The style guide suggests avoiding vertical bars in tables. In addition, it might be easier to sort the measures in a different way: putting the total, in, and out measures in one row makes the table shorter and still easy to read (from my point of view).
- Not all URLs seem to be defined as links (e.g., footnote 1).
- Page 11, left column, line 15: "outperforms other established classification" --> "outperforms other established classifiers" or "outperforms other established classification algorithms"
- Page 11, right column, line 1: "experiment" --> "experiments"
- Page 11, right column, line 1: "graph-object after loading into memory" --> "graph-object after loading it into memory"

=== Paper References

- The paper has 58 references. However, it seems that only the first 37 references are used. I am not sure why the authors have an additional 21 unused references in their bibliography. This has to be fixed.
- [54] has been cited as an arxiv.org paper although it has been published as a peer-reviewed paper.
- [58] has faulty page numbers.

=== Comments
This section collects some comments that are not crucial to address. However, they may increase the quality of the paper.

- I think the authors could be more precise in their formulations with respect to terms like "efficient measures" or sentences like "The experiment results show that a large proportion of graph measures are redundant, in terms of that they do not add value to describe RDF graphs." Especially because of the very general introduction and motivation and the usage of general graph measures, it is important to point out that the paper only looks at the comparison of RDF graphs with other RDF graphs. Some statements tend to create the impression that the authors look for measures that are characteristic of RDF graphs in general. However, from my understanding of this paper, that is not the case and not the aim of the authors.
- I am missing a bit of discussion about the full materialization of RDF graphs. Although the authors stress several times that RDF graphs are different from other graphs since they come with ABox and TBox information, it would be interesting to follow this thought further. Wouldn't a full materialization of an RDF graph influence some of the measures a lot (e.g., m_p)? I assume that the graphs have been used in the form in which they have been published. However, in future work, it might make sense to find a "normal form" for RDF graphs that either enforces the materialization or the removal of redundant information that could be inferred.
- Table 1 lists the graph measures while Table 3 lists the RDF graph measures. The separation of both makes sense to me. However, I do not understand why Table 3 is not part of Section 3 (e.g., as Subsection 3.3). Section 4 should focus on the setup of the experiments, while Section 3 should already define all the measures that are used.
- I think it is good to mention libraries that have been vital for the implementation. It acknowledges the effort of the open-source developers who put a lot of work into those libraries. However, I think a research paper is not the right place for promotional sentences (footnotes 11 and 12). I would highly suggest changing the sentence "In order to ..." and the two footnotes to something like "We optimized the performance using external Python libraries\footnote{Our implementation mainly relies on numpy \url{link to numpy} and pandas \url{link to pandas}.}"
- In Equation 20, the usage of a function called "threshold" is very unfortunate since it does not fit the meaning of the function. The "3" is the threshold in this equation. The function counts the number of statistical tests that identify the given metric as meaningful. I would suggest using a different name for this function that fits its meaning (e.g., "test(m,F)"; see the short sketch after these comments).
- I am wondering why the authors do not discuss the quality of the labeling of the LOD cloud. For example, wouldn't it be possible to take the results in Figure 4 as a hint that some of the domains used in the LOD cloud are too coarse-grained? Linguistics seems to be a very fine-grained category with low variances. The same seems to hold for publications and geography. However, something like cross domain seems to contain a large amount of diverse data and, hence, might have an influence on the results of the classification experiments.
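One possible reading of the renaming suggested two comments above (a sketch only; the exact form of Equation 20 is defined in the paper): test(m, F) = |{ f \in F : f identifies m as meaningful }|, and a measure m might then be considered efficient if test(m, F) >= 3, with the "3" made explicit as the threshold.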