Characterizing RDF Graphs through Graph-based Measures - Framework and Assessment

Tracking #: 2529-3743

Matthäus Zloch
Maribel Acosta
Daniel Hienert
Stefan Conrad
Stefan Dietze

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
The topological structure of RDF graphs inherently differs from other types of graphs, like social graphs, due to the pervasive existence of hierarchical relations (TBox), which complement transversal relations (ABox). Graph measures capture such particularities through descriptive statistics. Besides the classical set of measures established in the field of network analysis, such as size and volume of the graph or the type of degree distribution of its vertices, there has been some effort to define measures that capture some of the aforementioned particularities RDF graphs adhere to. However, some of them are redundant, computationally expensive, and not meaningful enough to describe RDF graphs. In particular, it is not clear which of them are efficient metrics to capture specific distinguishing characteristics of datasets in different knowledge domains (e.g., Cross Domain vs. Linguistics). In this work, we address the problem of identifying a minimal set of measures that is efficient, essential (non-redundant), and meaningful. Based on 54 measures and a sample of 280 graphs of nine knowledge domains from the Linked Open Data Cloud, we identify an essential set of thirteen measures, having the capacity to describe graphs concisely. These measures have the capacity to present the topological structures and differences of datasets in established knowledge domains.

Solicited Reviews:
Review #1
By Gëzim Sejdiu submitted on 07/Aug/2020
Review Comment:

The new version submitted by the authors has addressed most of the issues raised by the reviewers in the first round. I appreciate the effort the authors have made to improve the content and tighten the writing. The manuscript has improved a lot, is now well organized, and generally reads well.

I have carefully studied the new version, particularly w.r.t. my comments about the paper and am satisfied with the improvements made.

So, I am happy to recommend accepting the paper in its current form.

Review #2
By Michael Röder submitted on 19/Aug/2020
Review Comment:

The paper is an extension of [2] and aims to "identify a set of meaningful, efficient, and non-redundant [(RDF) graph] measures, for the goal of describing RDF graph topologies more accurately". The authors further define that an "efficient" measure should be discrete with respect to other measures and should add an additional value in describing an RDF graph (in comparison to other RDF graphs). The authors rely on two types of measures: general graph measures taken from [2] and RDF graph measures from [3]. The authors introduce the different measures before answering three research questions:
- Which measures does the set M' of efficient measures for characterizing RDF graphs comprise?
- Which subset of M' (M''_c) characterizes RDF graphs of a certain domain c?
- Which of the measures in M' show the best performance in classification tasks that aim to discriminate RDF datasets with respect to their domains?
The authors answer these three questions in an empirical way using 280 RDF datasets of the LOD cloud. They find 29 out of 54 evaluated measures to be efficient (i.e., these measures are within M'). 13 of these measures are identified to have an impact on distinguishing datasets of different domains from each other. In addition, the most important features per domain are determined.

=== Positive aspects

+ The research questions the paper focuses on are very important for various research fields related to RDF graphs.
+ The approach the authors apply makes sense to me. Of course, some details might be arguable. However, it is a complex work and this automatically leads to a lot of different possibilities and decisions that have to be made by the authors.
+ The article is an extension of [2]. However, the authors clearly distinguish the two articles from each other and list the extensions made.
+ The insights the authors point out are valuable and can become important for the community.
+ It is very good that the authors exclude the three domains that had a low number of RDF datasets from their per-domain experiments.
+ The authors made the detailed analysis results as well as the framework they used for the analysis available.
+ The paper is well written.

+ The paper is a revision of swj2446. From my point of view, the authors fixed all the issues that have been identified or gave good arguments why they won't follow a reviewer's suggestion.

=== Writing Style
The paper is well written.

- Page 6, right column, line 33: It is preferable to use footnotes at the end of the sentence (unless there are several footnotes within a single sentence). At the moment, an (inattentive) reader could misunderstand the "$C_d$\footnote" as $C_d^6$. I would suggest writing it as "$C_d$.\footnote".
- Table 4, footnote: "Compressed archive containing multiple RDF files which need to be merged" --> Either there is a comma missing ("... files, which need ...") or "that" should be used ("... files that need"). In this case, I would suggest the latter solution since the relative clause is important to define the word "files".
- Table 5: the table has a slightly different formatting than the others (i.e., the top and bottom lines are missing).