Review Comment:
The paper is an extension of [2] and aims to "identify a set of meaningful, efficient, and non-redundant [(RDF) graph] measures, for the goal of describing RDF graph topologies more accurately". The authors further define that an "efficient" measure should be distinct from other measures and should add value in describing an RDF graph (in comparison to other RDF graphs). The authors rely on two types of measures: general graph measures taken from [2] and RDF graph measures from [3]. The authors introduce the different measures before answering three research questions:
- Which measures does the set M' of efficient measures for characterizing RDF graphs comprise?
- Which subset of M' (M''_c) characterizes RDF graphs of a certain domain c?
- Which of the measures in M' show the best performance in classification tasks that aim to discriminate RDF datasets with respect to their domains?
The authors answer these three questions in an empirical way using 280 RDF datasets of the LOD cloud. They find 29 out of 54 evaluated measures to be efficient (i.e., these measures are within M'). 13 of these measures are identified as having an impact on distinguishing datasets of different domains from each other. In addition, the most important features per domain are determined.
=== Positive aspects
+ The research questions the paper focuses on are very important for various research fields related to RDF graphs.
+ The approach the authors apply makes sense to me. Of course, some details might be arguable. However, this is a complex piece of work, which naturally involves many possible choices and decisions the authors have to make.
+ The article is an extension of [2]. However, the authors clearly distinguish the two articles from each other and list the extensions made.
+ The insights the authors point out are valuable and can become important for the community.
+ It is very good that the authors exclude the three domains that had a low number of RDF datasets from their per-domain experiments.
+ The authors made the detailed analysis results as well as the framework they used for the analysis available.
+ The paper is well written.
=== Minor issues
The paper has some minor issues that need to be fixed before it can be published.
- In Section 3, the authors give formal definitions of 28 graph measures. However, the chosen variable and function names need to be fixed.
-- Several letters have more than one meaning:
--- Some of them are used as a value and as a function at the same time. For example, d is the maximum total degree of vertices within a graph _and_ the function that returns the degree of a single vertex. The same holds for d^+, d^- (see Page 5, 2nd column, lines 6-7) and PR (Equation 11).
--- p is used as a single property and as the fill measure (Def. 3.2 vs. Equation 12).
--- C is used for centrality measures (e.g., C_d) and for the domains of the LOD cloud (Section 3.2.3 vs. 4.1).
--- m is used for the number of edges of a single graph and for a single metric (Equation 2 vs. Equation 20).
-- The ^+ is used with two meanings. For d^+, z^+, C_{d^+} and others, it means that the measure focuses on incoming edges. However, the measure C^+_d uses it without referring to directed edges, since this measure relies solely on values of the undirected graph representation. I would strongly suggest removing the ^+ and using something else to distinguish between the point centrality and the graph centrality.
-- The authors follow a common convention for variable names, i.e., capital letters denote sets while lowercase letters are used for variables and functions. However, the authors do not seem to stick to it strictly (e.g., C_d denotes a single number instead of a set). I would strongly suggest following a clearer and more consistent naming scheme.
-- In Equation 21, the authors introduce the usage of a bar on top of val(m, D_c) to express that the mean of all val(m, x \in D_c) is used. However, the bar notation has not been used before. Since the authors also use "std" and "val" as functions within the equation, using an explicit "avg" function would be better, e.g., avg(val(m, D_c)) = \frac{1}{|D_c|} \sum_{x \in D_c} val(m, x).
- The authors select a set of 54 metrics. Of course, many more metrics are available. Hence, the criteria for why one metric has been selected and another has not are important. Here, the paper needs to be improved, since some of the (implicit) decisions are not clear to me.
-- The authors define z, z^+ and z^- as the mean total-, in-, and out-degree of a graph G. Here, I am missing a clearer definition. Is it the arithmetic mean? If so, does it make sense to have z^+ and z^-, since they should give the same number as z (z = m/n = m^-/n = z^-, with m^- being the number of edges counted as outgoing edges, which is exactly the number of edges in an RDF graph; see the short sketch at the end of this section)? (This should not be confused with the subject out-degree and the object in-degree of [3], which give different values since they are normalized by the number of subjects and the number of objects, respectively.) If it does make sense to have z^+ and z^-, why are they missing in Figure 1? I can only see avg_degree in the figure.
-- Several measures that would be needed to complete the set seem to be missing. If the authors have a good reason for leaving them out, they should explain it (please excuse me if the reason is obvious; it does not seem obvious to me). If the reason can be found in a different paper, it should still be stated, or the authors should explicitly refer to it.
--- h and h^+ are used but h^- is missing
--- \alpha and \alpha^+ are used but \alpha^- is missing
--- \sigma, \sigma^2 and cv are missing in Table 1 although they have been defined in Equations 17-19.
--- C_{d^+} and C_{d^-} are listed in Table 1 although they have never been defined in the text.
- The authors try to find a set of measures that show the best performance in classification tasks that aim to discriminate RDF datasets with respect to their domains. In Section 5.3, the authors present the results of the classification and focus on the measure importance for the different tasks. However, I am missing the accuracy / F1-measure of the classifier. Yes, I got that "the primary aim was not to find the best classification model". However, without these values, it remains questionable whether the importance of single features for a classifier is really interesting (since the classifier may not produce meaningful results). This is especially relevant since the authors claim that they achieved an "acceptable prediction accuracy" and that their "experiment showed that, by employing topological measures, the prediction of categories for datasets is possible". From my point of view, the paper does not show that as long as the accuracy values are not reported.
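A short sketch of the degree remark above, assuming z, z^+ and z^- are arithmetic means over the n vertices of a graph with m edges, and with d^+ denoting the in-degree and d^- the out-degree as in the paper: since every edge has exactly one source and one target, z^+ = \frac{1}{n} \sum_{v \in V} d^+(v) = \frac{m}{n} and z^- = \frac{1}{n} \sum_{v \in V} d^-(v) = \frac{m}{n}; if z is likewise defined as m/n, all three values coincide.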
=== Writing Style
In general, the paper is well written. Nevertheless, I noted the following issues:
- Def. 3.1.: "A multigraph G" --> "A directed multigraph G"
- Page 8, left column, line 10: \alpha_{in} --> \alpha^+
- Equation 9: it is confusing that d is defined again, although it has already been defined and used several times before.
- Table 1: The style guide suggests avoiding vertical bars in tables. In addition, it might be easier to arrange the measures differently: putting the total, in, and out variants of a measure in one row makes the table shorter and still easy to read (from my point of view).
- Not all URLs seem to be defined as links (e.g., footnote 1).
- Page 11, left column, line 15: "outperforms other established classification" --> "outperforms other established classifiers" or "outperforms other established classification algorithms"
- Page 11, right column, line 1: "experiment" --> "experiments"
- Page 11, right column, line 1: "graph-object after loading into memory" --> "graph-object after loading it into memory"
=== Paper References
- The paper has 58 references. However, it seems that only the first 37 references are used. I am not sure why the authors have an additional 21 unused references in their bibliography. This has to be fixed.
- [54] has been cited as an arxiv.org paper although it has been published as a peer-reviewed paper.
- [58] has faulty page numbers.
=== Comments
This section collects some comments that are not crucial to address. However, they may increase the quality of the paper.
- I think the authors could be more precise in their formulations with respect to terms like "efficient measures" or sentences like "The experiment results show that a large proportion of graph measures are redundant, in terms of that they do not add value to describe RDF graphs." Especially because of the very general introduction and motivation, and the usage of general graph measures, it is important to point out that the paper only compares RDF graphs with other RDF graphs. Some statements create the impression that the authors are looking for measures that are characteristic of RDF graphs in general. However, from my understanding of the paper, that is neither the case nor the authors' aim.
- I am missing a discussion about the full materialization of RDF graphs. Although the authors stress several times that RDF graphs differ from other graphs since they come with A-box and T-box information, it would be interesting to follow this thought further. Wouldn't a full materialization of an RDF graph influence some of the measures a lot (e.g., m_p)? I assume that the graphs have been used in the form in which they have been published. However, as future work, it might make sense to define a "normal form" for RDF graphs that either enforces materialization or removes redundant information that could be inferred.
- Table 1 lists the graph measures while Table 3 lists the RDF graph measures. The separation of both makes sense to me. However, I do not understand why Table 3 is not part of Section 3 (e.g., as subsection 3.3). Section 4 should focus on the setup of the experiments, while Section 3 should already define all the measures that are used.
- I think it is good to mention libraries that have been vital for the implementation. It acknowledges the effort the open-source developers put into those libraries. However, I think a research paper is not the right place for promotional sentences (footnotes 11 and 12). I would strongly suggest changing the sentence "In order to ..." and the two footnotes to something like "We optimized the performance using external Python libraries\footnote{Our implementation mainly relies on numpy \url{link to numpy} and pandas \url{link to pandas}.}"
- In Equation 20, the usage of a function called "threshold" is unfortunate since it does not fit the meaning of the function. The "3" is the threshold in this equation; the function counts the number of statistical tests that identify the given metric as meaningful. I would suggest using a different name that fits this meaning (e.g., "test(m, F)"; see the sketch at the end of this section).
- I am wondering why the authors do not discuss the quality of the labeling of the LOD cloud. For example, wouldn't it be possible to take the results in Figure 4 as a hint that some of the domains used in the LOD cloud are too coarse-grained? Linguistics seems to be a very fine-grained category with low variances. The same seems to hold for publications and geography. However, something like cross-domain seems to contain a high amount of diverse data and, hence, might influence the results of the classification experiments.
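A sketch of the renaming suggested above, based on my reading of Equation 20 (to be checked against the paper's exact formulation): test(m, F) = |\{ f \in F : f \text{ identifies } m \text{ as meaningful} \}|, and a measure m is kept as efficient iff test(m, F) >= 3. This way, the "3" remains the actual threshold, while the function name reflects that it merely counts positive test outcomes.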