Review Comment:
The main goal of the paper is proving a sort of Clustering hypothesis on ranking the results of [simple] queries on LOD.
This conjecture does not seem to be fully empirically supported with incontrovertible [and stable] results.
Interestingly this work has also some sideproducts: a simple method + ad hoc measures and a benchmark to assess the quality of LOD
The simple methods and measures are borrowed and adapted from the literature [e.g. from IR].
Some evident limitations that have been also highlighted in the paper should be addressed.
The minimal goal should be to achieve more stability in the results.
The paper is well written and organized.
Relevant literature on similarity and ranking from the machine learning field is lacking while it could improve the discussion.
A lot of work likely comes from [30] and [31.
I'd suggest to explicitly highlight the novel contribution of this paper.

COMMENTS / ISSUES
There's a huge literature on this kind of problems in machine learning and, more specifically the ILP subfield.
This related work is scarcely cited and discussed.
Most of ML methods can be regarded as characterized by the optimization of some objective function; when a closed form is not available such a kind of problems is generally solved through some heuristic search of suboptimal solutions.
It is to be better justified why to borrow heuristics from IR when the target representation of the data in structured.
But, in the scenario of the paper, the authors are dealing with structured data represented by means of technologies such as the Linked Data, whose specificity does not seem to be addressed directly (e.g. the reasoning level).
The Ugly Duckling theorem [DHS] asserts that classification is impossible without some sort of bias. Similarity has many forms, the choice of a specific measure for a given concept biases the solution to an instance classification problem.
One way to take into account a notion of context for a query may be the one adopted in [dFEL], where the concepts of interest may be elicited from the query itself.
There are many related similarity measures that are related to the context and can be considered as general forms of the simple one used in the paper, e.g. see [BWH; dSF].
Also kernel functions designed for knowledge representations spanning from RDF to OWL (e.g. see [BS; FdE; FdE2]) may be regarded as similarity measures and would deserve at least a discussion [if not a comparison]
The similarity measure in Def. 1 is quite simple and does not seem to require any form of reasoning [instancechecking] but only a lookup on the model graph. It also requires the Unique Names Assumption for counting.
An easy extension that may be taken into account is adding also the similarity of o_1 and o_2 (under measurement) as fillers of the same subject, say s, for a given predicate p.
As mentioned before, this measure is absolute, yet it is plausible to assume that individuals that may be required to be similar w.r.t. a certain query could be better treated as dissimilar w.r.t. a different query. Again, this is a typical feature selection problem that is worth discussing in the paper.
Why is the Estimate Quality of an answer defined in absolute terms instead of percentages?
Queries considered in Def. 2 are limited to those with one answer variable. Discuss this limitation.
Search space topology: The Cluster Hypothesis in Def. 3 depends on the sparseness of the data. It works for concept queries having answers in a densely populated region of the graph but in case of more sparse data (scarcely populated regions) this assumption may lead to cumbersome situations.
The involvement of humans in the loop has to be more strongly justified: the translation of NL queries into a formal form may represent itself a difficult step (see sect. 3.3).
Matching hits on the ground of rdfs:label seems a bit adverse to the nature of the structural representation of the Linked Data.
The applicability of the method in case of limited or sparse labeling has to be taken into account.
The adopted DCG measure seems to be a better choice over other standard indices only if the results are ranked enumerations.
When results are mere set element enumerations there are many reasons errors may occur [an individual is not a member of the target set].
In absence of an explicit additional criterion all errors should be judged as equally wrong
[So why "The ideal ordering would have had the answers Ola Brunkert and Stig Anderson swapped" ?]
In def. 4: is the value rDCG different in the formula for nDCG different for each call?. Are you hinting some averaging over the possible random sequences?
The same questions arises for rMD in Def. 5.
Linear scaling cluster heuristic scores makes a strong assumption that must be justified. [(Multidimensional) Scaling is a form of clustering in itself]
The assumption tested is that answers to any query can fit a linear scale of 5 degrees.
Discerning correct/incorrect answers may need in principle only 2 clusters, but there may occur distributions of results that are naturally clustered in larger number of groups.
Queries considered in the paper are require instancechecking only.
SPARQL queries are more general.
Solutions to this limitation do not appear to be sufficiently discussed in the paper.
The variance in the results seems to be the main problem affecting the significance of the conclusions that can be drawn,
I am sure that resorting to more elaborate similarity measures / clustering methods [e.g. exploiting the data semantics] could improve the stability of the results.

MINOR ISSUES
 Tab. 1 caption: "asterix" > "asterisk"?
 p. 5: "[31]" in 3.1 and 3.2: I'd avoid using citations of papers as subjects for sentences.
 Tab 3: the min and max bars appear a bit confusing
[you may use a different way to represent the std. dev. wrt the avg values for example]
 Fig.3 the proposed measures may behave quite differently;
e.g. consider queries 45 and 46 where quite different MD scores were observed w.r.t. an equal max nDCG score.

BIBLIOGRAPHY
[dFEL] d'Amato, Fanizzi, Esposito & Lukasiewicz: Representing Uncertain Concepts in Rough Description Logics via Contextual Indiscernibility Relations. URSW (LNCS Vol.7123) 2013: 300314
[DHS] Duda, Hart & Stork: Pattern Classification, 2nd Ed. 2000
[BWH] Borgida, Walsh & Hirsh.: Towards Measuring Similarity in Description Logics. Proceedings of the International Workshop on Description Logics DL2005, volume 147 of CEUR Workshop Proceedings, CEURWS.org, 2005.
[dSF] d'Amato, Staab, Fanizzi: On the Influence of Description Logics Ontologies on Conceptual Similarity. EKAW 2008: 4863
[BS] Bloehdorn & Sure: Kernel Methods for Mining Instance Data in Ontologies. Proc. ISWC/ASWC 2007: 5871
[FdE] Fanizzi, d'Amato & Esposito: Learning with Kernels in Description Logics. Proc. ILP 2008: 210225.
[FdE2] Fanizzi, d'Amato, Esposito: Induction of robust classifiers for web ontologies through kernel machines. J. Web Sem. 11: 113 (2012)
