Review Comment:
Dear authors, thank you very much for your precise feedback to the reviewers. The paper has matured further; all of the following is basically just nit-picking, which doesn't influence my clear recommendation to accept.
First, about your specific response to my concern regarding the expansion on "semantic heterogeneity". Your extended coverage now explains more appropriately why you consider inference on relational databases to be out of the scope of your work. Indeed, "calculations with an equivalent effect to some forms of logical inference" is what I had in mind in my review of the previous version. I'm not sure whether your claim that "such applications of these features are not very common in practice to the best of our knowledge" is really true. Ideally you would provide some evidence based on, e.g., textbooks or survey papers from the field of relational databases. But in any case I would not consider it reasonable to ask you to provide even harder evidence by examining actual relational databases that include views, triggers, stored procedures, etc., as, in contrast to, say, LOD, they are close to impossible to find anywhere on the Web.
Commenting on your discussion with Reviewer 3 on the definition of "quality", I'd like to argue that there is no contradiction between defining "mapping quality as mapping utility w.r.t. a query workload posed against the mapped data" and "the notion of multi-dimensional quality that is also frequently used in the literature". (BTW, while  and  are reasonable references to cite here, as they talk about mappings and build on this multi-dimensional definition of quality, the more appropriate source for that definition is another reference that you have already, i.e. .) Is "utility w.r.t. a workload" actually a unidimensional measure, or are there really multiple aspects of it? In fact, Section 4.7 claims to introduce one single scoring function, but actually you are already defining two: "We […] observe a score that reflects the utility of the mappings […]. Intuitively, this score reports the percentage of successful queries for each scenario. However, in a number of cases, queries may return correct but incomplete results, or could return a mix of correct and incorrect results. In these cases, we consider per-query accuracy by means of a local per-query F-measure. Technically, our reported overall score for each scenario is the average of F-measures for each query test, rather than a simple percentile of successful queries." In other words, your "overall score" can be seen as a metric that aggregates two more basic metrics, namely per-query precision and recall.
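To make the point concrete, here is a minimal sketch of how I read the quoted passage. It is not taken from RODI; the function names, the set-based comparison of results, and the treatment of empty results are my own assumptions. It computes both the average of per-query F-measures and the plain share of fully successful queries, so the difference between the two scores becomes visible.

```python
def f_measure(expected, returned):
    """Per-query F-measure (F1) between expected and returned result sets."""
    expected, returned = set(expected), set(returned)
    if not expected and not returned:
        # Assumption: an empty result that should be empty counts as correct.
        return 1.0
    true_positives = len(expected & returned)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(returned)  # share of returned results that are correct
    recall = true_positives / len(expected)     # share of expected results that were returned
    return 2 * precision * recall / (precision + recall)


def scenario_scores(query_tests):
    """Return (mean per-query F-measure, fraction of fully successful queries)."""
    f_values = [f_measure(expected, returned) for expected, returned in query_tests]
    mean_f = sum(f_values) / len(f_values)
    success_rate = sum(1 for f in f_values if f == 1.0) / len(f_values)
    return mean_f, success_rate


# Hypothetical scenario with three query tests: (expected results, returned results)
tests = [
    ({"a", "b"}, {"a", "b"}),            # correct and complete
    ({"a", "b", "c", "d"}, {"a", "b"}),  # correct but incomplete
    ({"a"}, {"x"}),                      # entirely incorrect
]
mean_f, success_rate = scenario_scores(tests)
```

In this made-up scenario only one of three queries is fully successful, yet the averaged F-measure is higher because the incomplete query still earns partial credit. Precision and recall are the two underlying dimensions that the single reported number folds together.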
I consider your new Section 5.3 quite useful; it gives a clear impression of what it feels like to use RODI in practice. Thank you also for adding examples that make Section 5.2 easier to understand.
* Section 5.2: misspelling: "interger"
* Reference : space missing between "Mappings" and "to".