Review Comment:
This paper presents a hybrid semantic search engine that retrieves and combines both facts from semantic sources and documents as results.
The paper is well written and the topic is very relevant. I believe combining results from the IR and semantic fields is the way forward for search engines. The hybrid approach presented here combines structured and unstructured information to answer user queries. This is not novel in itself: there is a substantial body of prior work on hybrid semantic search engines, including from commercial companies such as Google, which uses the Google Knowledge Graph to improve results and even to give direct answers to users' queries where possible. However, this paper presents a user-based evaluation. This evaluation focuses on important aspects such as user acceptability/usability and efficiency, in the sense of better answering user information needs in a way that is meaningful and can be understood by users with no semantic background.
The authors also did a good job presenting the state of the art. They claim the novelty of the proposed approach is that structured and unstructured data are combined throughout the entire search process, while most approaches (such as Google KG) seem to present facts and concepts independently from each other. Hybrid semantic search approaches, as defined in this paper, either include the partial results of one retrieval method in another (e.g., PowerAqua) or perform reconciliation at the end between matched facts and documents (K-Search). SINFIO, by contrast, claims to perform fact and document retrieval "interdependently", to accept formal, informal and hybrid queries as input (basically through semantic autocompletion?), and in addition to present results as facts, documents, or a hybrid of both. This is shown nicely in Table 1.
In the architecture section (3.1) it is not clear to me what performing fact and document retrieval "interdependently" really means; this should be made a bit more explicit. That is, does it mean that semantic entities are used to find documents (i.e., query expansion) and documents are used to find semantic facts? Or that both the entity search and the document index search are performed at the same time (using and combining partial/intermediate results) instead of one after the other?
To close the gap between user and system knowledge, a semantic autocompletion component is used to help the user create queries with as many formal parts as possible; to do this, the formal knowledge is extended with lexically related words from WordNet. If the user does not choose one of the autocompletion recommendations, n-gram matching is applied to match a query term against the KB (in this case, the Dice similarity of the n-gram sets is computed).
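For reference, this n-gram/Dice fallback could be pictured with a minimal sketch like the following (the function names, n-gram size and threshold are my illustrative assumptions, not taken from the paper):

```python
# Illustrative sketch of n-gram matching with Dice similarity against KB labels.

def ngrams(s: str, n: int = 3) -> set:
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-gram sets: 2|A & B| / (|A| + |B|)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def match_term(term: str, kb_labels: list, threshold: float = 0.5) -> list:
    """KB labels ranked by Dice similarity to an unmatched query term,
    e.g. dice_similarity("Garry Marshal", "Garry Marshall") ~ 0.96."""
    scored = [(label, dice_similarity(term, label)) for label in kb_labels]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda x: -x[1])
```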
In my view, the key contribution of this paper is the set of user-based evaluations that compare user acceptability of the hybrid results with respect to standalone fact or document retrieval, i.e., testing the hypothesis that the complexity of the search process can be hidden from the user while at the same time providing novel content that can be meaningfully understood by users, so as to answer user information needs more efficiently.
Regarding the selected approach, a triple-based fact retrieval algorithm and a graph-traversal (semantic activation) based semantic document retrieval algorithm have been chosen as the methods to combine. First, the labels and synonyms of the resources matched in the query are used for query expansion. This first set of ranked resources and documents then constitutes the starting point for the hybrid algorithm: if the fact retrieval is successful, a hybrid semantic search is performed; otherwise, just a semantic document retrieval. Fact retrieval is limited to two hops to avoid irrelevant inferences. In the case of a one-term query, all resources and triples dependent on the resource the term was (lexically) matched to are returned. If there is more than one term, the algorithm iterates over pairs of adjacent terms until all query terms are matched or no more terms match existing triples. As a result, a subgraph of triple sets is identified, where each triple is connected to another by the same subject or the same object (excluding classes).
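To make the adjacent-term iteration concrete, this is roughly how I read the algorithm (the triple representation and all names are my assumptions, i.e., a reconstruction rather than the authors' implementation):

```python
# My reconstruction: the KB is assumed to be a set of (subject, predicate,
# object) tuples, and match(term) lexically maps a query term to KB resources.
# The paper additionally excludes connections through shared classes, which
# is omitted here for brevity.

def connected_subgraph(query_terms, triples, match):
    """Join adjacent query terms over the triple set, growing a subgraph in
    which every new triple shares a subject or object with the graph so far."""
    matched = set(match(query_terms[0]))
    subgraph = set()
    for term in query_terms[1:]:
        candidates = set(match(term))
        new = {(s, p, o) for (s, p, o) in triples
               if (s in matched and (p in candidates or o in candidates))
               or (o in matched and (p in candidates or s in candidates))}
        if not new:
            break  # no adjacent pair matches; remaining terms are not explored
        subgraph |= new
        matched |= {s for (s, _, _) in new} | {o for (_, _, o) in new}
    return subgraph
```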
The paper would benefit from a more detailed example of how the ranking is done, so that it is easier to follow the explanation in Sections 3.2.3 and 3.3, e.g., following the style of the example in Figure 7 to show the weights assigned to the various nodes/edges.
The matched resources are then used to expand the query and perform a keyword search for the document ranking; in other words, the entities found are used for the document search (a rough sketch of my reading of this step is given after this paragraph). Why is this "interdependent" (e.g., the document search does not influence the entity search)? Also, if I understood correctly, the entities found in intermediate steps, which are not final answers (e.g., "I Love Trouble", a movie starring Julia Roberts but not directed by Garry Marshall), are used to find documents. If so, what is the impact of this approach in terms of precision/recall (versus just using the query terms and the final set of answers once all terms in the query are considered)?
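As a sketch of the expansion step referred to above, assuming each matched resource exposes a label and a set of synonyms (all names here are illustrative, not the paper's API):

```python
# Illustrative only: expand the keyword query with the labels and synonyms of
# the matched resources before running it against the document index.

def expand_query(query_terms, matched_resources, label_of, synonyms_of):
    """Union of the original query terms with the labels and synonyms
    of the matched semantic resources."""
    expanded = set(query_terms)
    for resource in matched_resources:
        expanded.add(label_of(resource))
        expanded.update(synonyms_of(resource))
    return expanded
```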
In the examples in Figures 8-12, it would be good to show the user's input query.
The evaluations are generally thorough and convincing, and I think this paper will potentially attract many citations because there are not many papers that perform user evaluations and consider usability; that is why, for me, this is the key contribution. There are, however, some limitations that perhaps should be mentioned and discussed:
- Only 20 NL queries are randomly selected from the DBpedia logs, which is a very limited sample. These are also fact queries, so they may bias the results presented here; e.g., for queries that are a mix of fact-based and document-based queries (i.e., not fully covered by DBpedia), users may find the hybrid approach less adequate.
- With respect to IR measures, SINFIO is compared against only fact or document retrieval alone. This is fine for evaluating that the hybrid approach improves over a standalone approach, but the drawback is that it is difficult to assess the performance of the presented approach with respect to other search engines / the state of the art.
Basically, while the approach presented here is interesting, some design choices have been made, for example: 1) when building the subgraph of triples connected by subjects/objects, connections through the same classes are excluded; 2) terms in the graph are searched in pairs, based on adjacency in the sentence, e.g., "director" and "Garry Marshall" (instead of "film" and "Garry Marshall"). The design choices are sufficiently justified in the paper, but their impact on precision/recall is not known. For example, properties are notoriously difficult to match and often ambiguous when users pose questions in NL; yet, to find the intermediate answers, properties such as "director" are used together with "Garry Marshall" in Figure 7. If no results are returned because the property could not be matched, the combination "film" and "Garry Marshall" is not explored further. It is difficult to assess the impact of these design choices on precision/recall because there is no other system to compare to.
While I do not expect the authors to include another evaluation, as their main purpose is to validate the hybrid approach with respect to a non-hybrid one, I would like to know why they did not use known evaluation benchmarks and gold standards, such as QALD (if limited to fact queries with no aggregations, comparisons, etc.) or any of the Freebase evaluation standards for simple queries. That would have made it easier to see whether the performance of SINFIO is "comparable" to other NL-based QA systems, at least in terms of fact retrieval for questions (even if the results across systems are not really directly comparable, it would give an idea of the P/R achievable on a well-known gold standard).
- Could you publish the 20 questions used in the evaluation?
- Could you comment on the performance of this approach in terms of the average time the system takes to return results to users for the given queries?
The hypotheses to be evaluated are nicely presented in Section 4.1 to guide the comparison of the hybrid versus the standalone approaches. Again, the only issue I find here is that only 3 of the 20 queries produced hybrid results, which is a very limited set from which to draw conclusions on the effectiveness of the approach. Also, the P/R values presented in Table 2 are quite low, and there is no baseline or other system to compare to (besides the standalone semantic document / fact retrieval, which has even lower P/R values). I am not sure what to make of these low values and would like to see some discussion of them. Why is the F-measure for the semantic document retrieval only 0.35? Why does the precision of SINFIO seem to be so low (good recall but less than 0.5 precision)? How could this be improved? Would a simple index-based document retrieval be more precise than a semantic document retrieval (with similar recall)? Is this because of the lack of methods to resolve ambiguity when no facts are retrieved?
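(Purely as an illustration of how the harmonic mean behaves, not the paper's actual values: since F = 2PR / (P + R), hypothetical values such as P = 0.25 and R = 0.6 already yield F ≈ 0.35, so a mediocre F-measure can hide a reasonable recall behind a poor precision.)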
The authors mention two adverse points, mainly regarding the completeness of the answers. I do not see this as a big issue, as the interface could clearly show when there are more than 10 results and thus allow the user to explore them (if I understood correctly, this is more of a UI issue that can be fixed). However, the low precision value is an important issue to discuss here.
The semantic autocompletion validation is based on 5 predefined questions. Could you explain how and why those 5 questions were selected? Metrics such as task duration are given here too, but again I am not sure how meaningful these are without a baseline to compare to.
Despite the low number of queries, I find the validation of the users' preference for hybrid results to answer their information needs convincing. However, I am less convinced by the validation of the semantic autocompletion. Semantic autocompletion, while useful for avoiding ambiguity, is known to be cumbersome when the knowledge bases to query are very large and ambiguous: the user can easily be overwhelmed when there is a long list of candidate completions for their sentence, or when their intended choice is either not there or is ranked very low in the list. It also requires users to express the query in a way that follows the ontology structure, which may imply having to reformulate the query several times. The 5 examples presented here may not reflect this issue, and I believe a larger set of examples is needed for this validation, as well as to show more clearly the advantages and disadvantages of autocompletion. For example, how do users react when autocompletion fails to present the right choice (in particular taking into account that the coverage of WordNet for finding lexically related words is limited, and that ambiguity introduces a lot of noise)?
In sum, I really like the topic of the paper and the direction taken to create and present a hybrid approach, and it is really good to see that a lot of attention has been paid to evaluating the usability aspects of the system with users. This is worthy of publication; however, the limitations should be noted and the discussion expanded on the various aspects raised above.