Querying Biomedical Linked Data with Natural Language Questions

Tracking #: 966-2177

Thierry Hamon
Natalia Grabar
Fleur Mougin

Responsible editor: 
Guest Editors Question Answering Linked Data

Submission type: 
Full Paper
Recent, intensive research in the biomedical area has made it possible to accumulate and disseminate biomedical knowledge through various knowledge bases that are increasingly available on the Web. Exploiting this knowledge requires creating links between these bases and using them jointly. Linked Data, the SPARQL language and natural language question-answering interfaces provide interesting solutions for querying such knowledge bases. However, while using biomedical Linked Data is crucial, life-science researchers may have difficulty using the SPARQL language. Interfaces based on natural language question answering are recognized as suitable for querying knowledge bases. In this paper, we propose a method for translating natural language questions into SPARQL queries. We use Natural Language Processing tools, semantic resources and the RDF triples description. We designed a four-step method which linguistically and semantically annotates the question, performs an abstraction of the question, then builds a representation of the SPARQL query and finally generates the query. The method was designed on 50 questions over 3 biomedical knowledge bases used in task 2 of the QALD-4 challenge framework and evaluated on 27 new questions. It achieves good performance, with an F-measure of 0.78 on the test set. The method for translating questions into SPARQL queries is implemented as a Perl module and is available at http://search.cpan.org/~thhamon/RDF-NLP-SPARQLQuery/.
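
The four steps named in the abstract (annotation, abstraction, query representation, query generation) can be illustrated with a minimal sketch. This is not the authors' Perl implementation: the lexicon, the `ex:` prefix, the predicate names and every function name below are hypothetical, chosen only to show how the steps could chain together.

```python
import re

# Hypothetical toy lexicon: surface term -> (semantic type, placeholder URI).
# None of these URIs come from Sider, Diseasome or DrugBank.
LEXICON = {
    "side effects": ("predicate", "ex:sideEffect"),
    "used for": ("predicate", "ex:treats"),
    "drugs": ("class", "ex:Drug"),
    "tuberculosis": ("entity", "ex:Tuberculosis"),
}


def annotate(question):
    """Step 1: tag lexicon terms found in the question, in order of appearance."""
    text = question.lower()
    found = []
    for term, (sem_type, uri) in LEXICON.items():
        match = re.search(r"\b" + re.escape(term) + r"\b", text)
        if match:
            found.append({"type": sem_type, "uri": uri, "pos": match.start()})
    return sorted(found, key=lambda a: a["pos"])


def abstract(annotations):
    """Step 2: drop the surface wording, keep semantic types and URIs."""
    return [(a["type"], a["uri"]) for a in annotations]


def build_representation(abstraction):
    """Step 3: arrange the abstraction into triple patterns (toy heuristic)."""
    classes = [u for t, u in abstraction if t == "class"]
    predicates = [u for t, u in abstraction if t == "predicate"]
    entities = [u for t, u in abstraction if t == "entity"]
    triples = []
    if classes:
        triples.append(("?x", "a", classes[0]))
    if predicates:
        # Assumption: the first predicate points at the expected answer.
        triples.append(("?x", predicates[0], "?answer"))
    for pred, ent in zip(predicates[1:], entities):
        triples.append(("?x", pred, ent))
    return {"select": "?answer", "triples": triples}


def generate(rep):
    """Step 4: serialize the representation as a SPARQL query string."""
    body = " .\n  ".join(" ".join(t) for t in rep["triples"])
    return (
        "PREFIX ex: <http://example.org/>\n"
        f"SELECT DISTINCT {rep['select']} WHERE {{\n  {body} .\n}}"
    )


def translate(question):
    return generate(build_representation(abstract(annotate(question))))


print(translate("What are the side effects of drugs used for Tuberculosis?"))
```

The heuristic in step 3 is deliberately naive; the paper's rewriting rules and use of the RDF triples description would replace it in a real system.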

Major Revision

Solicited Reviews:
Review #1
By Anca Marginean submitted on 26/Feb/2015
Minor Revision
Review Comment:

The paper is an extended version of "Description of the POMELO System for the Task 2 of QALD-4" CLEF2014, Working Notes.
It describes a system for querying linked data in natural language. The system uses a four-step method which creates linguistic and semantic annotations of the question, performs an abstraction of the question, builds a representation of the SPARQL query and generates the query. The system was evaluated on the questions over the biomedical datasets used in task 2 of the QALD-4 challenge, together with 27 new questions. It obtained good performance, with an F-measure of 0.78 on the set of new questions.

Compared to the article from the Working Notes (CLEF 2014), the paper adds the Related Work section and the new set of 27 questions. The originality of the system could be emphasized more strongly through a closer comparison with other pattern-based approaches for querying linked data, such as Quepy, a pattern-based Python package from Machinalis, or Intui3 by Corina Dima ("Answering natural language questions with Intui3", Working Notes CLEF 2014).

The quality of writing is good, but I think it can be improved.
In my opinion, a clearer view of the system's architecture could be obtained if Fig. 1 focused on the modules of the system and the resources used by the modules (in terms of APIs), instead of on steps. As it is drawn now, it looks more like a workflow than an architecture. Furthermore, "Number" and the datasets Sider, Diseasome and DrugBank are all represented as resources in the pre-processing phase, even though information is extracted from the datasets, while no information is extracted from the resource "Number". The relation between "Number", the datasets and the "linguistic and semantic annotation" step is clearly expressed in the text, but it could also be better represented in the image. The caption of Figure 1 mentions colors, but the image does not include any colored boxes.

The system uses rewriting rules. A formal description of these rules would improve the clarity of the paper.
For example, the sentence from section 4.1 "We exploited the documentation of this resource to define the rewriting rules and regular expressions for the named entity recognition" would be clearer if the authors gave more details about the way these rules are defined (even though section 3.2 describes some contextual rewriting rules).
Section 3.1 also mentions parsing patterns that were defined manually in previous work. In my opinion, more details about these patterns would help the reader understand the annotation step.

Overall, I think formal representations, maybe in terms of pseudocode for algorithms or function-based representations, could improve the clarity of the descriptions for all four steps of the used method.

Other observations are:
- the example from Figure 3 includes a different question than the one considered in the explanation from sections 3.2, 3.3. Many explanations from Section 3 make reference to the example from figure 2 of the previous article of the authors "Description of the POMELO System for the Task 2 of QALD-4".
Furthermore, in section 3.1, it is stated "Figure 3 illustrates the linguistic and semantic annotation of questions", but the linguistic annotations seem to be omitted.
- all SPARQL queries are built with the goal of querying linked data, so the separation of related work into sections 2.1 and 2.2 is, in my opinion, unjustified.
- in the Related Work section, the second paragraph begins with "another possible distinction", but it is not clear what the first distinction is.
- the sentence "A method based on modular patterns is designed to parse questions [21]." seems to be incomplete.
- in Conclusions, the authors express the intention to investigate how to automatically build the dedicated resources from the RDF schemas. Is there a way to rely for this on the new lexicalization layer proposed for ontologies in models like lemon (McCrae, J., Spohr, D., and Cimiano, P. (2011). Linking Lexical Resources and Ontologies on the Semantic Web with Lemon)?

In conclusion, I recommend minor revisions.

Review #2
By Jin-Dong Kim submitted on 26/Mar/2015
Review Comment:

The manuscript presents a SPARQL generation system from natural language questions. The design of the system looks reasonable and the performance is reported on the Biomedical task of the QALD-4 challenge.

While the design of the system itself is interesting, the biggest problem of the paper is that it does not prove anything, and it is unclear what the original contribution of the work is. The manuscript reads more like a technical report than a scientific paper.

As the authors list in the Related Work section, a number of previous works have already implemented similar systems, and it is clear that the presented system shares features with them. However, outside the Introduction and Related Work sections, the manuscript contains almost no citations of relevant work.

At the end of the introduction to section 3 (before 3.1), it says "Our approach is close to the one proposed by (Unger et al.) ... The main difference is that we use information issues from the linked data resources ...", which is the only clue to the originality of the presented work. However, there is no experimental result showing how the original part contributes to the overall performance.

Overall, the manuscript fails to show what the original contribution of the presented work is and how significant it is.

[Minor comments]

- (section 3.2) The manually developed rewriting rules look quite specific to the target data sets, which would make the system hard to generalize to other data sets. This is a clear limitation of the presented approach.

- (section 3.2) The section should cite previous works about template generation, e.g., Unger et al., and compare the approaches.

- (section 3.2) There is no definition of "Question topic". Why is "drug" the question topic of the example query? Isn't it "side effect"?

- (section 3.2) It would be good if step 3 were described with reference to other work on data set profiling.

- (section 3.2) In the explanation of step 3, where is the predicate "state" in the example query?

- (section 3.3) The query construction process should be explained more precisely, e.g., by presenting the algorithm.

- (Fig 5 and 7) It is hard to understand the figures. Please give a detailed explanation.

- (section 4.2) What if a data set does not come with an RDF schema?

- (Figure 10) The color scheme is hard to read, particularly for those with B/W printers. Please use patterns.

- (section 6) At the end of the section, the authors list some further work. However, at least the "limitations" should have been addressed in the manuscript rather than left for future work.

- Overall, the manuscript should be checked by a native English speaker.

Review #3
Anonymous submitted on 25/Apr/2015
Major Revision
Review Comment:


In this paper, the authors present a method for translating natural-language questions into SPARQL queries that can be used against biomedical knowledge bases containing linked data in order to retrieve answers. The topic of the paper is relevant to the Semantic Web Journal in general and to the special issue on Question Answering over Linked Data in particular.


Question answering over linked data, in particular, over biomedical knowledge bases, is a topic of theoretical and practical significance. The methodology presented in the paper represents a contribution in this regard.


Overall, the paper is relatively well-organized, with appropriate Introduction, Conclusion, and Related Work sections besides the three main substantive sections where the authors describe and discuss the question translation methodology, the semantic resources used in the translation process, and the evaluation experiments and results. The paper also includes an adequate, albeit not extensive, number of relevant references. However, the related works discussed and the references included are mainly focused on question translation and question answering over linked data. Considering that the authors describe the application of their methodology to the domain of biomedical question answering, at least some discussion/inclusion of domain-relevant issues/references is deemed appropriate.


The paper contains appropriate figures to help describe and illustrate the proposed methodology. In particular, the use of detailed figures that describe the proposed natural-language-to-SPARQL conversion process at each progressive step aligns well with, and facilitates the understanding of, the textual description of the process. However, presentation could be further improved by including an end-to-end illustration of an example case showing the intermediate results of each processing step. Also, it would be appropriate to attach an appendix listing natural-language questions in the test set and the (correct/incorrect) results of their translation into SPARQL queries using the methodology as well as the answers retrieved by using the (correctly translated) queries.

Language & Writing Style:

Although the language could be slightly improved (by consistently using grammatically correct and idiomatically appropriate expressions, e.g., translation into vs. translation in), the paper is overall readable and comprehensible.

Technical Content:

Overall, the proposed methodology seems sound, and the authors describe the methodology relatively clearly and in detail. However, some steps of the translation process using the methodology seem dubious/ambiguous or at least not presented with sufficient clarity.

For example, on p. 6, in the right column, in describing the identification of predicate and argument for a given example question ("What [are] the side effects of drugs used for Tuberculosis?" in Fig. 3), the authors write: "In the example from Figure 3, the predicate state with the expected arguments drugbank/drugs and Gas/String is recognized." However, it is not clear how the predicate "state" is identified, nor is it obvious that "state" is the most appropriate predicate given the semantic content of the given question. Even more puzzling is the "Gas/String" part, which at best would describe the expected answer and its data type but seems out of place in the given context of description.

The aforementioned example also illustrates what seems to be generally lacking in the paper, namely, some effort into the semantic analysis and classification of the characteristics of biomedical questions (and corresponding answers). For example, the example question above could be abstracted as a question of the cause-effect category, with the canonical form -causes-, which in this case asks for the side effects caused by the drugs used for Tuberculosis.
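
The kind of question classification suggested above can be sketched as follows. This is only an illustration of the idea, not something taken from the paper: the category names, the regular expressions and the `classify` helper are all hypothetical, and a real classifier would need a far richer inventory of biomedical question types.

```python
import re

# Hypothetical category patterns, tried in order; purely illustrative.
CATEGORY_PATTERNS = [
    ("cause-effect",
     re.compile(r"\bside effects? of (?P<cause>.+?)\??$", re.IGNORECASE)),
    ("treatment",
     re.compile(r"\b(?:drugs?|treatments?) (?:used )?for (?P<target>.+?)\??$",
                re.IGNORECASE)),
]


def classify(question):
    """Return (category, captured slots) for the first matching pattern."""
    for category, pattern in CATEGORY_PATTERNS:
        match = pattern.search(question.strip())
        if match:
            return category, match.groupdict()
    return "unknown", {}


# The outer question is a cause-effect question whose cause slot is itself
# a treatment question, so the abstraction can be applied recursively.
outer = classify("What are the side effects of drugs used for Tuberculosis?")
inner = classify(outer[1]["cause"])
print(outer)
print(inner)
```

Under these assumptions, the example question is abstracted as a cause-effect question nested over a treatment question, which makes the expected predicate explicit instead of leaving it to an opaque label such as "state".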

Minor Correction:

The beginning of the caption for Fig. 1 reads: "The global architecture of the system (the processing steps are in yellow, the resources in blue)." However, the figure is rendered in plain black-and-white and is not colored as described.

Suggested Improvements:

The authors are encouraged to revise the paper, both in content and form, taking the reviewer’s comments into consideration.