Question Answering over BioMedical Linked Data with Grammatical Framework

Tracking #: 1017-2228

Anca Marginean

Responsible editor: 
Guest Editors Question Answering Linked Data

Submission type: 
Full Paper
Blending linked data with ontologies improves access to data. GFMed introduces grammars for a controlled natural language targeted towards biomedical linked data and the corresponding controlled SPARQL language. The grammars are described in Grammatical Framework and introduce linguistic and SPARQL phrases, mostly about drugs, diseases and the relationships between them. The semantic and linguistic chunks correspond to Description Logic constructors. Problems and solutions for querying the datasets in Romanian, besides English, are described in the context of Grammatical Framework.

Minor Revision

Solicited Reviews:
Review #1
By Dana Dannells submitted on 03/Mar/2015
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality,
The paper presents a controlled language system for querying biomedical linked data in GF. The system uses the three datasets that were proposed in QALD-4, Task 2. It has been demonstrated in both English and Romanian.

The statement in Section 2, "GF libraries do not include resources for SPARQL", is not quite right: a SPARQL resource was developed and released in the MOLTO project, see for example:

There has been a grammar implementation in the biomedical domain before which is not mentioned in the article: Kuhn et al., “Improving Text Mining with Controlled Natural Language: A Case Study for Protein Interactions”. In contrast to the system of Kuhn et al., this is the first system that follows the Linked Data principles within this domain and applies them to Romanian. The originality of the work presented here would be more convincing if some comparison with the system of Kuhn et al. were provided.

(2) significance of the results
The author proposes an elegant system architecture which seems to be rather robust. It successfully parsed all questions in the dataset given in QALD-4, Task 2, but the contribution of the paper is not very convincing. It is unclear how well the system scales. What effort would be needed to cover a larger set of queries, and what would it take to port the system to a new domain?

To what extent does the grammar architecture follow the description logic family ALC? For example, is it possible to include number restrictions such as "has only 5 side effects"? How are restrictions on individuals encoded? Is it possible to express queries such as "all drugs without side effect X" or "drugs with solubility of X and without solubility of Y"?
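To make these questions concrete, the following is a minimal sketch of the SPARQL that such queries would have to be mapped to. Negation falls within ALC, whereas number restrictions go beyond it. The `ex:` namespace, `ex:Drug`, `ex:sideEffect` and both function names are hypothetical and are not taken from the paper or from the QALD-4 datasets.

```python
# Hypothetical vocabulary: the ex: namespace, ex:Drug and ex:sideEffect are
# illustrative names only, not the identifiers used by the paper's datasets.
PREFIX = "PREFIX ex: <http://example.org/vocab#>"


def drugs_without_side_effect(effect: str) -> str:
    """SPARQL for 'all drugs without side effect X' (negation)."""
    lines = [
        PREFIX,
        "SELECT DISTINCT ?drug WHERE {",
        "  ?drug a ex:Drug .",
        # SPARQL 1.1 expresses the negated restriction with FILTER NOT EXISTS.
        "  FILTER NOT EXISTS { ?drug ex:sideEffect ex:%s }" % effect,
        "}",
    ]
    return "\n".join(lines)


def drugs_with_exactly_n_side_effects(n: int) -> str:
    """SPARQL for 'drugs that have exactly n side effects' (number restriction)."""
    lines = [
        PREFIX,
        "SELECT ?drug WHERE {",
        "  ?drug a ex:Drug ; ex:sideEffect ?effect .",
        "}",
        # Counting distinct fillers per drug emulates the cardinality restriction.
        "GROUP BY ?drug",
        "HAVING (COUNT(DISTINCT ?effect) = %d)" % n,
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(drugs_without_side_effect("Nausea"))
    print(drugs_with_exactly_n_side_effects(5))
```

Whether the controlled language can produce either pattern is exactly what the manuscript should state.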

More specific questions concerning SPARQL statements are: How does the grammar deal with optional statements? Are there mechanisms to adjust the output with statements such as ORDER BY or LIMIT X?
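As an illustration of the constructs meant here, a minimal sketch of a query builder that adds OPTIONAL patterns and the ORDER BY and LIMIT solution modifiers is given below. The function `build_select`, its parameters and the `ex:` vocabulary are assumptions made for this example and do not describe the system under review.

```python
from typing import Optional, Sequence


def build_select(
    var: str,
    required: Sequence[str],
    optional: Sequence[str] = (),
    order_by: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Assemble a SELECT query from triple patterns and solution modifiers."""
    # Hypothetical namespace used only for this illustration.
    lines = ["PREFIX ex: <http://example.org/vocab#>"]
    lines.append("SELECT DISTINCT ?%s WHERE {" % var)
    # Required patterns must match for a solution to be returned.
    lines.extend("  %s ." % pattern for pattern in required)
    # Optional patterns extend a solution when they match, but never remove it.
    lines.extend("  OPTIONAL { %s }" % pattern for pattern in optional)
    lines.append("}")
    # Solution modifiers follow the WHERE clause, ORDER BY before LIMIT.
    if order_by is not None:
        lines.append("ORDER BY ?%s" % order_by)
    if limit is not None:
        lines.append("LIMIT %d" % limit)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_select(
        "drug",
        required=["?drug a ex:Drug"],
        optional=["?drug ex:name ?name"],
        order_by="name",
        limit=10,
    ))
```

A controlled-language phrase such as "the first ten drugs, ordered by name" would need a counterpart for each of these modifiers in the SPARQL concrete syntax.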

One of the decisions taken in Algorithm 1 (page 9) is the choice of the tree "with minimal length". Is the smallest tree always the right tree, i.e. the right analysis of the sentence? What is this assumption based on? A reference to previous work or a motivation for this choice could be added.
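The concern can be made concrete with a small sketch. Since the manuscript does not define "length", the sketch assumes it means the number of function symbols in the abstract syntax tree, written as a parenthesised string. The function names and the example trees are hypothetical.

```python
import re
from typing import List


def tree_size(tree: str) -> int:
    """Count the symbols in a parenthesised abstract tree such as 'f (g x) y'.

    Assumption: 'length' is read as the number of function symbols and
    leaves, which is only one possible interpretation.
    """
    return len(re.findall(r"[^\s()]+", tree))


def pick_smallest(trees: List[str]) -> str:
    """Return the tree with the fewest symbols.

    min() keeps the first candidate among equally small trees, so ties are
    resolved by parser output order rather than by any linguistic criterion.
    """
    return min(trees, key=tree_size)


if __name__ == "__main__":
    candidates = [
        "QDrugs (HasSideEffect (Name XX))",  # hypothetical analysis A
        "QDrugs (SideEffectOf XX)",          # hypothetical analysis B
    ]
    print(pick_smallest(candidates))
```

Under this reading, two analyses of equal size are indistinguishable, which is why a justification of the heuristic is needed.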

The author lists several limitations in the Romanian resource grammar (Section 4). Were these due to direct translations from English? The number of terms extracted in Romanian is 1815. How accurate are these terms? Were they evaluated?

(3) quality of writing.
The paper is clearly written and well presented. There is a clear description of the functions and the categories of the system.

Was the grammar released as a part of the GF RGL? Figure 12 illustrates a user-friendly front-end for the system. Is it freely available?

General comment: all acronyms should be expanded the first time they are introduced, for example "OMIM, ICD, MeSH, UMLS, MedDRA.." on page 11.

A few typos:
page 6, second paragraph on the right-side: "Since classese .." -> "Since classes .."
page 9, paragraph under Algorithm 1: "needed because is is .." -> "needed because it is .."
page 14, last paragraph: "as natural languages .. " -> "as natural language .. "

Review #2
By Kaarel Kaljurand submitted on 06/Mar/2015
Major Revision
Review Comment:

This paper describes a natural language query system over biomedical linked data. The (controlled) natural language component of the system is implemented in Grammatical Framework and is available in English and Romanian. The system maps the natural language input to the corresponding SPARQL query for execution. The focus of the paper is on the implementation of the biomedical domain specific grammar.

I find the topic (querying biomedical data) important and the technology choice (multilingual CNL implemented in GF) suitable. However, the presentation of the research results is poor, thus a major revision is recommended.

Originality. Natural language interfaces to SPARQL have been done before (also multilingually and in GF), as the paper's related work section also discusses. The application to the biomedical domain is possibly new but the paper fails to discuss the novel aspects of this application domain, which would not directly follow from the existing general approaches.

The significance of the results is hard to estimate. No actual controlled natural language is presented in the paper (apart from a few example sentences in some figures), i.e. the reader does not get a clear idea of the query language regarding its syntax, semantics, control of ambiguity, and coverage of SPARQL. There is very little evaluation of the developed CNL (how easy is it to read, write, etc.?) or of the whole translation and query system. There is no link to the (open source) implementation of the system (grammar and the query UI from Fig. 12).

Quality of writing. The structure of the paper makes it hard to follow. Also, the English should be improved.

More detailed comments.

I would propose the following structure, which would make the paper easier to follow.

1. Introduction/Goals
2. Related work
3. Controlled English for querying biomedical data: syntax, semantics (as mapping to SPARQL), ambiguity handling
4. Controlled Romanian for querying biomedical data: ...
5. Multilingual implementation of these CNLs in GF, incl. lexicon building from existing resources
6. Usage examples. Possibly user interface considerations (is there a look-ahead editor to help with the query entry, how is ambiguity communicated to the user, etc.)
7. Evaluation of benefits over plain SPARQL input or graphical user interfaces
8. Future work

The current section 3 (specifically 3.1-3.3) is poorly structured and the details remain unclear, especially the role of DL in the system.

The handling of out-of-grammar expressions (in Algorithm 1) by replacing the last word by "XX" and reparsing could be replaced by a more general method. Recent developments in GF support robust parsing which might be the right solution in this case. Also, ambiguity resolution by picking the tree with the "smallest length" needs more motivation. Also, it remains unclear how "smallest length" is defined. Algorithm 1 contains unnecessary technical details ("GF Rest Service"), inconsistent indentation and notation (capitalization, function call notation etc.). Also, avoid "!" as a negation operator. The algorithm is called "English2SPARQL". Would it look any different when applied to Romanian?

Figure/table captions should contain more information about what is shown in the figure/table.

It would be useful to have a small summary of QALD-4 before discussing the evaluation. There should be a separation of a development set (the set of queries, use cases, etc. for which the system was optimized) and a test set (which the system had not seen before), otherwise the evaluation results are not very meaningful. There could also be an evaluation of how the different formalisms (SPARQL vs. CNL) represent the 25 test questions, and insight into why CNL outperforms SPARQL in terms of readability/writability (if that is the case).

The lexicon building section could present the English and Romanian lexicons in parallel, e.g. there could be a table that lists the counts of words and word forms in each language for each concept (drug, disease).

Fig. 12 seems to present a developer UI and doesn't even present the query results. In addition (or instead) there could be a figure showing the envisioned end-user query interface. An end-user UI would probably not display two CNLs in parallel, would offer a disambiguation dialog (in case there are multiple abstract trees), would not show any technical details (SPARQL, abstract trees) and would present the query results in a format that is easily relatable back to the CNL input.

In general, the paper could focus more on the multilingual aspects and specifically answer the question of how easy it would be to add another language to the system considering the available resources (biomedical databases for words and GF's resource grammar library for syntactic structures).

Smaller issues

Fig. 5: caption is not in English

Table 2: use more consistent DL terminology, e.g. class expressions and assertions

"object properties relate two concepts" is incorrect, better: "an object property constructs a class from a restriction and another class"

use monospace font for code examples

double "f" (e.g. in "SideEffect") is badly formatted

"rdfs : label" and similar expressions are badly formatted

typo: classese

typo: liniarization

Fig 7: ":" missing on the 2nd line

Review #3
Anonymous submitted on 25/Apr/2015
Minor Revision
Review Comment:

The paper is an interesting example of using CNLs for a shared task open to other formalisms. The results reported are impressive, but some minor additions should be taken into consideration:
- in the Introduction, the usage of a CNL is not very clearly motivated, so maybe the part that introduces it could be rewritten in a clearer way
- among the GF papers cited as related work, some missing references are:
+Angelov, Krasimir; Enache, Ramona: Typeful Ontologies with Direct Multilingual Verbalization (which also features verbalization in English and Romanian)
+Davis, Brian; Enache, Ramona; Grondelle, Jeroen van; Pretorius, Laurette: Multilingual Verbalisation of Modular Ontologies using GF and lemon (which contains more technical details and a more similar approach than reference [7])
+Dannélls, Dana; Enache, Ramona; Damova, Mariana: Multilingual Retrieval Interface for Structured Data on the Web (contains more info about a SPARQL library in GF and a similar approach) - this should also be discussed in section 2.1, instead of claiming that "GF libraries do not include resources for SPARQL"
- section 3.4 (Pre- and Post-processing) presents an approach which is overly complicated for its purpose. Instead of going back from the end of the string and replacing word by word (a rather naive and inefficient approach), one could use the morphological analysis from GF and replace with placeholders just those words that cannot be found in the lexicon. This would invoke the parser only once at the end, if it is only the names that should be abstracted over, and would also give a considerable increase in speed.
- the comments from section 4 about the Romanian grammar seem rather out of place, especially since the author doesn't even cite the original paper about the Romanian grammar:
+Enache, Ramona; Ranta, Aarne; Angelov, Krasimir: An Open-Source Computational Grammar for Romanian
Claiming that "all the nouns in genitive are wrongly built without the possessive article" and choosing a less idiomatic solution instead seems quite odd, given that the GF grammars are open-source and everyone is welcome to contribute. Under these conditions, the author could just fix the problem with the genitive, which they claim to have observed, or build an extension that fits the purpose of their domain.
- the Romanian examples from the system don't sound idiomatic enough and it is quite unlikely that the system can be used to parse queries directly, unless many other variants are implemented but not shown.
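
The suggestion on section 3.4 above can be sketched as follows. This is only an illustration under stated assumptions: a plain Python set stands in for GF's morphological lookup, the placeholder "XX" is the one used in the manuscript's Algorithm 1, and the function name `abstract_unknown` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def abstract_unknown(
    tokens: Iterable[str], lexicon: Set[str], placeholder: str = "XX"
) -> Tuple[List[str], List[str]]:
    """Replace each run of out-of-lexicon words by a single placeholder.

    Returns the abstracted token list, which can be sent to the parser once,
    and the list of names that the placeholders stand for.
    """
    abstracted: List[str] = []
    names: List[str] = []
    current: List[str] = []  # words of the unknown run being collected

    def flush() -> None:
        # Close the current unknown run, if any, as one multi-word name.
        if current:
            names.append(" ".join(current))
            abstracted.append(placeholder)
            current.clear()

    for token in tokens:
        # The set lookup stands in for a morphological analysis of the word.
        if token.lower() in lexicon:
            flush()
            abstracted.append(token)
        else:
            current.append(token)
    flush()
    return abstracted, names


if __name__ == "__main__":
    lexicon = {"which", "drugs", "interact", "with"}  # toy lexicon
    question = "Which drugs interact with Acetylsalicylic acid".split()
    print(abstract_unknown(question, lexicon))
```

With this pre-processing the sentence is scanned once and parsed once, instead of being reparsed after every word-by-word replacement.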

The work is promising and well-presented but it would be highly appreciated if the author would provide proper references about the GF side.