Question Answering over BioMedical Linked Data with Grammatical Framework

Tracking #: 1125-2337

Authors: 
Anca Marginean

Responsible editor: 
Guest Editors Question Answering Linked Data

Submission type: 
Full Paper
Abstract: 
The blending of linked data with ontologies leverages the access to data. GFMed introduces grammars for a controlled natural language targeted towards biomedical linked data and the corresponding controlled SPARQL language. The grammars are described in Grammatical Framework and introduce linguistic and SPARQL phrases mostly about drugs, diseases and relationships between them. The semantic and linguistic chunks correspond to Description Logic constructors. Problems and solutions for querying biomedical linked data with Romanian, beside English, are also considered in the context of GF.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Christina Unger submitted on 17/Aug/2015
Suggestion:
Minor Revision
Review Comment:

Remaining issues
----------------

* One of the reviewers mentioned that he is missing a clear account of the syntax and semantics of the CNL as well as its evaluation. I think there is a slight mismatch between how the term CNL is usually used and how it is understood in your paper, and I think it can be resolved by clearly stating your goal. Usually, a CNL is a fragment of natural language designed to be easier to understand and/or learn. As I understand it, this is not your main goal. In your case, the CNL is more or less given by the QALD benchmark, and you construct a grammar for question answering that covers these questions and that is evaluated on them. Stating this more clearly in the introduction will avoid misunderstandings.

* Also, the question of how much of the SPARQL language is covered by the grammar is not clearly answered by the paper. Is it full SPARQL 1.1? And if not, what is not covered (e.g. property chains)?

* Figure 3 (to which you refer on page 3) is missing.

* On page 10, you say that the translation can be done in both directions (NL -> SPARQL and SPARQL -> NL) if no pre/post-processing is involved, and then you say: "For example, the next query is identified as being the SPARQL linearization of 17 different abstract trees." I think this is very confusing. Is it an example of the direction NL -> SPARQL? In that case, one would expect the natural language question together with the query. And why does the number of abstract syntax trees matter here?

* On page 11, you mention using GF's morphological analysis as a preprocessing alternative. You should additionally say that this will be explored as future work (or something similar), because as it stands the reader must wonder why you didn't use it if it's a more efficient alternative.

* Regarding Table 5, you should mention that these are results on the test set of QALD-4.

* In Section 4, you should avoid having only one subsection (4.1). Either you separate it into more, or you keep everything in one flowing text. The same holds for the paragraph "Results" in 4.1.

Consistency
-----------

* You use different capitalizations of "Linked Data" vs "linked data" and "Sider" vs "SIDER". Just stick to one.

* For enumerations in the text, you sometimes use (a), (b) or (i), (ii) or i), ii) or (1), (2). I think it's better to pick one version and use it throughout the paper.

* Once you write "sameAs" and once "owl:sameAs". I would always use the latter.

Formatting
----------

* One of the reviewers suggested using a monospace font (e.g. \tt) for all code (including abstract syntax functions and trees, SPARQL expressions, etc.) and I agree with him. The way the code is italicized now, both in the figures and in the text, is not very pretty or pleasant to read. Also, in the text you are not always consistent in italicizing or not, and words like "SideEffects" are not formatted correctly -- using \tt would solve these issues as well.

* There are a few places where you have long expressions or code examples in the text. Instead of inlining them, I would use a \begin{center}...\end{center} environment; this would make them easier to read and digest. This mainly affects:
- "Triplet : Type ...", "PropertyT : Type ..." and "Statement : Type ..." on page 3
- "(WithPossibleDrugsCriterion ...)" and "?dis ds:diseasome ..." on page 7
- "[GiveSiderProperty ...]" on page 8

* The grey font you use for comments could be a little bit darker, so that it's easier to read when the paper is printed.

* Just from an aesthetic point of view: You never use vertical lines in tables, except in Table 1. I would remove them there as well.

Typos, grammatical errors, and reformulation proposals
------------------------------------------------------

Title

* BioMedical --> Biomedical

Abstract

* beside English --> besides English

Page 1

* A large number of data --> A large amount of data
* domains such as government data --> domains such as government
* Frequently, the linked data are described with the use of large terminologies --> Frequently, linked data is described by means of large terminologies.
* the lack of previous and detailed knowledge --> the lack of detailed knowledge
* In the same time, being a restricted natural language, building it with --> At the same time, building a restricted natural language with
* ontologies and more recently --> ontologies, and more recently
* CHILL system --> The CHILL system
* intuitive representation to formal representations --> intuitive representations of formal representations
* by making a trade-off --> with a trade-off
* is done in [12] --> can be found in [12]
* we propose a system (GFMed) --> we propose a system, GFMed,

Page 2

* are biomedical data --> is biomedical data
* of the Question Answering over Linked Data (QALD-4) --> of the Question Answering over Linked Data challenge (QALD-4)
* and gives chemical --> and contains chemical
* disease-gene network --> disease-gene networks
* In case of SPARQL concrete grammar --> In case of the SPARQL concrete grammar
* considered for CNL's semantic --> considered as the CNL's semantics
* preffered --> preferred

Page 3

* In the "Triplet : Type" and "Statement : Type" examples in the text, I would leave out the final ";" (as you do for PropertyT).
* values for the property --> values of the property
* inside the FILTER expressions --> inside of FILTER expressions
* You call ORDER BY, LIMIT etc. "optional statements", but I think "solution modifiers" would be a better term.
* patterns, as --> patterns, such as

Page 4

* applies on a string --> applies to a string
* Again, I would rather say "solution modifiers" than "optional SPARQL clause" (unless you mean optional clauses, i.e. OPTIONAL {...}).
* and on a graph pattern --> and to a graph pattern
* aims for --> aims at
* Instead of separating the example by a colon, I would use brackets. That is, "by their name (e.g. lepirudin, rickets, fever)" instead of "by their name: lepirudin, rickets, fever".
* table 1 --> Table 1
* entities references --> entity references
* You say "in order of tens" -- why not just give an exact number?
* Next section --> The next section
* DL based questions --> DL-based questions

Page 5

* from a DLs perspective --> from a DL perspective
* Fig. 5: GF English lib --> GF English Library
* within linearization of different restriction functions --> when linearizing different restriction functions
* case in which --> in which case

Page 6

* Table 2: Sider-Property --> SiderProperty (Maybe you could extend the left column, so DrugBankProperty and DiseasomeProperty fit on one line?)
* Table 2: by an Y --> by a Y
* Table 2: I would put "Lepirudin" and "drugs that target Prothrombin" in brackets instead of separating it by a colon.
* table 3 --> Table 3
* their DL corresponding expression --> their corresponding DL expression

Page 7

* in more ways --> in more than one way
* by GF library --> by the GF library
* restriction on the property --> restrictions on the property
* SPARQL Linearization --> SPARQL linearization
* restriction on the inverse property --> restrictions on the inverse property
* allow statements --> allow for statements

Page 8

* This is identified as --> This is parsed as
* either on datatype or object property --> either on datatype or object properties
* current version of CNL --> current version of the CNL

Page 9

* build on them --> build from them
* their linearization in SPARQL --> their linearizations in SPARQL
* domain dependent and independent --> domain-dependent and -independent
* makes facile --> facilitates
* Fig. 10: There is a space missing between "sd:sider/sideEffect" and "?vp"?
* applied on one class --> applied to one class
* or on a list of classes --> or to a list of classes
* deal mostly --> mostly deal
* An exception to this rule is the question WhatPropertyValue --> An exception to this rule is the function WhatPropertyValue
* This question treats PropertyClass --> This function applies to arguments of type PropertyClass

Page 10

* The advantage of taking the described approach --> The advantage of the described approach
* in composition of trees/-constructors --> in the composition of trees and tree constructors
* but also for SPARQL queries to natural language questions --> but also for translating SPARQL queries into natural language
* consumes GF translation service --> interfaces with the GF translation service

Page 11

* from QALD test set --> from the QALD test set
* split by --> split at
* Generated Lexicons --> Generated lexicons
* For Genes --> For genes
* Table 5: Results for GFMed in Task2 of QALD4 --> Results of GFMed in Task 2 of QALD-4
* QALD4 --> QALD-4
* overall evaluation from the table 5 --> overall results shown in Table 5

Page 12

* DB01577 so it missed --> DB01577, so it missed
* differently compared to --> differently from
* concrete grammars for English --> concrete grammar for English
* beside English --> besides English
* translated manuall in Romanian, resulting in two lexicons, one for each language. --> translated manually into Romanian.
* the Romanian lexicons --> the Romanian lexicon
* in Romanian lexicon --> in the Romanian lexicon
* From efficiency reasons --> For efficiency reasons
* using GF library --> using the GF library
* that addition of other languages --> that the addition of other languages
* domain specific terminology --> domain-specific terminology

Page 13

* Bio2RDF is a project that aims at providing Linked Data for the Life Sciences [4] --> Bio2RDF [4] is a project that aims at providing Linked Data for the Life Sciences
* formalized with RDF --> formalized in RDF
* building Romanian lexicon --> building a Romanian lexicon
* If in case of other languages --> For other languages
* solution, in case of Romanian --> solution, but in case of Romanian
* version for ICD-10AM --> version of ICD-10AM
* follows the next steps --> follows the following steps
* Capitalize the "The" in all four steps, and add a "." at the end of step 1.
* ICD classified disorders --> ICD-classified disorders

Page 14

* filtering by name --> filtering by the name
* type ii phrase --> the phrase type ii
* different than --> different from
* are appropriate --> might be appropriate (Since you didn't try and evaluate it.)
* with richer lexical layer --> with a richer lexical layer
* 98% of classes and 20% of properties --> 98% of the classes and 20% of the properties
* Common with this approach, --> Similar to this approach,
* in similar way --> in a similar way
* semantic of the aimed linked data --> semantics of the targeted linked data
* result in automatic derivation --> in the automatic derivation
* SQUALL [10] is another controlled natural language that allows SPARQL queries and updates and it relies on Montague grammar. --> SQUALL [10] is another controlled natural language that allows for a translation into SPARQL queries, relying on Montague grammar.
* biomedical domain calls for --> the biomedical domain calls for
* for biomedical domain --> for the biomedical domain
* are Linked Data --> is Linked Data
* syntax, semantic --> syntax, semantics
* An incremental built --> An incremental construction
* and it uses --> but uses
* withing --> within

Page 15

* resources existing --> existing resources
* language, hamper --> language hamper
* In the same time --> At the same time
* built within Grammatical Framework --> built with Grammatical Framework
* makes possible extension [...] about medical publications --> makes an extension [...] about medical publications possible
* Fig. 13: version for GFMed, with Romanian term --> version of GFMed, with the Romanian term
* categories (2) --> categories, (2)
* either from one criteria to a class, either from --> either from one criteria to a class, or from
* from SPARQL resource --> from the SPARQL resource
* in automatic derivation of the GF functions --> in the automatic derivation of GF functions
* addresses also --> also addresses
* near English --> in addition to English

References

* Please check all titles. They are automatically converted to lower case by BibTeX, unless you enclose them with curly brackets, which results in "gf", "Bio2rdf", "dbpedia", and possibly lots more. (In [14] and [16], for example, it's correct.)
* Also, in [13], one of the names is not displayed correctly: "Schrder". (Probably just use \"o instead of ö.)
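
To illustrate both points, a minimal sketch of a protected entry could look as follows. The entry key, title, co-author, and venue are made up for illustration and are not taken from your bibliography; only the brace protection of capitalized names and the \"o escape matter here. With most bibliography styles, the braced parts should keep their capitalization.

```bibtex
% Hypothetical entry for illustration only; not an actual reference of the paper.
@inproceedings{example2014,
  % Braces around the escaped umlaut keep the name intact during sorting and formatting.
  author    = {Schr{\"o}der, Firstname and Coauthor, Another},
  % Braces around proper names stop the style from lowercasing them.
  title     = {Querying {DBpedia} and {Bio2RDF} with {GF}},
  booktitle = {Proceedings of a Hypothetical Workshop},
  year      = {2014}
}
```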

