Question Answering with Deep Neural Networks for Semi-Structured Heterogeneous Genealogical Knowledge Graphs

Tracking #: 2710-3924

Omri Suissa
Maayan Zhitomirsky-Geffet
Avshalom Elmalech

Responsible editor: 
Special Issue Cultural Heritage 2021

Submission type: 
Full Paper
With the rising popularity of user-generated genealogical family trees, new genealogical information systems have been developed. State-of-the-art natural question answering algorithms use deep neural network (DNN) architecture based on self-attention networks. However, some of these models use sequence-based inputs and are not suitable to work with graph-based structure, while graph-based DNN models rely on a high level of comprehensiveness of knowledge graphs that is nonexistent in the genealogical domain. Moreover, these supervised DNN models require training datasets that are absent in the genealogical domain. This study proposes an end-to-end approach for question answering using genealogical family trees by: 1) representing genealogical data as knowledge graphs, 2) converting them to texts, 3) combining them with unstructured texts, and 4) training a transformer-based question answering model. To evaluate the need for a dedicated approach, a comparison between the fine-tuned model (Uncle-BERT) trained on the auto-generated genealogical dataset and state-of-the-art question-answering models was performed. The findings indicate that there are significant differences between answering genealogical questions and open-domain questions. Moreover, the proposed methodology reduces complexity while increasing accuracy and may have practical implications for genealogical research and real-world projects, making genealogical data accessible to experts as well as the general public.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 25/May/2021
Review Comment:

The article describes a fine-tuned DNN model for question answering in the genealogical domain. In the evaluation, this model is compared against state-of-the-art models for question answering, and the results show that question answering over genealogical data differs from open-domain question answering.
Chapter 2, Related Work, introduces the GEDCOM genealogical data standard for representing the information of family trees, as well as the general DNN approaches used for question answering.
The results are state of the art and provide a good starting point for future development.

How would the model handle cases with missing family relationships, e.g. siblings with missing parent information?
Why is the performance of the Uncle-BERT_{0} and Uncle-BERT_{1} models so low even on zero-degree questions, although this is precisely the data they were trained on?
There is a significant improvement in the second-level model Uncle-BERT_{2} compared to the zero- and first-level models. As future work, it would be interesting to see what happens if it is further extended, e.g. to the third or fourth level.
Are the place name variations (NY, NYC, New York) the only reason for the low performance on question answering related to place names? Personally, in the evaluation section I would have liked to see examples of sentences in the date or place categories where the interpretation was correct or incorrect.

Generally, the article is well written, it gives a comprehensive review of the related work and theoretical background, and the methods used are described in detail. The results are clearly explained, and relevant guidelines for future work are given.

Review #2
By Ricardo Usbeck submitted on 19/Jul/2021
Major Revision
Review Comment:

The article “Question Answering with Deep Neural Networks for Semi-Structured Heterogeneous Genealogical Knowledge Graphs” fits into the topic from the call of papers “Knowledge-Driven NLP for Digital Humanities” as well as “Machine Learning for Knowledge Graphs in Digital Humanities”.

Originality: The work is original as far as I can tell and could open a new path of downstream NLP tasks.

Significance of the results: For genealogy, this model provides an important stepping stone not only for question answering but also for other neural network-based NLP tasks in the respective domain (if the model were made available).

Quality of writing: The paper is well-written and easy to follow, for both CS and digital humanities readers.

Long-term stable URL for resources assessment:

The paper presents a novel DNN model for QA over genealogical data. Neither the raw GEDCOM data, nor the KG, nor the Gen-SQuAD data, nor the fine-tuned QA model is publicly available. None of the criteria below can be assessed, as the data is protected under the European General Data Protection Regulation (GDPR) and the Israeli Protection of Privacy Regulations. The model, the vocabulary, the tokenizer config, the special-character mapping, and other configurations are available to the reviewers. We trust the authors to publish this data after acceptance. This allows replicability in the sense that users would need to use the presented model and their own dataset to calculate numbers; however, the numbers from the paper cannot be reproduced. The assessment of the long-term stable URL (which is currently not long-term stable) is: (A) the folder is clean but has no README and thus no instructions on how to load the data. Sample Python code on how to load the data and a toy GEDCOM tree would be nice. Also, a license file is missing. (B) No, see above. (C) No. (D) Cannot check due to (A).
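As an illustration of what such a README snippet might contain, loading a toy GEDCOM fragment takes only a few lines of standard-library Python. The individuals, tags, and record structure below are invented for the example, not taken from the authors' data:

```python
# Minimal GEDCOM loader sketch: each line is "LEVEL [XREF] TAG [VALUE]".
TOY_GEDCOM = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1900
0 @I2@ INDI
1 NAME Mary /Smith/
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
"""

def parse_gedcom(text):
    """Return {xref: [(level, tag, value), ...]} for each top-level record."""
    records, current = {}, None
    for raw in text.splitlines():
        parts = raw.strip().split(" ", 2)
        level = int(parts[0])
        if level == 0 and parts[1].startswith("@"):
            current = parts[1]                      # record id, e.g. "@I1@"
            records[current] = [(0, parts[2], "")]  # record type, e.g. INDI
        elif current is not None:
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
            records[current].append((level, tag, value))
    return records

records = parse_gedcom(TOY_GEDCOM)
# records["@I1@"][1] -> (1, 'NAME', 'John /Smith/')
```

A README could pair a sketch like this with the license and a note on which tags the released model expects.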

(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,
(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,
(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and
(D) whether the provided data artifacts are complete.
The authors propose an end-to-end QA approach for the field of genealogy, a first in the field. The authors use existing semi-structured data (RDF+full-text) and convert it into a form that is suitable for machine-reading/comprehension algorithms.

## Introduction
The introduction reads well and allows laypersons to get familiar with the problem at hand. The motivation is clear.
The example on page 2 might need revision, as the main search engines answer with a number, not a list (Google, Siri, Bing tested). Maybe use a less familiar person.
The structure of the contributions could be improved. Currently, there are two contributions, one examination contribution, and three research questions. It would be easier to also format the contributions according to the research questions in an itemized list.

## Related Work
The related work section is convincing and covers all standard literature for DNNs as well as QA. It is also a good read for beginners, as it introduces all main concepts in detail. Figures one, four, and five help to understand the standard as well as the proposed QA pipeline.
Figures two and three are misplaced in the sense that the paragraph needed to interpret them comes on the next page.
Section 2.2 could enhance cross-linking of the topic by introducing synonyms for the field of work, such as Machine Reading Comprehension or Open Domain Question Answering. It is also not clear whether the first paragraph of 2.2, which explains how artificial neural networks work in general, is needed. Given the interdomain scope of this work, it might be needed or superfluous depending on the reader.
“slow performance of DNNs” should be rephrased: it is not the slow performance of DNNs but the magnitude of comparisons to texts that would be needed if one compared all indexed texts to a query.
The numbering of the components of DNN systems is inconsistent and could be improved by introducing letters as second-level items.
The domain vocabulary could be aligned better: either use static “embeddings” and contextual “embeddings”, or static “representations” and dynamic “representations” (later “vectors”).
The description of the final layer of a DNN is rough and not correct; compare page 6, right column, and rework the paragraph describing said final layer. The authors then describe the fine-tuning process in sufficient detail on page 16; a forward pointer would stop the interested reader from wondering.
Footnote 7 misses a citation to Johnson, Jeff, Matthijs Douze, and Hervé Jégou. "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data (2019) as the state of the art in indexing vectors which is used for retrieval in DNNs.
On page 7, left column, the authors could improve the clarity on which graph is meant when talking about GNNs. In particular, the term knowledge graphs should be introduced beforehand given that the reader might be unfamiliar.
For the generation of text from a KG, recent works such as Moussallem, Diego, et al. "NABU–Multilingual Graph-Based Neural RDF Verbalizer." International Semantic Web Conference. Springer, Cham, 2020 are missing which do not need extensive training data.
Finally, while there are citations for different standard methods such as LSTM and Attention, the authors miss providing a citation on knowledge graph-based question generation.
The authors could improve the description of Figure 5 to help understand whether this is the proposed architecture. This does not become clear throughout the section.

## Method
The method chapter is easy to read and follow. The authors do a good job at describing details where needed despite some missing forward references, see below.
It is unclear why CIDOC-CRM was chosen. A discussion of why other ontologies were not used would help computer scientists understand that choice. There also seem to be a variety of RDF-based GEDCOM ontologies and vocabularies available. The modeling (Fig. 9) seems to make verbalization hard, i.e. the E67 node is there as a blank node (it would be a qualifier in Wikidata). So people could wonder why CIDOC was chosen.
There are networks like REFORMER that can take input of arbitrary length.
Please explain why parents are considered second-degree relations. Depending on the chosen ontology, e.g. DBpedia, this is different.
The formatting on page 12 makes it hard to follow the text flow. It would be better to align Figure 8 at the top of the column.
On this page, it is unclear why the algorithm stops, even after looking at the pseudocode. It would be helpful to provide an intuition here, since NQ gets n enqueued.
If it is correct that the questions were paraphrased using [46], it would be good to clarify this; if not, what is meant on page 14, left column, top, by “multiple variations…”?
Is SP’s grandfather Alexander on purpose not in Figure 9? Adding this information would make the example figure more valuable.
It would also be good to see an example of a multi-hop template in Table 1. Or does this paragraph mean that the model picks up answering multi-hop questions on its own? It surely does not.
What influence does the order of verbalized sentences have? Did you do experiments on it?
Also, an example of a WH question from the DNN would be interesting to see. Did you evaluate the quality of the generated questions? Do errors in generation (e.g. wrong grammar) influence the model?
Which BERT model did you use exactly? Can you provide a pointer to the base model, e.g. on the Hugging Face Hub?
Pre-trained static node embeddings can be used in Transformer architectures and their descendants, see He, Bin, et al. "Integrating Graph Contextualized Knowledge into Pre-trained Language Models." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 2020. Thus, it would be good to rephrase the hint on page 16 right column.
Figure 11 has “Selecting a model” on an arc. What exactly is a model in this context?
Overall, the last paragraph describes the way a user would use the system. Initially, one could think the proposed approach also selects the correct GEDCOM tree from a database of trees but apparently, the user does. Thus, the system has an easier task in terms of retrieval than normal KG QA systems. It would be good to clarify that at the beginning of the section.

## Experimental design
The experimental design section is also well written and easy to follow. The only unclear part is the max-sequence-token explanation. It would be good for non-experts if the authors could explain with an example how this was handled: in particular, how the window was used if the question or the answer span fell outside the window of choice, and how that interacts with the learning.
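For concreteness, the standard SQuAD-style workaround splits an over-long context into overlapping windows with a stride and treats windows that do not fully contain the gold answer span as unanswerable during training. A minimal sketch of that bookkeeping, with all lengths and token positions invented for the example:

```python
# Sliding-window chunking sketch for a QA context longer than the model's
# maximum sequence length. Windows that miss the answer span are labelled
# "no answer"; the others carry the span shifted to window coordinates.
def make_windows(n_tokens, max_len, stride):
    """Yield (start, end) token windows covering the whole context."""
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        yield (start, end)
        if end == n_tokens:
            break
        start += stride

def label_window(window, answer_start, answer_end):
    """Return the in-window answer span, or None if it falls outside."""
    ws, we = window
    if answer_start >= ws and answer_end <= we:
        return (answer_start - ws, answer_end - ws)
    return None

# Context of 700 tokens, window of 384, stride of 128; answer at tokens 500..510.
windows = list(make_windows(700, 384, 128))
labels = [label_window(w, 500, 510) for w in windows]
```

Here the first window cannot see the answer and would be trained as unanswerable, while later windows carry the shifted span; at inference time, predictions from all windows are typically merged by score.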

## Results
The results section is quite easy to understand and up to par. An ablation study was performed on the input parameter (degree).
There are two questions left:
How does the system deal with a question about knowledge that does not exist in the tree?
For place questions, can it be that the Uncle-BERT_2 model always tries to pick relations that are two hops away due to its training, and thus Uncle-BERT_1 can find these one-hop-away places? This could be computed from the answers selected by the models and how far away they are, in a simple table.
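The table suggested here only needs a hop-distance computation between the question's focus person and each model's selected answer. A minimal breadth-first-search sketch over a toy relation graph (the names and edges are invented for illustration):

```python
# Hop-distance sketch for auditing how far a model's selected answer lies
# from the question's focus node in the family graph.
from collections import deque

# Toy undirected relation graph (invented for the example).
EDGES = {
    "SP": ["father", "mother"],
    "father": ["SP", "grandfather"],
    "mother": ["SP"],
    "grandfather": ["father"],
}

def hops(graph, source, target):
    """Breadth-first search: number of relation hops from source to target."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target unreachable from source

# e.g. hops(EDGES, "SP", "grandfather") -> 2
```

Tabulating these distances per model and question category would directly test whether Uncle-BERT_2 systematically prefers two-hop answers.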

Minor Issues
arXiv citations: Some arXiv citations used by the authors (e.g. [92]) have meanwhile been published in peer-reviewed journals. Thus, the suggestion is to use these references via a tool such as
“Training of the DNN” => “Training of a DNN”
“While using DNNs for the open-domain question answering task has” => “have”
Citations [112] to [115] seem to be out of order
The claim “optimal DNN-based question answering pipeline” should be revisited.

Review #3
By Isaiah Onando Mulang' submitted on 13/Sep/2021
Minor Revision
Review Comment:

The paper addresses a very significant research topic in an emerging domain. Question answering is a well-studied research area in computing, with variants including Machine Reading Comprehension (MRC; the SQuAD dataset), QA on Knowledge Graphs (e.g. the QALD series of tasks), and classical IR QA. However, the authors point out challenges in the genealogical domain that render QA in this domain difficult and in need of specific methods. The extra challenge of the lack of a dataset in the domain is also tackled in the paper.

Strengths of the Paper
An approach to create a genealogical dataset for question answering.
An approach to answer questions in the genealogical domain.
Evaluation of the method based on fine-tuned BERT and SOTA QA methods.
The paper is well written and easy to follow; the problem and challenges are well explained.

Relatively weak approach: although the data generation aspect of the paper is prominent, fine-tuning of BERT and related architectures is quite a well-experimented task in NLP, effectively rendering the core aspect/model of the paper (Uncle-BERT) weak.
The finding that the fine-tuned Uncle-BERT2 model outperforms the rest of the open-domain models is expected, as fine-tuning introduces signals from the specific domain. I understand the relevance of this statement in relation to the need to have models specifically trained on the domain data, but it is a tautology.
“The proposed methodology reduces complexity while increasing accuracy and may have practical implications for genealogical research and real-world projects, making genealogical data accessible to experts as well as the general public.” It is not clear to me which part of the evaluation substantiates the first part of this claim; the second part of the claim is well substantiated.

Pointers and Recommendations
Since the approach uses heterogeneous sources of information, including a graph-structured source, there are already existing graph-NN-based techniques that the authors can borrow from, e.g.:
Nadgeri A., et al.,
Xu P., et al.,

Since extra contextual information is used on transformers, the following works may add extra insights to the paper:
Mulang’ et al.,
Yamada I. et al.,