Bilingual dictionary generation and enrichment via graph exploration

Tracking #: 2711-3925

Authors: 
Shashwat Goel
Jorge Gracia
Mikel Lorenzo Forcada

Responsible editor: 
Guest Editors, Advancements in Linguistic Linked Data 2021

Submission type: 
Full Paper
Abstract: 
In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. This is allowing the growth and application of openly available linguistic knowledge graphs. In this work we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle density based method: partitioning the graph into biconnected components for a speedup, and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. On average over twenty-seven language pairs, our algorithm produces dictionaries about 70% the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for enrichment are correct as well. We further describe an interesting use-case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software which can not only be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 01/May/2021
Suggestion:
Minor Revision
Review Comment:

This paper presents some optimizations of a method for inferring bilingual translations based on finding cycles in the translation graph. It clearly relates to the existing TIAD task, which has attracted interest for a number of years, so algorithmic improvements here are welcome. The authors also motivate the task well and give a number of use cases. The first main contribution appears to be an increase in the efficiency of the algorithm using biconnected components. This is a nice technique: since no cycle can span different biconnected components, the algorithm can proceed by divide-and-conquer. As the authors note, for their data this leaves only 7 very large components to compute. I do wonder if a more aggressive approach could further reduce the task's complexity, by looking for components connected by more than one but still few nodes. I also think that an example with a nice diagram would explain this better.
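
To make the divide-and-conquer idea concrete, here is a minimal, self-contained sketch (the toy graph and word labels are invented for illustration, not taken from the paper's data): maximal biconnected components can be computed in O(|V| + |E|) with a Hopcroft-Tarjan DFS, and since every cycle lies entirely inside one component, cycle search can then run independently per component.

```python
from collections import defaultdict

def biconnected_components(adj):
    """Maximal biconnected components of an undirected simple graph,
    via the Hopcroft-Tarjan DFS, in O(|V| + |E|) time."""
    disc, low, timer = {}, {}, [0]
    edge_stack, comps = [], []

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:                  # tree edge
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:          # u separates v's subtree
                    comp = set()
                    while True:
                        e = edge_stack.pop()
                        comp.update(e)
                        if e == (u, v):
                            break
                    comps.append(comp)
            elif disc[v] < disc[u]:            # back edge
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return comps

# Toy translation graph: nodes are language-tagged words, edges translations.
edges = [("en:house", "es:casa"), ("es:casa", "ca:casa"), ("ca:casa", "en:house"),
         ("en:house", "fr:maison"), ("fr:maison", "oc:ostal"), ("oc:ostal", "en:house")]
adj = defaultdict(list)
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

comps = biconnected_components(adj)
# Two triangles that overlap only at the cut vertex "en:house"; any cycle
# lies wholly inside one of them, so each can be searched independently.
assert len(comps) == 2
assert comps[0] & comps[1] == {"en:house"}
```

The two triangles share only the cut vertex, which is exactly why no cycle can cross the component boundary.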

The second contribution is an exploration of the hyperparameters that provides some nice insight into this task, and the authors show the effect this has on the results. I do miss a comparison to other approaches, especially the baselines used in the TIAD task such as OTIC, to properly show the value of this approach.

Finally, the authors discuss the evaluation and introduce the use of BWP, arguing its advantages over standard precision in dealing with the incompleteness of the gold standard. The discussion is valid and may be of interest to others in this area. I would however note that our previous work on this task already used BWP and simply called it 'precision', so I am not sure how new this contribution is.
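
For readers unfamiliar with the metric, a minimal sketch of both-word precision, assuming the definition suggested by the review context (only predicted pairs whose source and target words both appear somewhere in the gold dictionary are judged); the toy data is invented and the paper's exact formulation may differ:

```python
# Toy gold dictionary and predictions; all pairs invented for illustration.
gold = {("en:tall", "es:alto"), ("en:big", "es:grande")}
predicted = [("en:tall", "es:alto"),   # correct
             ("en:big", "es:alto"),    # wrong, but both words occur in gold
             ("en:huge", "es:enorme")] # neither word occurs in gold

gold_words = {w for pair in gold for w in pair}

# Vanilla precision penalises the out-of-gold pair even though the gold
# data says nothing about "en:huge": 1 correct out of 3 judged.
vanilla = sum(p in gold for p in predicted) / len(predicted)

# Both-word precision judges only pairs whose two words are both attested
# in the gold data: 1 correct out of 2 judged.
judgeable = [(s, t) for s, t in predicted if s in gold_words and t in gold_words]
bwp = sum(p in gold for p in judgeable) / len(judgeable)

assert abs(vanilla - 1/3) < 1e-12 and bwp == 0.5
```

The point of the restriction is that an incomplete gold standard stops counting against the system for words it simply does not cover.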

Minor
p1l42. capitalise 'Resource Description Framework'
p2l5. also 'Web Ontology Language'
p23l6 'Apertium'

Review #2
By Basil Ell submitted on 20/Jun/2021
Suggestion:
Major Revision
Review Comment:

The authors present an approach that exploits a graph created from bilingual dictionaries to infer translations, and thus to create new dictionaries or enrich existing ones. Furthermore, the approach enables the inference of synonyms. The approach is built around the detection of cycles. A couple of properties of translation graphs are identified that enable the graph to be partitioned.

Generating new and enriching existing bilingual dictionaries is an interesting problem and clearly has practical applications. The authors show in detail how they build upon the approach by Villegas et al. The first contribution is that instead of identifying cycles in the entire translation graph, which is computationally hard, they propose to partition the translation graph into biconnected components that can be processed individually without missing any cycle / any translation. As a second contribution, they propose to reduce the translation graph by removing those edges that connect words with different parts of speech. The authors provide empirical evidence that this reduction can lead to smaller biconnected components.
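
The part-of-speech reduction can be sketched in a few lines (toy graph, invented words and tags; the paper's actual pipeline may differ): dropping cross-POS edges can split one large component into smaller pieces, each of which is then a much smaller biconnected component for the cycle search.

```python
from collections import defaultdict, deque

# Toy graph: two translation triangles (nouns, verbs) glued together by
# two noisy cross-POS edges. Words and tags are invented for illustration.
pos = {"en:walk/n": "noun", "es:paseo": "noun", "fr:promenade": "noun",
       "en:walk/v": "verb", "es:pasear": "verb", "fr:promener": "verb"}
edges = [("en:walk/n", "es:paseo"), ("es:paseo", "fr:promenade"),
         ("fr:promenade", "en:walk/n"),
         ("en:walk/v", "es:pasear"), ("es:pasear", "fr:promener"),
         ("fr:promener", "en:walk/v"),
         ("es:paseo", "en:walk/v"), ("fr:promenade", "es:pasear")]

def components(edge_list):
    """Connected components of an undirected graph, by BFS."""
    adj = defaultdict(set)
    for a, b in edge_list:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        while queue:
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Before the reduction: everything is glued into one 6-node component.
assert [len(c) for c in components(edges)] == [6]

# The reduction: keep only edges whose endpoints share a POS tag.
reduced = [(a, b) for a, b in edges if pos[a] == pos[b]]
# Afterwards the graph falls apart into two 3-node triangles.
assert sorted(len(c) for c in components(reduced)) == [3, 3]
```

Connected components are used here only as an upper bound: a biconnected component can never be larger than the connected component containing it, so splitting the latter also shrinks the former.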

While the technical contribution regarding the extension of the approach by Villegas et al. might seem a bit small for a 24-page paper, the authors provide several contributions beyond those mentioned so far. They describe the hyperparameters of the original approach and of their approach in detail and reduce the number of parameters actually necessary. I really like that the paper contains this discussion. A couple of carefully-crafted experiments have been carried out, for which new metrics were introduced. The results are discussed in sufficient detail and show the strengths of the proposed approach, which appears to run on a large dataset in under a minute, and show that held-out bilingual dictionaries can be retained to a large extent - both the speed and the quality are impressive. The evaluation nicely compares BWP and vanilla recall and discusses how evaluation schemes based on precision could grossly underestimate performance. All parameters of the experiments were clearly described. The authors took great care to make the evaluation of the recall meaningful, by making use also of externally existing data and human assessment. Furthermore, the comparison between the cycle density and the transitive baselines was very interesting. Code and data are available online and are well documented. The quality of writing is fine.

I have a couple of comments, ordered according to where they first occur in the paper.

Major comments:

C7. p3, column 2, section 2.1. Given the 44 languages and 53 language pairs, for how many language pairs would a pivot-based method be applicable? It would be nice to compare against the approach the authors have extended without the extensions.

C8. p4, column 1, lines 12ff. If CQC can be used to validate a dictionary, then won't it be possible to use it in the context of dictionary building, too? This needs to be explained better.

C9. p4, column 1, lines 31ff. I find this paragraph very confusing. E.g., "Both approaches" refers to SenseUniformPaths and the approach by Villegas? What are ambiguous cycles? The differences between cycle density and graph density are not clear.

C11. p5, column 1, line 23ff. Even if part-of-speech information is not taken into account, wouldn't it be simple to take into account part-of-speech information in a post-processing step? Similarly, in your approach you make use of POS tags in a post-processing step (Section 5.2).

C12. p5, column 1, line 45. I wonder why the translation graph is defined as an undirected graph. I am not an expert in translation; however, I expected the graph to be directed. If, given a term in one language, no corresponding term exists in the other language, then only the edge pointing to the more general term would make sense? If the translation graph is commonly defined as an undirected graph, then leave it as it is. If it is a simplification, then the advantages and limitations need to be discussed.

C13. p5, column 2, line 1. Given the tuple (rep,lang,pos). Since pos does not seem to be a sequence of POS tags but instead a single POS tag, does this mean that rep is a single token? Thus, cross-language correspondences between multi-word expressions cannot be expressed in / be derived from the translation graph? I believe that defining pos as a sequence would not cause problems to the proposed approach. Please clarify.

C14. p6, column 1, end of section 4.1. It would be helpful to have examples here. Moreover, I wonder whether one either wants to extend a translation graph wherever possible, or whether one wants to focus on a specific language pair, and thus try to find cycles that begin with one language and lead to the other language. Why I am confused: according to p5, col 2, lines 18ff the aim is to enrich the graph, whereas on p6, col 2, line 35 a target language pair is mentioned.

C15. It is not clear to me why the approach is based on cycles at all. For example, why not identify quasi-cliques? Cycle density would then be the density of the γ-quasi-cliques. What about differences in complexities of identification of γ-quasi-cliques and identification of cycles?

C23. p10, column 2, Implementation. It would be nice to have an overview of the entire approach, e.g., in the form of pseudocode. Information about configuration files and command line parameters is of little relevance and could be omitted from the paper. The implementation described does not mention the filtering based on POS tags etc. and does not explain how synonyms are inferred as opposed to translations.

C24. p11, column 2, Figure 3. Before I saw the figure I believed that I understood what a biconnected component is created from, namely, nodes representing words and edges representing translations. However, the graph shown has languages as nodes. Is this the graph that one obtains if every node that represents a (word, language) tuple is replaced by the language? Please clarify.

C29. p19, Figure 19. I really do not understand the Figure or its description in Section 10.3. Please clarify.

Minor comments:

C1. p1, abstract. I do not see how "in recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web" actually "is allowing the growth and application of openly available linguistic knowledge graphs". There is a disconnect.

C2. p1, abstract. Besides the percentage values, it would be interesting to also provide absolute numbers, e.g., how many new correspondences were found. It would be fine not to mention them in the abstract, but the actual numbers could be clearly stated elsewhere in the paper.

C3. p2, running title. The title "Bilingual dictionary enrichment via graphs" does not describe the content of the paper very well.

C4. p2, column 1, line 9. "thorough" -> "through"

C5. p2, column 2, example in line 24ff. It would be nice to show this as a graph. This will also show the reader that the graph of correspondences is undirected, which I do not find intuitive.

C6. p2, column 2, line 45. I have an idea what you mean with "previously empirical ones", but this could probably be reformulated.

C10. p4, column 2, line 7. According to Figure 1, for Sardinian and Italian it seems like Catalan is a pivot language. Then, OTIC would be applicable?

C16. p6, column 2, lines 5ff. Although some of the constraints are rather intuitive, it would be nice for example for constraint c to have an explanation, or even a (counter)example.

C17. p7, column 1, lines 1ff. It should be "different maximal biconnected components can only share cut vertices" instead of "different biconnected components can only share cut vertices". Otherwise, if one component is a sub-component of the other component, then none of the shared vertices would be a cut vertex. Later in the text, e.g., p7, col 1, line 32, you should also write "maximal biconnected components" instead of "biconnected components", because otherwise, given a non-maximal biconnected component, wouldn't you miss some cycles?

C18. p7, column 1, line 6. "O(V + E)" should be "O(|V| + |E|)"

C19. p7, column 1, lines 25/26. the comma in "\forall v \in V(C), v \in V(B)" should most likely be a ":".

C20. p7, column 2, Table 1 - the table is not referenced or discussed in the text.

C21. p7, column 2, Table 3 - size means number of edges?

C22. p8, column 1, Section 5.3. Make clear how the biconnected components are used, e.g., how to create context graphs from a set of maximal biconnected components given a source word. I believe the paragraph should begin with something like "Within a biconnected component we ..."

C25. p15, column 1, line 40. Consider changing properNoun to proper noun. This appears a couple of times in the text and in figures.

C26. p15, column 1, line 44. "often encode grammatical changes instead of semantic similarity". Could you explain this further? How does this result in high precision for open POS etc.?

C27. p16, column 2, line 47. It would be interesting to mention how the accuracy metrics differed across the 27 language pairs.

C28. p16, column 1, Table 5. Usually, one would set the best values per column in bold. Also: Table 6.

C30. p22, column 2, line 9. "It makes the most of" sounds a bit exaggerated.

C31. p22, column 2, line 35, "amount extra translations" -> "amount of extra translations"

C32. p22, column 1, lines 20/21. "relying purely on translation graphs over documents". probably, "over documents" can be removed? then, it would be more clear.

C33. p22, column 1, lines 23ff. This can be misleading to the reader. Of course the approach does not require training time, because no model is trained. However, cycle finding still takes time.

Review #3
Anonymous submitted on 14/Jul/2021
Suggestion:
Accept
Review Comment:

This paper is very detailed and proposes the generation and enrichment of bilingual dictionaries through a cycle-density-based method that expands a graph of bilingual translations and infers new translations. Furthermore, the authors propose to complement the evaluation metrics for translation inference with both-word precision and both-word recall.
With reference to the cycle density method, some modifications have been proposed to improve time performance and scalability to larger graphs. In particular, among the hyperparameters proposed in the original cycle density method, some have been discarded, as they do not lead to better results. The retained hyperparameters include maximum cycle length, context depth, target degree multiplier, transitivity, and confidence threshold.
The proposed method has been tested on both the generation and the enrichment of bilingual dictionaries, using the Apertium RDF graph for 11 language pairs across 6 languages, i.e., English, Catalan, Spanish, Esperanto, French, and Occitan.
The methodology is sound, and the experimental part is clear and well presented, guaranteeing its reproducibility.
The work is solid and devoted not only to adapting the original method for dictionary generation and enrichment, but also to proposing a more suitable evaluation framework, which may be relevant for similar scenarios.