Machine Translation Using Semantic Web Technologies: A Survey

Tracking #: 1694-2906

Diego Moussallem
Axel-Cyrille Ngonga Ngomo
Matthias Wauer

Responsible editor: 
Philipp Cimiano

Submission type: 
Survey Article
Abstract:
A large number of machine translation approaches have been developed recently with the aim of migrating content easily across languages. However, the literature suggests that many semantic boundaries have to be dealt with to achieve better automatic translations. A central issue that machine translation systems must handle is ambiguity. A promising way of overcoming this problem is using semantic web technologies. This article presents the results of a systematic review of approaches that rely on semantic web technologies within machine translation approaches. Overall, our survey suggests that while semantic web technologies can enhance the quality of machine translation outputs for various problems, the combination of both is still in its infancy.

Reject (Two Strikes)

Solicited Reviews:
Review #1
By John McCrae submitted on 09/Sep/2017
Major Revision
Review Comment:

This paper presents a survey of Semantic Web technologies applied to machine translation, and while the criticism that this paper is of little interest to MT researchers still holds, this version does a better job of explaining and motivating the use of MT techniques to a Semantic Web audience, which is more appropriate given that it was submitted to the Semantic Web Journal. Unfortunately, the authors still make some statements with regard to machine translation that seem to come from a lack of understanding of the problem and are not given sufficient evidence.

The coverage of machine translation techniques still devotes a lot of space to rule-based machine translation and 'corpus-based machine translation', a term the authors seem to use to group mainstream statistical machine translation together with example-based machine translation, an interesting but significantly less popular framework. Moreover, the section on neural machine translation is too short and quite dismissive: "Despite some researchers claiming that NN may solve practically all open problems of MT, it is in its infancy". An odd statement when, for example, nearly all papers on machine translation at ACL used NMT.

"Although RBMT approaches are still currently used due to the difficulty of finding bilingual corpora for some languages": This statement seems to suggest that researchers building MT for under-resourced languages prefer RBMT, because spending hundreds of hours developing rules is easier than using widely available parallel or comparable texts. There is no citation to support this statement.

"Currently, hybrid approaches achieve better results than CBMT": This has no citation. In fact, NMT techniques seem to be winning major evaluations such as WMT.

"reordering a sentence from Chinese to English is one of the most challenging techniques because Chinese sentences do not contain spaces between their characters": I fail to see the connection. High-accuracy tokenization for Chinese is widely available, and reordering generally depends on more detailed analysis such as parsing. Moreover, it has been observed for Japanese (also written without spaces) that a simple reversal of all characters can improve translation into English.

"Tense generation": I think the authors seem to be using this term for all morphological issues (I guess because in their native languages most of morphology is related to verb tenses).

"implementations of NMT for non-European target languages are still missing due to the lack of large bilingual data sets on the Web": Again, the authors seem to have the impression that parallel corpora are entirely missing for many languages. It is also unclear why this is a criticism of NMT in particular.

"Lack of well-defined object properties", "the lack of object properties, like reflexiveness, may limit reasoner in its ability to support translations": Firstly, this is quite confusing, as the authors mean what OWL calls "property characteristics", while most users would be more familiar with "object properties" in another meaning (i.e., as opposed to "datatype properties"). Moreover, it is simply not clear why this should affect translation, as there are still very few examples of authors using OWL (or similar) reasoning in NLP systems.

"we suggest recognizing such entities before the translation process, then translating the proper nouns and including them in the training phase": It is not clear what is actually being suggested. It is not clear why Semantic Web technologies would have access to many more named entities than a corpus-based approach, nor whether these are actually more reliable.

A couple of suggestions to improve the paper would be to include recent work on character-based machine translation or translation of sub-word units [1,2,3], as these can handle some of the issues mentioned in this paper. Moreover, a discussion of semantic evaluation methods for MT would be very useful, e.g., MEANT [4].
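To illustrate the sub-word suggestion: the core of the approach in [2] is a byte-pair-encoding loop that repeatedly merges the most frequent adjacent symbol pair. A minimal sketch follows (the toy vocabulary is illustrative, not taken from the paper):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words segmented into characters, with an end-of-word marker </w>
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):  # learn 10 merge operations
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
```

After these merges, frequent words such as "newest" collapse into single units while rarer words stay split into sub-word pieces, which is what allows an NMT system to cope with an open vocabulary.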

Minor issues:
Sec 4.2, Why is "MT Method" etc. in fixed width font?
p12. "model.Therefore"
Citation style is inconsistent, e.g., "Jinhua Du" vs. "J.P. McCrae" vs. "McCrae and Cimiano"
Māori has a macron (a bar) over the a, not an umlaut.

[1] Luong, M., and Manning, C. D. Achieving open vocabulary neural machine translation with hybrid
word-character models. CoRR abs/1604.00788 (2016).
[2] Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword
units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016).
[3] Chung, J., Cho, K., and Bengio, Y. A character-level decoder without explicit segmentation for
neural machine translation. CoRR abs/1603.06147 (2016).
[4] Lo, C., Dowling, P. C., and Wu, D. Improving evaluation and optimization of MT systems against MEANT. In Proceedings of the 10th Workshop on Statistical Machine Translation (at EMNLP 2015), 434-441.

Review #2
Anonymous submitted on 25/Sep/2017
Review Comment:

This is a well-written, comprehensive overview of papers on the use of semantic web technologies (SWT) for machine translation (MT). The paper is a bit mechanical: the authors describe how they found relevant papers by web searches and then go through each paper and summarize it. What is lacking is a deeper understanding of the current problems of state-of-the-art machine translation systems and how semantic web technologies can help to overcome them.

Only Section 4.3 shows some in-depth analysis of how SWT can contribute to MT. The authors acknowledge that most of the work in this area is either very preliminary or not very successful. SWT and MT are faced with the same problem: language is inherently ambiguous, from the lexicon and morphosyntactic structure all the way to discourse-level pragmatics. Identifying concepts in a knowledge base or finding the right translation for a word are instances of the same word sense disambiguation problem. So, rather than dogmatically referring to SWT and MT as monolithic technologies, what is needed is a deeper understanding of how pieces of information obtained from ontologies and other knowledge bases on the one hand, and from parallel and monolingual corpora on the other, can contribute to solving these problems of ambiguity.

More detailed comments:

It is not clear to me whether SPARQL queries are anything more than access to a bilingual dictionary, which is a form of parallel data commonly used in MT.
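To make this point concrete: a typical translation lookup over a resource such as DBpedia amounts to matching a source-language rdfs:label and returning the target-language one. The sketch below evaluates that pattern over a hypothetical in-memory set of triples (the entries and the function name are illustrative, not from the paper):

```python
# The SPARQL query one would send to an endpoint such as DBpedia:
SPARQL_TEMPLATE = """
SELECT ?target WHERE {
  ?s rdfs:label "%s"@%s ;
     rdfs:label ?target .
  FILTER (lang(?target) = "%s")
}
"""

# Simulated triples (subject, rdfs:label literal, language tag), standing in
# for a real endpoint; the entries are illustrative.
TRIPLES = [
    ("dbr:Cheese", "cheese", "en"),
    ("dbr:Cheese", "Käse", "de"),
    ("dbr:Moon", "moon", "en"),
    ("dbr:Moon", "Mond", "de"),
]

def translate(term, source_lang, target_lang):
    """Evaluate the query pattern above: find subjects whose label matches
    the source term, then return their target-language labels."""
    subjects = {s for s, lit, lang in TRIPLES
                if lit == term and lang == source_lang}
    return [lit for s, lit, lang in TRIPLES
            if s in subjects and lang == target_lang]

print(translate("cheese", "en", "de"))  # ['Käse'] - a plain dictionary lookup
```

The result is functionally indistinguishable from a bilingual-dictionary lookup, which is the question the authors should address.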

Page 18 states that NMT successfully deals with tense ambiguity. What is the evidence that NMT has "dealt" with it, or it at least performs better than other statistical MT methods?

Right afterwards comes the statement "implementations for NMT for non-European target languages are still missing due to the lack of bilingual data sets", which is simply false, both in terms of premise and conclusion.

Page 20 states: "Syntactic analysis is the most important task in MT" - which is a somewhat dubious statement, and then proceeds to state "None of the current MT approaches use external information about the word in the text", although there is a rich body of literature that uses data annotated with morphological and syntactic analysis. In fact the state of the art in pre-NMT machine translation for Chinese and German are syntax-based statistical models that use such external information.

Review #3
Anonymous submitted on 19/Oct/2017
Major Revision
Review Comment:

In this second version of the paper, the authors have made an effort to address the issues raised in the review, especially regarding the organization of the paper and the explanations and details given at different stages. I find that section 3 has improved a lot, in the sense that MT approaches are systematically explained and MT challenges are nicely presented. Section 4 has also improved in the level of detail given for each study and the way the studies have been classified and grouped. However, I still have two major concerns with this paper: section 2.1, the formulation of the research questions; and section 4.3, the one called Suggestions and Directions.

As for the research questions, the use the authors make of the terms “Semantic Web Technologies”, “LD”, “ontological knowledge”, and “LD driven tools” is still very unclear to me. As the survey is trying to answer these questions, these terms should be clearly defined at this stage of the paper.

It is only in section 4.2 that they (again, very briefly) refer to “SW methods” and define them as SWT (Semantic Web Technologies) "used to extract the knowledge contained in a given SW resource, for instance, semantic annotation techniques, SPARQL queries and reasoning" (quoting). I believe that if the purpose of the paper is to answer the question “How can SWT enhance MT quality?”, you cannot simply mention in passing what you mean by SWT just before describing the different studies in detail. In fact, I do not see the interest in starting section 4 (Applying Semantic Web Technologies) with a detailed explanation of the metrics used in MT.

As for the “Resource” used in each study, I have the same concern: I am not sure the classification they provide there is suitable for such a survey. Why is it important to differentiate between ontologies and LOD? What are BabelNet or DBpedia according to that classification? I am still very confused by the introduction to this section.

Regarding section 4.3, I think most information should come earlier in the paper, since it is crucial to understand the problems involved in MT and why SW resources are perceived as solving some of these issues (in fact, some information is repeated in several sections of the paper). Table 4 also needs to be explained in the text. Four variables are mentioned there that have not been appropriately defined in the paper. Also, how do the issues detailed in section 4.3 (disambiguation, named entities, non-standard speech) combine with or affect the variables in table 4?

I also believe that some of the statements formulated by the authors in this section are too strong or at least not appropriately formulated (p. 17: “SW community has worked out basic suggestions for generating structured data. (…) Due to this lack of defined standards, some research works have produced erroneous ontologies or LOD repositories”). In any case, they should provide evidence.

Some statements need to be proven, or at least some evidence has to be given to support them (p. 17: “Therefore, we suggest including SW resources for the WSD task during the translation phase. Our suggestion aims to produce good results without the necessity of editing techniques either before or after a translation of common words”).

Finally, I would also suggest that the authors follow a similar pattern when describing the different studies, since some contain too many details and others too few, and the order in which the different aspects of a study are presented is sometimes confusing. I would also suggest that they review the abstract; I do not think it mirrors the content of the paper.