A Chinese Linked Open Data Model Based on Data Field and Sequence Alignment Theory

Tracking #: 1390-2602

Ting Wang

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
Knowledge described in literal Chinese is an important part of the online knowledge base. However, there is still no efficient system for large-scale Chinese ontology mapping. To improve the large-scale Chinese ontology mapping in linked open data (LOD), we propose a data field and sequence-alignment-based ontology mapping architecture. First, based on improved nuclear field potential, we compress the dimension of unaligned large-scale Chinese ontology. Second, we use the sequence alignment algorithm to compute similarity between the concepts. Finally, we compare our system with other typical similarity-computing algorithms, and against the mapping of the three Chinese Wikipedia knowledge bases (namely, Baidu Baike, Hudong Baike and Chinese Wikipedia), the precision ratio of the method proposed in this study is approximately 30% higher than that of the TongyiciCilin (TYCCL) algorithm on average, and its recall ratio is approximately 10% higher than that of the edit distance algorithm on average. The overall performance (F1-measure) values are 16% and 6% higher than the performance of the TYCCL and edit distance algorithms on average, respectively. The results of this study may provide useful data for promoting Chinese knowledge sharing and improving the precision and recall ratios of semantic query systems.

Solicited Reviews:
Review #1
Anonymous submitted on 26/Jun/2016
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper proposed a framework to match large-scale Chinese ontologies. It is based on an existing synonym dictionary TYCCL. The framework uses a data field theory to compress ontologies and a sequence alignment algorithm to match concepts.

I found the introduction too long, and a mix of introduction and related work. I started to have problems once the definitions were given. As the author noted, Chinese ontologies have simple atomic concepts and component concepts. It seems natural to me that there would be cases of 1:n, n:1 or m:n mappings. However, Definition 1 only considers 1:1 mappings and the author never mentions other possibilities. I also believe there exist concepts which cannot be matched because of differences in ontology design and in the conceptual spaces behind them. In Definition 1, the universal restriction is too strong on C_s and not correct for C_t.

I also find edit distance problematic for Chinese. Most Chinese words consist of two or more Chinese characters. Individually, these characters normally do not have concrete meanings. Only in combination can the semantics of the words be measured. I understand the power of edit distance in detecting string similarity in Western alphabet-based systems but doubt its applicability to Chinese. Sometimes changing one Chinese character in a Chinese word makes the meaning completely different, while it is still possible to keep the main semantics even after multiple operations on the word itself.
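To illustrate this concern, here is a minimal sketch of a character-level Levenshtein distance. The length normalization and the example words are assumptions chosen for illustration; they are not taken from the manuscript.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Illustrative words (not from the paper):
# "工业" (industry) vs "农业" (agriculture): different meaning, one shared character.
print(edit_similarity("工业", "农业"))
# "计算机" vs "电脑" (both "computer"): same meaning, no shared character.
print(edit_similarity("计算机", "电脑"))
```

Under these assumptions the unrelated pair scores 0.5 and the synonym pair scores 0.0, which is the mismatch between string similarity and meaning described above.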

It seems that only atomic concepts could be covered by TYCCL. So the similarity of TYCCL (section 3.1.2) is only applicable on atomic concepts. Maybe the edit distance should be adapted to count operations on atomic components of component concepts, instead of the basic Chinese characters. This seems to be the case in section 3.3.2.

What is code distance? Why does a random lambda value between 0.4 and 0.5 meet the adjusting requirement?

I am not familiar with data field theory, and I am not sure that I understood the motivation. It seems that the ontologies need to be compressed before the matching because they are large scale. Algorithm 1 on page 4 already performs the pair-wise similarity calculation between all concepts of the source and the target ontologies. If the scale is not a problem for such a computation, why is the compression necessary?

Again, I am not familiar with the sequence alignment algorithm. In your example, why were “工业革命” and “世界大战” chosen, but not “工业”, “革命”, “世界”, and “大战”? They should be atomic words in TYCCL. In Algorithm 2, what is the “optimal alignment path of sequence through recurrence”? The author said that “if there is more than one optimal alignment path, then choose one”. What criteria are used here? A random choice? The calculations given in the two examples are not clear. I don’t see where those numbers are from.
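For readers in the same position, here is a minimal sketch of Needleman-Wunsch global alignment over word tokens. The scoring values (match +1, mismatch -1, gap -1), the token segmentation and the tie-breaking order are all assumptions made for illustration; the manuscript's actual choices are exactly what the questions above ask the author to state.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences s and t; returns (score, aligned pairs)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Recurrence: best of substitution, gap in t, gap in s.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback with a fixed preference (diagonal, then up, then left),
    # so that ties between optimal paths are resolved deterministically.
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned.append((s[i - 1], t[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((s[i - 1], None))
            i -= 1
        else:
            aligned.append((None, t[j - 1]))
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


# Hypothetical segmentation into atomic words:
total, pairs = needleman_wunsch(["第一次", "世界", "大战"], ["世界", "大战"])
print(total, pairs)
```

With a fixed traceback preference the returned path is reproducible. If ties are instead broken at random, repeated runs may return different alignments with the same score.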

About evaluation, it is not clear how the evaluation dataset was constructed. Was there a gold standard? How were the precision and recall measured? Is it based on sampling or thorough manual efforts? This is important to assess the significance of the results.
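For reference, precision, recall and F1 are only well defined once a gold-standard alignment is fixed. Here is a minimal sketch, assuming mappings are represented as (source, target) pairs; the concept identifiers are hypothetical.

```python
def evaluate_alignment(found, gold):
    """Precision, recall and F1 of a found alignment against a gold standard."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical concept identifiers, for illustration only.
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
print(evaluate_alignment(found, gold))
```

If the gold standard comes from sampling rather than from exhaustive manual annotation, recall computed this way covers only the sampled pairs, which is why the construction of the dataset matters.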

There are a lot of mathematical definitions and formulae in the paper, but many of them do not have sufficient explanation. For example, what are A and B in formula 10? I don’t understand what B_xy is.

This paper was poorly written. The main problem is the English. There are quite a lot of typos and grammar mistakes. I had to go over many sentences multiple times to understand them or guess what the author meant to say. There are mistakes in self-references, for example, on page 6, “the random number m given in section 4.1.3” should be section 3.1.3, and the “definitions given in Section 3” should be section 2.

In summary, I don’t think the paper is up to the journal standard and hence recommend rejecting it.

Review #2
Anonymous submitted on 11/Aug/2016
Major Revision
Review Comment:

This paper proposes an ontology mapping approach which uses data field and sequence-alignment techniques. The proposed approach contains three main components: similarity computation, ontology compression and sequence alignment. The most important feature of the proposed approach is its ability to map Chinese ontologies. Generally, the proposed approach is well introduced, with a lot of examples to explain related concepts and algorithms.

This paper actually proposes an ontology mapping approach, but the title of this paper tells us this work is about “A Chinese Linked Open Data Model”; I don’t think it is a good title for this paper.

In section 1, the authors mention many ontology matching approaches. These approaches are just briefly introduced one by one; I think the authors should give some comments on them or briefly summarize these approaches. Putting the review of all the related work in Section 1 is not appropriate; I suggest that the authors add a new section on related work.

Similarities based on edit distance and on a Chinese thesaurus, TYCCL, are computed and combined by choosing the larger one. These two similarity metrics are really simple, and similar metrics have already been used by other ontology mapping systems. Why is the correlation factor set to a random number between 0.4 and 0.5 for the TYCCL-based similarity computation? There is no explanation of this. Setting a parameter to a random value is quite strange.
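The reproducibility problem with a random factor can be seen in a small sketch. The scaling `lam * base_sim` below is only a placeholder for the manuscript's TYCCL formula, which is not reproduced here; the function names are hypothetical.

```python
import random


def combined_similarity(edit_sim, thesaurus_sim):
    """Combine the two scores by keeping the larger one, as the paper describes."""
    return max(edit_sim, thesaurus_sim)


def scaled_thesaurus_similarity(base_sim, rng):
    """Placeholder only: scale a thesaurus score by a factor drawn from [0.4, 0.5]."""
    lam = rng.uniform(0.4, 0.5)
    return lam * base_sim


# Two unseeded generators will usually give different scores for the same pair.
print(scaled_thesaurus_similarity(1.0, random.Random()))
print(scaled_thesaurus_similarity(1.0, random.Random()))

# A fixed seed (or a fixed constant) makes the score repeatable.
a = scaled_thesaurus_similarity(1.0, random.Random(42))
b = scaled_thesaurus_similarity(1.0, random.Random(42))
print(a == b)
```

Because the combined score keeps the maximum, a pair near a decision threshold could be accepted in one run and rejected in another unless the factor is fixed or justified.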

In section 3.1.3, another correlation factor is computed by equation (5). Is this correlation factor the same thing as the correlation factor in equation (3), which is determined randomly?

Algorithm 1 gives no new information, because all the steps are also introduced in the article.

To deal with large-scale ontologies, the proposed approach compresses ontologies based on data field theory. But what the compression results are, and whether the compression influences the final mapping results, are not well discussed.

Detailed examples are given to explain the Needleman-Wunsch Algorithm-based Deterministic Mapping process. In this section, Chinese words in figures should also be accompanied by their English translations.

Two groups of experiments are presented in this paper: one for the ontology compression algorithm, and another for the whole mapping approach. For the compression results, we can only find the compression ratios of different ontologies. Is a higher compression ratio always better than a lower one? Are the results of ontology mapping with and without compression the same or different? How will the compression influence the mapping results? These questions should be discussed in the paper.

Generally, this paper presents an interesting ontology mapping approach for Chinese ontologies. But I think the author should address the above problems before publishing this paper. The writing and presentation of this paper should also be improved. For example, formulas and figures are in different sizes.

Review #3
Anonymous submitted on 24/Aug/2016
Review Comment:

This paper presents an ontology matcher specifically designed for aligning Chinese ontologies.

This matcher relies on edit distances and a distance computed from a Chinese thesaurus named TYCCL. In order to scale, the paper relies on a compression method based on data field theory.

Notions are used before being introduced. For instance, TYCCL is introduced in Section 3.1.2 but used in Section 2. The short presentation of it in the Introduction is not understandable: “Tian and Zhao introduced a TYCCL… environment”.
The term "OOV word" is used from Section 2 onwards but only explained in Section 3.3.

The explanation of TYCCL thesaurus should be more detailed.

The paper also fails to show the differences between matching Chinese ontologies and matching ontologies written in other languages. There are plenty of ontology matchers proposed in the literature. This paper has to justify why this method is more suitable for the Chinese language than other ontology matchers.

Data field theory seems to be an adaptation of field theory from physics to data. It allows similar objects to be clustered. Notions such as short-distance field potential are not explained. There are paragraphs, such as the third one of Section 3.2, that are not understandable at all.

The choice of such an approach deserves to be explained. Is it more suitable than classical clustering algorithms?

To summarize, even if the topic of Chinese ontology matching is important, this paper is not clear at all. The quality of writing has to be improved a lot.