Boosting Document Retrieval with Knowledge Extraction and Linked Data

Tracking #: 1754-2966

Marco Rospocher
Francesco Corcoglioniti
Mauro Dragoni

Responsible editor: 
Andreas Hotho

Submission type: 
Full Paper
Given a document collection, Document Retrieval is the task of returning the most relevant documents for a specified user query. In this paper, we assess a document retrieval approach exploiting Linked Open Data and Knowledge Extraction techniques. Based on Natural Language Processing methods (e.g., Entity Linking, Frame Detection), knowledge extraction allows disambiguating the semantic content of queries and documents, linking it to established Linked Open Data resources (e.g., DBpedia, YAGO) from which additional semantic terms (entities, types, frames, temporal information) are imported to realize a semantic-based expansion of queries and documents. The approach, implemented in the KE4IR system, has been evaluated on different state-of-the-art datasets, on a total of 555 queries and with document collections ranging from a few hundred to more than a million resources. The results show that expanding queries and documents with the extracted semantic content consistently outperforms retrieval that exploits only textual information; on a specific dataset for semantic search, KE4IR outperforms a reference ontology-based search system. The experiments also validate the feasibility of applying knowledge extraction techniques for document retrieval (i.e., processing the document collection, building the expanded index, and searching over it) on large collections (e.g., TREC WT10g).

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 02/Dec/2017
Minor Revision
Review Comment:

The paper concerns query expansion in document retrieval. The idea is to link documents to underlying linked data background knowledge from which additional search terms and semantic structures can be found. The paper presents a system called KE4IR with promising evaluation results that 1) support and extend some earlier results of the authors [7], 2) obtain better results than when using just hierarchical ontology structures in expansion, and 3) show some scalability results.

After the introduction, related works are nicely discussed, the approach is presented with enough technical details, and three substantive evaluations of the system using different datasets are presented. In my mind the paper presents some original and significant results related to the idea of exploiting background knowledge in IR. It is not a big surprise that combining different semantic layers in query expansion produces the best results, but papers like this are needed to test and prove intuitions such as this through actual implementations and evaluations. The structure and language of the paper are good. However, there were some problems in rendering the equations of the paper that need to be solved.

I therefore recommend accepting the paper; some comments are given below.

em dash --- should be used without spaces around

TYPE layer: tell how you do semantic disambiguation here. E.g., Gauss is also a unit.

inspired te -> inspired the

equation (1) layout is a mess

equation (2) layout is a mess

Sigma sign in text in a wrong position

symbols missing in equation (8)

pioneers' -> pioneers"

W(SEMANTICS) 6 -> what is this?

indicates -> indicate

Review #2
By Sébastien Harispe submitted on 07/Jan/2018
Review Comment:

The paper investigates the benefits of using Linked Open Data as well as Knowledge Extraction techniques for real-world document retrieval tasks. It presents and evaluates an approach based on semantic-based expansion of queries and documents that has been implemented in the KE4IR system. The proposed study is conducted respecting scientific standards, using well-recognized evaluation metrics and benchmarks; the source code of the system implementation is also made publicly available for further reuse and analysis. In addition to the provided source code, the paper is well written and provides enough details to both reproduce and fully understand the presented results.

Interestingly, the authors strengthen several findings of one of their previous studies (introducing KE4IR), and show on many large-scale datasets that document retrieval can be improved by defining systems that integrate indexing and querying approaches exploiting Linked Open Data and Knowledge Extraction techniques. I would like to stress the large and very much appreciated engineering and evaluation effort provided by the authors for developing, testing, and evaluating their system. Based on my understanding and analysis of this work, and even if the discussion part could have been extended to cover important aspects of document retrieval that are not discussed (mentioned hereafter), I recommend accepting the proposed work, which I consider very interesting.

Comments are provided hereafter – note that most of the following remarks are comments and not modifications that have to be made:

• To complete your state of the art, note that works have also explored modeling Ontology-based information retrieval using semantic similarity measures and aggregation operators, e.g. User centered and ontology based information retrieval system for life sciences, Ranwez et al. - this approach is different from traditional VSM extensions since it relies on direct assessment of semantic similarity analysis and is based on Yager's operators. Regarding state of the art, works related to question answering using Knowledge Representations could have also been mentioned.

• The use of the dot product instead of the cosine similarity could further be discussed (p. 6), since originally the choice of the cosine similarity was indeed made in order to only incorporate vector orientation, and to avoid distinguishing vectors based on their ‘magnitude’. Dropping ||q||_2 is indeed not a problem for your use case. However, the mentioned side effects of incorporating ||d||_2 lead to a deeper discussion. Indeed, by using a similarity expressed only using the dot product you implicitly consider that vectors are expressed in an orthonormal basis, which is very far from being the case considering your modeling (since symbolic features associated to so-called semantic terms, further associated to vector dimensions, are even linked by logical implications). This discussion is also related to the way the normalized frequency is further computed. An extended discussion on that aspect could be provided. I also recommend that the authors study works related to cosine extensions (e.g. dw-cosine) – you can for instance refer to Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model by Sidorov et al. 2014.

• To my eyes, w(l(t_i)) should not be incorporated to q_i but rather included in the definition of d . q, i.e., sum_{i=1}^n (d_i . q_i . w(l(t_i)) ); if not, following your argument we could define sim(d,q) = sum_{i=1}^n ( tf_d(t_i,d) . tf_q(t_i,q) . idf(t_i,D) . w(l(t_i)) );

• From a practical point of view, using sum_{l \in L} w(l) = 1 is sort of misleading since it looks like the semantics of the weight refers to the importance given to each layer, which is not the case considering that the number of dimensions associated to each of them is not the same (it could also be interesting to give their respective sizes). You should consider this remark when the results are discussed (e.g., p. 10). “a w(SEMANTICS) value of 0.0 means that only textual information is used (and no semantic content), while a value of 1.0 means that only semantic information is used (and no textual content).” Yes, but w(SEMANTICS) = 0.5 does not necessarily mean equal importance.

• When considering the TYPE semantic layer, a modeling based on types’ Information Content could also be interesting, e.g. see among works and references proposed by other authors, semantic similarity from natural language and ontology analysis Harispe et al (preprint on ArXiv). This could be used to modify the way the normalized frequency is computed. In addition, implicit and explicit mentions of a topic, e.g. TYPE, could also have been discussed, as it could be interesting to distinguish both cases, e.g. talking about Mathematicians (in general) and mentioning Mathematicians are two different things. Similarly explicit use of semantic relatedness (and not only indirect semantic similarity as you do) could also be used, e.g. Talking about vector spaces, matrices, eigenvalues, linear systems, Gaussian elimination, I indirectly refer to important concepts related to Linear Algebra; however, using the a priori knowledge your model considers none of those URIs would implicitly refer to Linear Algebra (e.g. in the Type semantic layer). An interesting way of incorporating this would be to integrate a ‘weak semantic layer’ that could for instance consider similarities of word embeddings (à la Word2Vec/Glove…).

• 5.3.3, for future work, it would be very interesting to provide the same results in a setting not only exploiting topic titles.
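A minimal sketch of the scoring remarks above (dot product versus cosine, where w(l(t_i)) is applied, and the soft cosine of Sidorov et al.). The term names, layer names and weight values are hypothetical toy values chosen for illustration, not taken from the paper or from KE4IR's code.

```python
import math

# Hypothetical toy setup: term -> layer, and assumed layer weights summing to 1.
LAYER_OF = {"t_text": "TEXTUAL", "t_uri": "URI", "t_type": "TYPE"}
WEIGHTS = {"TEXTUAL": 0.5, "URI": 0.25, "TYPE": 0.25}


def score_weight_in_sum(doc, query):
    """sum_i d_i * q_i * w(l(t_i)): the layer weight is part of the product d . q."""
    return sum(doc[t] * q * WEIGHTS[LAYER_OF[t]] for t, q in query.items() if t in doc)


def score_weight_in_query(doc, query):
    """Same score, but with w(l(t_i)) folded into q_i before the dot product."""
    weighted_q = {t: q * WEIGHTS[LAYER_OF[t]] for t, q in query.items()}
    return sum(doc[t] * q for t, q in weighted_q.items() if t in doc)


def cosine(doc, query):
    """Plain cosine similarity between two sparse vectors (dicts term -> value)."""
    dot = sum(doc[t] * q for t, q in query.items() if t in doc)
    norm_d = math.sqrt(sum(v * v for v in doc.values()))
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0


def soft_cosine(a, b, sim):
    """Soft cosine over dense lists a, b; sim[i][j] is the similarity of features i, j."""
    def form(x, y):
        return sum(sim[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0


doc = {"t_text": 1.0, "t_uri": 2.0, "t_type": 1.0}
query = {"t_text": 1.0, "t_type": 1.0}
long_doc = {t: 2.0 * v for t, v in doc.items()}  # same orientation, twice the magnitude

identity = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal basis: features are unrelated
related = [[1.0, 0.5], [0.5, 1.0]]    # assumed similarity between the two features
```

In this toy setting both weight placements give the same number, which suggests the remark on w(l(t_i)) concerns where the weight is defined rather than the resulting ranking. Doubling the document vector doubles the dot-product score but leaves the cosine unchanged, and the soft cosine of two vectors on different dimensions is zero under the identity matrix but positive once the features are declared related.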

Discussion could also be improved by mentioning:

• Aspects related to multilinguality (this can be a strength for your system),
• Management of multiple LOD resources (do we align resources first?),
• Use of uncertainty metrics related to disambiguation/NERC (why not incorporate this information into the model since it is of major importance?),
• Extensions to more refined state of the art IR models,
• And objectively, from a practical point of view, based on the comparison made with state of the art IR systems, considering the improvement we observe in Table 9 as well as the processing overhead mentioned on p. 8, is it really worth it? Is the general IR problem, by definition, not suited to the use of refined ‘contemporary’ semantic-based approaches?

Minor comments:
• Even if |v| is sometimes used to refer to the Euclidean norm, ||v|| or even ||v||_2 makes the reference to the L_2 norm unambiguous.
• log(0) is undefined in eq. 5; eq. 7 is undefined when the set in the denominator is empty.
• Mentions of PIKES and KE4IR (but not of FRED) are made using a specific font; it has to be changed if it is not made on purpose.
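The log(0) and empty-denominator remark can be illustrated with generic guards. This is only a sketch: eq. 5 and eq. 7 are not reproduced in this review, so the function names, the add-one smoothing and the zero fallback below are assumptions about one possible fix, not the paper's definitions.

```python
import math


def smoothed_idf(num_docs, doc_freq):
    """Idf-style logarithm with add-one smoothing (assumed form).

    The argument of the logarithm stays positive even when doc_freq is 0.
    """
    return math.log((num_docs + 1) / (doc_freq + 1))


def normalized_frequency(count, group):
    """Ratio normalized by the size of a set (assumed form).

    Returns 0.0 for an empty set instead of dividing by zero.
    """
    return count / len(group) if group else 0.0
```

Whether smoothing or an explicit case distinction is preferable depends on how the two equations are actually defined in the paper.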

Review #3
By Harald Sack submitted on 30/Jan/2018
Major Revision
Review Comment:

The authors propose, implement and evaluate a document retrieval system that applies advanced knowledge extraction technology and exploits Linked Open Data to improve retrieval performance. In particular, the query as well as the documents are subject to entity linking, type determination, frame extraction, as well as temporal analysis. A simple vector space model for retrieval is enhanced with entities (DBpedia URIs), types of those entities (via exploiting related LOD resources), semantic frames (via frame extraction), as well as temporal information (explicitly given in the text or indirectly derived via exploiting LOD resources of the previously determined entities). The authors apply existing state-of-the-art knowledge extraction tools (PIKES) to extract semantic content from textual resources, which are furthermore represented as URIs from corresponding LOD resources. The authors provide the sources for the implementation as well as the used evaluation datasets online. For the evaluation, two specialized datasets for semantic search (WES2015, Fernandez&al2011) as well as standard retrieval benchmarks (TREC6/7/8/9/2001) were applied. The implemented system shows an improvement over text-based document retrieval as well as a slighter improvement over existing semantic search approaches (Fernandez&al2011).

The paper is well written and easy to follow. It definitely fits the topics of SWJ well. The originality of the paper, on the other hand, is limited by the fact that the presented system was already introduced in another publication of the authors [7] and only the evaluation was extended for this paper. The significant result that semantic technologies are capable of enhancing (text-based) document retrieval has already been shown before.


1. Since a semantic analysis has to be performed on both the documents and the user query to extract the underlying entities, context is required for correct disambiguation. How is context derived for the user query, especially for ambiguous queries? Does the entity linking always decide on one matching entity in the query, even though two or more choices might be equally likely (as, e.g., for the query q526, „What is BMI?“, in Table 6)? How well did the entity linking work for the queries? As the authors explicitly mention potential errors in the entity linking process, please also provide information about the quality of entity linking (at least for the semantic search datasets WES2015 and Fernandez&al2011).
2. For the TIME layer (p.4/5), the authors mention that different granularity levels were applied (ranging from days to centuries). If I understand Table 1 correctly, the final scoring for q related to time results in an emphasis on the finer granularities. Please explain why this has been chosen that way.
3. (Minor) to keep consistency with the rest of the paper, please describe the Time layer after the Frame layer
4. (Minor) p.6, 3.2, First sentence „The KE4IR retrieval model is inspired BY the Vector Space Model“
5. Also in 3.2, the authors claim that the number of semantic annotations in a document is not proportional to the document length. Please explain why.
6. The authors claim in p.8 that „RDFS reasoning is applied to materialize inferences“. Does this mean that all (!) potential inferences (simple inferences, RDF inferences, RDFS inferences) are applied?
7. As DBpedia Spotlight is used for Entity Linking, which parameter set has been used (optimized for precision vs. recall)?
8. The achieved results in Table 4 (WES2015) should also be compared directly against the results of [8]. Later in the text, the authors claim that a direct comparison would not be possible, since [8] operates on a perfectly annotated dataset. The authors also claim that the system of [8] might provide slightly better retrieval results (in some cases) because of that. This can only be assumed unless it is evaluated further. However, WES2015 is available as a fully annotated dataset. Thus a direct comparison with KE4IR would be possible by simply using the semantic annotations provided with WES2015. Please compare KE4IR with the semantic search system of [8].
9. For Table 4, a query-wise comparison for WES2015, as presented in Table 6, would be interesting to determine the influence of the queries on the achieved retrieval results. Likewise, the influence of the different components (entities, types, frames, time) for the individual queries would be helpful (see also comment 13).
10. For table 4 the achieved improvement of KE4IR vs. the text-based baseline is rather small compared to the other benchmarks. Why is this the case? What makes this dataset either „harder“ for KE4IR, or „easier“ for the text-based baseline?
11. However, with only a small improvement (as shown in Table 4), is it really worthwhile to invest in semantic search at all? The authors don’t provide a quality-based evaluation, which could indicate at least a user preference for one of the compared systems, which might additionally support the achieved quantitative results.
12. On p.14, the authors note that for both compared systems, the generally achieved quality for „title-only“ is better compared to „title + description“. They claim „This should not surprise given the nature and typical content of topic descriptions.“ Could you please be more precise and elaborate on that?
13. If the research hypothesis is only to show that including semantic technology in the retrieval process achieves better results compared to a mere text-based baseline, then this has already been shown before (as in the mentioned papers [8, 34]). It would be interesting to point out, when comparing different semantic search systems (as with [8] and [34]), which „semantic“ components are responsible for (further) improvements (compared to the semantic technologies used in the competing systems).
14. (Minor) p.2, last line left column „These new evaluations:“