Information Extraction meets the Semantic Web: A Survey

Tracking #: 1744-2956

Jose L. Martinez-Rodriguez
Aidan Hogan
Ivan Lopez-Arevalo

Responsible editor: 
Andreas Hotho

Submission type: 
Survey Article
We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Term Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/Feb/2018
Minor Revision
Review Comment:

The article presents a survey of Information Extraction systems in the context of the Semantic Web. In particular, it discusses the extraction and linking of entities, concepts, and relations for unstructured and semi-structured resources.

Although it describes fairly recent techniques in the field (with details of features, background resources, etc.), important points of discussion are still missing. In particular, I would love to see a deeper analysis of the design differences between these systems (or at least some prominent systems) in all main sections. For example, a suggestion for Section 2: some systems (e.g., AIDA, J-NERD) focus on output quality while others (e.g., AIDA-light, TagMe) focus on speed. Similarly, some systems (e.g., AIDA variants) only work on named entities while others (e.g., DBpedia Spotlight and TagMe) also include Wikipedia concepts. There are some such discussions here and there in the article, but they should be presented explicitly.
Are there pros and cons to using a given system? Are there pros and cons to using a given architecture (e.g., in Section 4, page 43, EEL followed by Open IE vs. Open IE followed by EEL)? Are there ways to tune the trade-off between precision and recall? This would be useful for people who want to choose an off-the-shelf tool for further work.

Additionally, as the article touches on many tasks, it would be great if the authors could discuss which tasks are relatively well studied and which are still promising for new research. Are there open problems that still need to be addressed? This would be useful for researchers or PhD students who want to work in this field.

All in all, the manuscript provides a comprehensive survey of an important field (i.e., Information Extraction) that is highly relevant to the Semantic Web community. In general, it is well written and easy to follow; however, it can be improved.

Minor comments:

1. Descriptions of systems should be checked more carefully, for example: 1.1.) the main difference between AIDA and KORE is the semantic relatedness computation between entities. Even though this point is discussed on page 19, I see it neither in the main descriptions of the two systems nor in Table 1; 1.2.) AIDA-light uses Stanford NER to spot mentions, and only takes sliding windows over the text for extracting the context of a mention (page 13); etc.

2. Title of Section 3 (i.e., Concept Extraction & Linking) may be misleading. E.g., at first, I thought this section is only about Word Sense Disambiguation.

3. Different writing styles (e.g., marry relation on page 2 and meet relation on page 45).

4. Typo on page 10 "...consider Wikipedia as a KB ... as a reference KB".

5. It would be helpful to report some state-of-the-art results on some prominent corpora.

6. Some citations are needed (e.g., on page 40 "Some systems rely on traditional RE processes, where extracted relations are linked to a KB after extraction...").


Review #2
Anonymous submitted on 13/Mar/2018
Minor Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

The paper "Information Extraction meets the Semantic Web" offers a very comprehensive survey of over two decades of research in areas related to named entity recognition and linking, topic modelling, keyword extraction, relation extraction, etc., restricted to their use in Semantic Web contexts. It could be a valuable resource for people who need already digested information on the variety of systems developed in the area.
In my opinion, the paper is well written and follows a coherent organization of the covered topics; however, it is also extremely long and dense, making it difficult to follow at times. Too many works are presented independently of one another, making it difficult to understand their contributions or the advantages they brought when developed. At the same time, comparisons between the different approaches are minimal, if given at all: that is, the survey is descriptive rather than critical.
The paper includes a number of tables of the approaches that one might expect to use for comparison purposes; however, the information in the tables is sometimes unclear or incomplete, for example what GATE indicates under “Entity” in Table 4 or “features” in RE in Table 6.
Some specific comments:
1. web=> Web
2. in analysis => in an analysis
3. seen as two-way => what?
4. how are keywords different from terms?
5. how is your report different from the book by Maynard et al.?
6. why do you mention Data Mining?
7. why do you tell us about the appendix before explaining your organization?
8. Listing 1, start in new page
9. Why so many systems wth advantages and disadvantages of each
10. Some systems are reported with year (Approach (YEAR) citation) while others are not
11. Table 1 difference between keywords and strings?
12. Table 1 unclear what disambiguation method is
13. parenthesis ( does not work the syntax if removed
14. , for example, entities are linked => remove “for example”
15. you mention efficiency: any information on algorithmic complexity?
16. Wikipedia is mentioned a lot, what about domains not covered there?
17. References to deep parsing methods?
18. features for entity linking (2.1.1) unclear how used
19. “As such, the confidence of a mention computed during recognition may become a feature for disambiguation” => why?
20. What are Wikipedia “abstracts”?
21. Kan-Dis or KanDis ?
22. What is a high-dense entity?
23. ConLL or CoNLL?
24. Method to select works for this survey should be at the beginning and for the whole survey...
25. Table 3 - relation “statistical” ? 2008 under extraction??
26. Hierarchy Induction - looks more argumentative than the rest of survey
27. Topic extraction section - goes more in depth in the approaches, so it is more useful in a sense.
28. (Lemon) => remove parenthesis
29. et al => et al.
30. improve machine readability => not appropriate term...
31. these boxed => which ones?
32. Table 7 is unclear, how did you come up with the counts?

Just another hint: The authors may also look at work carried out at UPF on entity linking in the music domain (a niche area, not Wikipedia) and taxonomy induction (in several domains).

Oramas et al. 2016. ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain.
Espinosa-Anke et al. 2016. Supervised Distributional Hypernym Discovery via Domain Adaptation.
Espinosa-Anke et al. 2016. ExTaSem! Extending, Taxonomizing and Semantifying Domain Terminologies.

Review #3
By Simon Scerri submitted on 14/Mar/2018
Minor Revision
Review Comment:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic
Very good, a researcher new to the area can get a good overview of the current approaches.

(2) How comprehensive and how balanced is the presentation and coverage
Very comprehensive, it covers almost every existing approach in the state-of-the-art.

(3) Readability and clarity of the presentation.
Very good, well-written and well-structured.

(4) Importance of the covered material to the broader Semantic Web community
It nicely covers the overlap of Semantic Web and Information Extraction.

* Summary
This article presents a detailed survey of existing Information Extraction (IE) approaches that are in some way related to the Semantic Web. The relation can be either using ontologies, Knowledge Bases (KBs) or any of the core Semantic Web standards (RDF, OWL, ...) for IE, or vice versa (enriching ontologies, KBs, KGs using IE). The focus of this survey is on the extraction/linking of three main elements from unstructured/semi-structured text: entities (usually corresponding to ontology instances); concepts (terms or keywords); and relations (corresponding to ontology properties). In each section, the authors compare and contrast state-of-the-art approaches, and at the end of each section they provide some suggestions for future work.

* Strong points
- The survey is comprehensive and it covers important aspects of the field. In each section, there is a comparison table of different approaches based on some criteria with which a reader can have a quick overview of the current methods.
- There is a good rationale behind each categorization of the current approaches, which shows the authors' command of the field.
- The summary at the end of each section is very helpful for those who seek where to start.
- The article is well-written. It is easy to follow the paper.

* Weak points
- In the survey methodology (page 5), some numbers were expected. How many papers were collected during the first run? How many were removed? How many were added after reconsidering the methodology? Even on the webpage there is no information about this.
- In Section 3.6, covering the evaluation of concept extraction/linking, Inter-Annotator Agreement (IAA) measures such as Cohen's and Fleiss' kappa should be discussed. Although partly addressed in Section 4.7, these measures are also relevant for Section 3.
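
For readers unfamiliar with the agreement measures mentioned above, the following is a minimal sketch of Cohen's kappa for two annotators. It is not taken from the manuscript under review; the function name `cohen_kappa` and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations: is each candidate term a domain concept or not?
annotator_1 = ["concept", "concept", "other", "concept", "other", "other"]
annotator_2 = ["concept", "other", "other", "concept", "other", "concept"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Fleiss' kappa generalizes the same idea (observed agreement corrected for chance agreement) to more than two annotators.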

* Specific comments
- I recommend that the authors reconsider the boundaries of arguments within the paragraph structure. Some paragraphs refer, sometimes vaguely, to claims in the previous paragraph; e.g., on page 18, the last two paragraphs start with "such measures/approaches", which is confusing for the reader. Another example is on page 28, where the second paragraph starts with "Otherwise". Furthermore, on page 17, it would be better to itemize the process beforehand rather than starting each paragraph with an ordinal number.

- SProUT [1] is another NLP tool for IE; it could fit into Table 1.
- In the term extraction section, GATE TermRaider [2] could be considered.
- GATE owlExporter [3] is another application for ontology population. It could be included in Table 3 (page 33).
- There is a recent paper on HDSKG that is closely related to this survey, and I suggest that the authors include it as well [4].
- The title of Table 3 contains a minor error: the columns use *Extraction*, not *Recognition*.