TermitUp: Generation and Enrichment of Linked Terminologies

Tracking #: 2693-3907

Authors: 
Patricia Martin-Chozas
Karen Vázquez-Flores
Pablo Calleja
Elena Montiel-Ponsoda
Víctor Rodríguez-Doncel

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Tool/System Report
Abstract: 
Domain-specific terminologies play a central role in many language technology solutions. Substantial manual effort is still involved in the creation of such resources, and many of them are published in proprietary formats that cannot be easily reused in other applications. Automatic Term Extraction tools help alleviate this cumbersome task. However, their results are usually in the form of plain lists of terms or unstructured data with limited linguistic information. Initiatives such as the Linguistic Linked Open Data cloud (LLOD) foster the publication of language resources in open structured formats, specifically RDF, and their linking to other resources on the Web of Data. In order to leverage the wealth of linguistic data in the LLOD and speed up the creation of linked terminological resources, we propose TermitUp, a service that generates enriched domain-specific terminologies directly from corpora, and publishes them in open and structured formats. TermitUp is composed of five modules performing terminology extraction, terminology post-processing, terminology enrichment, term relation validation and RDF publication. As part of the pipeline implemented by this service, existing resources in the LLOD are linked with the resulting terminologies, contributing in this way to the population of the LLOD cloud. TermitUp has been used in the framework of European projects tackling different fields, such as the legal domain, with promising results. Different alternatives for modelling enriched terminologies are considered, and good practices illustrated with examples are proposed.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Apr/2021
Suggestion:
Minor Revision
Review Comment:

This paper presents the development of a tool called 'TermitUp' for extracting and publishing terminology. The tool is an interesting combination of existing tools and is designed to smooth the path of extracting terminology and publishing it as linked data. There seems to be some evidence that it is being applied in projects such as LYNX, and as such there is some value in this tool. I think the paper would be improved if there were a comparison with existing tools such as PoolParty, SketchEngine, Saffron and VocBench, as it is not clear what exactly this tool adds.

I did not understand 'Module 4' of the paper. The goal is to filter relations from linked resources using ConceptNet. There seems to be an assumption that terms are compositional, which is not true in general (maybe some reference to the work of Carlos Ramisch and Agata Savary would be appropriate here). I also did not understand the approach: the authors define some mathematical notation (e.g., s1t1) but do not give a formula for it. This method also has no evaluation, so we cannot see whether the assumptions the authors made here are correct.

The paper shows quite some impact within the range of influence of the authors, with this system being used in a couple of EU projects and industry projects involving UPM. The public GitHub repository (NB: the link in footnote 43 is missing) shows a strong potential for future impact and the fact that this repository has been 'starred' by several people beyond UPM suggests that this work is already having some impact beyond the influence of the authors. It would further strengthen the paper if these impacts could be investigated and documented for the final version of the paper.

As a more general comment, what I feel is missing from TermitUp, given its reliance on automatic term extraction and sense disambiguation, is any idea of how to include a human 'in the loop' of the system. Is the expectation that automatic results are published straight to the LLOD cloud with no validation? Is there, thus, a risk of publishing low-quality data?

The authors should be aware that scientific areas of study are not normally capitalised in English, e.g., 'automatic term extraction'. This occurs on p1l24, p2l8-11L, p2l22L, p2l20-24R, p4l17-23R, p8l12R

p1l6L '*in* light of'
p2l29R 'the its'
p2l23-31R 'Section X' should be capitalised
p3l14R 'consists *of* adding'
p4l9L 'half of them *being*'
p4l39L 'definitions *are* scarce'??
p5l34L 'to *make* explicit'
p5l41L 'termbanks'
p5l39R 'term variation or synonym*y*'
p5l41R 'reused'... this word makes no sense here
It would be better to call Module 3 'Terminology Enrichment'
p8. Could you use subscripts and superscripts in the mathematical formulae? Or remove them entirely as there is no formal mathematical reasoning in this paper?
p9l29L. 'post-tagging' do you mean 'POS-tagging'?
p12l1R. Odd hyphen before 'we'

The references section could be improved:
7, 9, 31 and 57 lack any note about where they are published
15 includes 'itri-04-08' that looks like junk text
15, 21, 22, 23, 33 and 35 could have more complete citation information (volume, number, pages?)
For 42 you could use the full conference name as you do for 29

Review #2
Anonymous submitted on 05/Apr/2021
Suggestion:
Accept
Review Comment:

This article presents TermitUp, a service to generate domain-specific terminologies directly from corpora. Starting from a domain-specific corpus in one language, TermitUp generates a multilingual terminology in open formats (JSON-LD, SKOS or Ontolex-Lemon) enriched with data from the LLOD.

The TermitUp architecture relies on five interdependent modules: a terminology extraction module that operates on corpora and is based on the TBXTools service; a terminology post-processing module based on linguistic patterns (to exclude non-terminological structures); a terminology enrichment module that draws on data from the LLOD; a term relation validation module that checks the correctness of extracted relationships such as synonymy; and a publication module that outputs open formats, namely JSON-LD, SKOS or Ontolex.
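To make the flow of these five stages easier to picture, here is a minimal, purely illustrative sketch of such a pipeline in Python. Every function name and every piece of logic below is a hypothetical placeholder invented for this example; none of it corresponds to TermitUp's actual implementation or API.

```python
# Purely illustrative sketch of how the five modules described above could be
# chained. Every function here is a hypothetical placeholder with toy logic.
import json
from typing import Dict, List

def extract_terms(corpus: List[str]) -> List[str]:
    """Module 1: candidate term extraction (TermitUp relies on TBXTools for this step)."""
    return sorted({token.lower().strip(".,;") for doc in corpus
                   for token in doc.split() if len(token) > 3})

def postprocess(terms: List[str]) -> List[str]:
    """Module 2: discard non-terminological structures (here, a trivial stoplist filter)."""
    non_terms = {"shall", "therefore", "whereas", "monthly"}
    return [t for t in terms if t not in non_terms]

def enrich(terms: List[str]) -> Dict[str, dict]:
    """Module 3: attach definitions, translations and related terms from LLOD resources."""
    return {t: {"definition": None, "translations": [], "related": []} for t in terms}

def validate_relations(entries: Dict[str, dict]) -> Dict[str, dict]:
    """Module 4: keep only relations supported by an external resource (e.g. ConceptNet)."""
    return entries  # no-op in this sketch

def publish(entries: Dict[str, dict], fmt: str = "json-ld") -> str:
    """Module 5: serialise the terminology in an open format (JSON-LD, SKOS or OntoLex)."""
    return json.dumps({"format": fmt, "entries": entries}, indent=2)

if __name__ == "__main__":
    corpus = ["The lessee shall pay the rent to the lessor on a monthly basis."]
    print(publish(validate_relations(enrich(postprocess(extract_terms(corpus))))))
```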

The issues addressed by the authors, namely building terminologies from corpora, are still open. The authors’ proposition is convincing. The article is well written, structured and illustrated. The state of the art is sufficiently exhaustive.

Section 7, “Discussion”, raises interesting issues. Consider, for instance, the debate about skos:definition, which can be attached either to skos:Concept or to ontolex:LexicalSense. This leads to the distinction between sense and reference, and consequently to representing concepts explicitly. Such an approach would be more concept-oriented, as promoted by the GTT and the ISO standards (1087, 704), and as illustrated by the ontoterminology approach.
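As a concrete illustration of the two attachment points mentioned above, the following minimal sketch (written with rdflib; the URIs, labels and the definition text are invented for the example, not taken from the paper) serialises both alternatives side by side: a definition placed on the skos:Concept, and a definition placed on the ontolex:LexicalSense that mediates between the lexical entry and the concept it refers to.

```python
# Minimal sketch of the two modelling alternatives for skos:definition,
# built with rdflib; all example URIs are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/term/")

g = Graph()
g.bind("skos", SKOS)
g.bind("ontolex", ONTOLEX)

definition = Literal("A person who holds the lease of a property.", lang="en")

# Alternative 1: concept-oriented, the definition is attached to the skos:Concept.
g.add((EX.lessee_concept, RDF.type, SKOS.Concept))
g.add((EX.lessee_concept, SKOS.prefLabel, Literal("lessee", lang="en")))
g.add((EX.lessee_concept, SKOS.definition, definition))

# Alternative 2: sense-oriented, the definition is attached to the
# ontolex:LexicalSense linking the lexical entry to the concept it evokes.
g.add((EX.lessee_entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((EX.lessee_sense, RDF.type, ONTOLEX.LexicalSense))
g.add((EX.lessee_entry, ONTOLEX.sense, EX.lessee_sense))
g.add((EX.lessee_sense, ONTOLEX.reference, EX.lessee_concept))
g.add((EX.lessee_sense, SKOS.definition, definition))

print(g.serialize(format="turtle"))
```

The serialised output makes the contrast immediate: in the first pattern the definition describes the concept regardless of language or lexicalisation, while in the second it is tied to one specific lexical sense.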

page 12, line 6: replace skos:LexicalSense by ontolex:LexicalSense

Review #3
Anonymous submitted on 16/May/2021
Suggestion:
Minor Revision
Review Comment:

This paper introduces TermitUp, a service aiming at the generation of domain-specific terminologies directly from corpora, semantically enriched with data from existing language resources in the Linguistic Linked Open Data (LLOD) cloud, and published in open and structured formats.

Overall, the paper is well organized and clear, with a comprehensive section on the background and relevant previous work, especially concerning the challenges of i) automating the generation of terminological resources, and ii) the underlying interlinking process. It also comprises a thorough description of TermitUp's requirements and architecture, and illustrates its current and potential impact. The final sections address some of its limitations, along with future work aimed at tackling those issues. TermitUp is available on both GitHub and Zenodo, although the GitHub link has not been provided by the authors in footnote 43. I could reach it, nonetheless, via the Prêt-a-LLOD website.

This work, developed within the scope of an H2020-funded project (Prêt-a-LLOD), represents an ambitious and relevant endeavour within the current research landscape of terminology work and its connection to Linguistic Linked Data. On the one hand, it leverages the existing resources in the LLOD cloud, benefitting from the semantic enrichment potential that these datasets entail, and integrates a set of previously isolated technologies into a seemingly robust pipeline. On the other hand, by resorting to both SKOS and Ontolex modelling within a legal use case, and to the subsequent feeding of a SPARQL endpoint, TermitUp provides flexibility to the end-user while also addressing the requirements focused on reusability and standardisation, as well as on open source and ease of access (#4 and #6, respectively).

As regards the system architecture, there is added value in the disambiguation features in Module 3, as well as in the term relation validation features in Module 4, which resorts to ConceptNet. It is also my understanding that the ongoing challenges involving SKOS and Ontolex modelling, described in Section 7 of the paper, and the subsequent proposal put forward by the authors at [https://www.w3.org/community/ontolex/wiki/Terminology](https://www.w3.org/community/ontolex/wiki/Terminology) constitute a relevant starting point for the discussion, within the community, on how to model terminological resources as Linked Data, and might help boost more fine-grained representation models.
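On the relation validation idea mentioned above, the sketch below shows one plausible, deliberately simplified way to check an extracted synonymy relation against ConceptNet's public API. It is a hedged illustration only, not the actual logic of Module 4, and the example terms are chosen arbitrarily.

```python
# Hedged illustration: query ConceptNet's public /query endpoint for a Synonym
# edge between two English terms. This is not TermitUp's Module 4 code.
import requests

def conceptnet_supports_synonymy(term_a: str, term_b: str, lang: str = "en") -> bool:
    """Return True if ConceptNet contains a Synonym edge between the two terms."""
    node_a = f"/c/{lang}/{term_a.lower().replace(' ', '_')}"
    node_b = f"/c/{lang}/{term_b.lower().replace(' ', '_')}"
    response = requests.get(
        "http://api.conceptnet.io/query",
        params={"node": node_a, "other": node_b, "rel": "/r/Synonym"},
        timeout=10,
    )
    response.raise_for_status()
    return len(response.json().get("edges", [])) > 0

if __name__ == "__main__":
    print(conceptnet_supports_synonymy("buy", "purchase"))
```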

One of the main challenges concerning the future development of this service, in my opinion, is how TermitUp will scale to other domains and languages, and how the potential issues resulting thereof will be handled. As regards future work, and given its inherent complexity, it will also be interesting to see how the additional module allowing the extraction of domain-specific relations will unfold. Furthermore, I can certainly see the advantage of publishing the resulting terminologies in Terminoteca RDF, at least in the short term.

In my opinion, however, the biggest challenge lies in the fact that some of the existing resources in the LLOD cloud either lack curation or, even worse, can become inactive as soon as their respective projects end, which would make the service more cumbersome or, ultimately, hinder it altogether. The authors refer to this briefly in the paper and seem to be aware of such risks. In fact, outlining and setting up effective quality control processes regarding the resources pertaining to the LLOD cloud represents a necessary discussion within the community that is currently ongoing and which is certainly beyond the scope of this paper.

In conclusion, this is a fairly comprehensive paper overall, following the guidelines underlying the "Tools and Systems Report" articles, and it entails both ambitious and promising research. Its development within the Prêt-a-LLOD project, which collaboratively integrates several stakeholders, clearly demonstrates a level-II impact. This work does, however, have the potential to go beyond its original Prêt-a-LLOD scope and impact other research groups within the community (level III), benefitting from different use cases where it could be put to the test. By tackling the aforementioned challenges brought about by other languages and domains, this service could, when stabilized, be accessible to (and used by) various researchers. In addition, TermitUp could, in my opinion, successfully integrate future educational materials on the topic of LLOD.

I would therefore just suggest some minor revisions:

**Content**:

- Although the paper describes how TermitUp has been successfully deployed in another H2020-funded project (Lynx), and that the service is currently being applied to other recently funded projects, it would be pertinent to have access to more concrete results, from both a quantitative and qualitative standpoint, on how TermitUp ultimately helped improve the projects' pipeline and/or outputs (namely in Lynx).
- On page 6, section 5.1., could you provide more concrete data - namely from the "preliminary study" you refer to - to support your claim that Freeling's performance was not satisfactory when compared to other POS taggers for Spanish?
- On page 9, end of section 5.4, the "noun-verb" pattern is repeated.
- On page 11, end of section 6: although the SmarTerp project appears to be at its onset, it would be relevant to provide more concrete input on which "extra information" would be supplied to interpreting professionals in this regard.

**References**:

- It might be relevant to include the direct reference to Meyer's paper on Knowledge-Rich Contexts (2001): [https://benjamins.com/catalog/nlp.2.15mey](https://benjamins.com/catalog/nlp.2.15mey)
- In your first mention of Ontolex (p. 9, section 5.5), it would be pertinent to include at least a reference, perhaps to the core model: [https://www.w3.org/2016/05/ontolex/](https://www.w3.org/2016/05/ontolex/) or to John P. McCrae, Paul Buitelaar, and Philipp Cimiano. 2017. The OntoLex-Lemon Model: Development and Applications. In Proceedings of eLex 2017, pages 587–597. INT, Trojína and Lexical Computing, Lexical Computing CZ s.r.o.
- Please include the GitHub link in footnote 43
- Footnote 44 was left blank as well → please renumber the footnotes accordingly

**Linguistic issues/typos:**

- Please replace "aroused" throughout the paper (e.g. with "arose" on p. 12, l. 3, left column)
- p. 2, l. 28, right column → eliminate "the" between "exposes" and "this"
- p. 4, l. 15, left column → "in the cloud" instead of "int the cloud"
- p. 4, l. 32, left column → "bilingual" and "English" instead of "biligual" and "Enlish"
- p. 5, l. 40, right column → "specificity" instead of "specifictiy"
- p. 5, l. 45, left column → "hierarchical" instead of "hierachical"
- p. 9, l. 16, Table 3 caption → "whose RDF version" instead of "which RDF version"
- p. 11, l. 18, left column → "requirement 6" instead of "requirement 7"
- p. 11, l. 43, right column → "or an individual" instead of "and even, or an individual"
- p. 12, l. 36, right column → "terms nor rich linguistic descriptions" instead of "terms and nor rich..."