The euBusinessGraph Ontology: a Lightweight Ontology for Harmonizing Basic Company Information

Tracking #: 2421-3635

Dumitru Roman
Vladimir Alexiev
Javier Paniagua
Brian Elvesæter
Bjørn Marius von Zernichow
Ahmet Soylu
Boyan Simeonov
Chris Taggart

Responsible editor: 
Oscar Corcho

Submission type: 
Ontology Description
Company data, ranging from basic company information such as company name(s) and incorporation date to complex balance sheets and personal data about directors and shareholders, are the foundation that many data value chains depend upon in various sectors (e.g., business information, marketing and sales, etc.). Company data becomes a valuable asset when data is collected and integrated from a variety of sources, both authoritative (e.g., national business registers) and non-authoritative (e.g., company websites). Company data integration is, however, a difficult task, primarily due to the heterogeneity and complexity of company data, and the lack of generally agreed upon semantic descriptions of the concepts in this domain. In this article, we introduce the euBusinessGraph ontology as a lightweight mechanism for harmonising company data for the purpose of aggregating, linking, provisioning and analysing basic company data. The article provides an overview of the related work, ontology scope, ontology development process, explanations of core concepts and relationships, and the implementation of the ontology. Furthermore, we present scenarios where the ontology was used, among others, for publishing company data (business knowledge graph) and for comparing data from various company data providers. The euBusinessGraph ontology serves as an asset not only for enabling various tasks related to company data but also as a foundation on which various extensions can be built.

Major Revision

Solicited Reviews:
Review #1
By Vojtěch Svátek submitted on 05/May/2020
Major Revision
Review Comment:

The paper has been submitted as an Ontology description, and indeed has an ontology as its main subject. The euBusinessGraph Ontology (EBG) is an interesting resource, and deserves publicity within the semantic web community as well as in the business info domain community. The model is freely available (on github); I am however not sure about the license (at least, the paper does not mention it).

As the main downside, I am struggling with the overall scope of the paper, which also includes topics that are only partially related to the ontology itself and significantly increase the paper’s size. The category of „Descriptions of ontologies“ is defined at the SWJ website as „short papers describing ontology modeling and creation efforts“. The paper however has 39 pages (which would have been quite a lot even for a full paper!), of which:
- 10 address the motivations, SotA, requirements and the development process
- 14 provide a reference overview of the ontology
- 13 describe use cases and follow-up projects
- 2 contain the biblio.
The first part is what I would truly expect in an onto paper.
The second part already makes the paper a bit longish, but prevents the reader from having to peep into some documentation/tutorial in parallel, so might still be acceptable, too.
However, the third part is, in my opinion, beyond the scope of a paper of this kind. I could imagine that *very short* (say, 2-3 pages in total) descriptions of use cases, with links, might be relevant for an onto paper. A table showing which parts of the ontology have been used in each use case could be nice, too. But not those 13 pages here: most of the text either does not refer to the ontology at all or only mentions it in an uninteresting way. No modeling challenges addressed while adapting legacy datasets to the ontology are mentioned (e.g., Fig. 20 simply shows that some datasets may use properties not used in other datasets – but that is an inherent feature of the graph model of linked data, not an added value of the EBG ontology by any means). Instead, features of the authors’ company/institute platforms/applications (DataGraft, GraphDB, or the euBG Marketplace) or additional models (ONTO-CG) are described. They might perhaps deserve their own (dataset, application, ontology, or tool/system) papers in SWJ, or be part of some overview paper of the euBusinessGraph EU project, but should definitely not overcharge the current ontology paper. Even the SWJ instructions recommend to only include “*pointers* to existing applications or use-case experiments”.

As regards the scientific and practical impact of the paper (only considering its parts which I perceive as relevant):
- There are no major scientific challenges addressed: the ontology structure mainly looks straightforward, its development has mostly been based on the common practice in the community, and the size of its *novel* parts is relatively small. Actually, it is to a large degree an abstract/prototype dataset schema (reusing many ontologies and complementing them with a few entities in a new namespace) rather than a compact ontology.
- OTOH, on the background of the pretty comprehensive SotA review, such a model might still be useful in practice, even beyond the consortium of the euBusinessGraph project.

An interesting aspect that would be worth elaborating is the relationship between the ontology and the EU legal space. Since the ontology has ‘eu’ in its name, it should pay specific attention to the common knowledge assets of the EU. While it may have been the case during its development, it is not properly explained which parts of the model are EU-specific (esp. those linked to particular codelists?) and what possible caveats might appear if it were used for non-EU data.
And, more specifically: the ontology has been built, to some degree, bottom-up, leveraging on datasets provided by four of the organizations of the co-authors (cf. p.6, lines 33-36). The paper should contain some distinct posterior evaluation on external, ‘retained’ company data sources, indicating that this bootstrap has presumably not induced major gaps or distortions.

As regards the ontology statistics: I am rather confused by the fact the authors list the numbers of classes, OPs and DPs, but do not distinguish how many of them are just reused from existing namespaces (and how many from which), and how many are newly defined. Furthermore, the authors obviously aim to provide the ontology in order to fill a gap in the ‘market’; however, there is no explicit discussion on those specific (smaller) gaps in the domain that had to be filled with those new ‘ebg:’ entities, such as WebResource or IdentifierSystem. Were there no alternatives for any of these, at all? Or not close enough? Or, not authoritative/popular enough to deserve reuse?
I am generally favorable towards large-scale reuse of entities from existing ontologies provided those are well visible and respected, and the realization of this reuse by the authors looks sound. Yet, the chosen approach – direct reuse by replication – is just one of several possible reuse options, aside, say, the creation of proxy entities in the new namespace, the direct reuse by import (of whole ontologies) or the reuse by reference (w/o a proxy). The authors should attempt a discussion of the pros and cons of their solution. Why and how would, for example, the chosen reuse model influence the adoption of their ontology? How convenient is the multi-namespace approach for the data publisher? See, e.g., the older study by Schaible [1].
[1] Johann Schaible, Thomas Gottron, Ansgar Scherp: Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling. ESWC 2014: 457-472
It is also unclear to what degree the various SKOS concept schemes referenced are a ‘part’ of the ontology or not. Can someone claim to completely use the ontology while choosing their proprietary codelists instead of those referred to in the ontology?

As regards the entity namespaces display in the paper, the authors silently introduce the following convention: in the diagrams, the prefix is appended, in braces, to the entity short name, followed with the cardinality; in text the short name is used only. This saves space and allows to focus on the semantic content; however, it also incurs some paging back and forth when the reader wants to recall the actual namespace (even in two steps, to Table 1, if the prefix is not familiar). OK, but it is again a choice that should be explicitly introduced and justified.

The quality of English in the paper is very good. There are just a few typos here and there. A technical problem related to typography is however the low resolution of the diagrams.

Detailed comments to the content:
- p. 3, 32-35: “we look specifically at works dealing with basic information about companies, covering organizational structures of companies, economical classifications of companies, company identification schemes, and locations of companies”. The notion of ‘basic’ company info, as the scope of the ontology, is not properly explained. What criterion is used to decide what is basic or not? Frequency of use in datasets? Some structural, or deeper ontological criterion? And what is, for example, non-basic, then? Especially that the ontology actually even describes *meta-*data on the company-description datasets and ID systems... which is not information about companies proper.
- p.4, 13-14: “The CBV is published by W3C as a part of public working draft named RegOrg since 2013.” Probably the same RegOrg as mentioned in lines 5-6?
- Section 3.1: The CQs only seem to cover three of the modules; there is no CQ for the Dataset module present.
- p.11, Fig. 3: At the first look it might not be obvious why two apparently related classes, ‘RegisteredOrganization’ and ‘Organization’, are not directly linked – whether by rdfs:subClassOf or by some other link. The authors do not discuss this mystery here, and it only becomes completely clear through the example in Fig. 8 much later. If I understand right, an organization in a registry ‘lives in a different world’ than a ‘general’ organization (classified by that can be, for example, the maintainer of the registry). This is however a strong modeling commitment, which deserves some discussion.
- p.11, 44 – p.12, 1. As regards the use of OWLGrEd: OK, it is a nice tool, but how much does it actually offer in this particular case compared to plain UML?
- p.12, 9-10: “We used the Terse RDF Triple Language (Turtle) syntax as the file format for the ontology.” This is fine, but irrelevant for the paper. Any RDF serialization is just RDF.
- p.12, 33-37: “The ontology uses domainIncludes{schema} and rangeIncludes{schema}, which are polymorphic and describe which properties are applicable to a class, rather than domain{rdfs} and range{rdfs}, which are monomorphic and prescribe what classes must be applied to each node using a property. We find that this enables more flexible reuse and combination of different ontologies”. This is pretty laconic and uses non-intuitive terms w/o defining them. What is the exact meaning of ‘polymorphic’ here? By what mechanism does it lead to more flexible reuse? Is the EBG ontology sufficiently similar in spirit to schema.org (which has been primarily designed as a mark-up vocabulary for search engines) to justify the copying of this pattern? Which other respected ontologies use the ‘...Includes’ versions of domain/range? Where is it described as a best practice? The choice itself might be sound, but not w/o a more elaborate justification!
- p.15, 1-2: “The operational and/or legal registration status of the entity, e.g., whether a company is active or not. There is no globally accepted list of company states.” Probably, ‘of company statuses’?
- p.15, 26-27 vs. 34-35: It seems that the term ‘geographic coordinates’ is used in two different senses: “Least precise geographic coordinates are resolved at the level of a country” (broader sense, incl. full address) vs. “However, to represent geographic coordinates, was used...” (narrower sense: lat + lon).
- p.18, 1 vs. 9-10: “isPartOf: System the identifier is a part of” vs. “The IdentifierSystem class represents a system managed by a publisher (e.g., a register or agency) that is used to issue identifiers to companies.” It looks semantically a bit odd to consider a system to issue identifiers that are at the same time its part. You can view the collection of IDs as a ‘system’, and you can also view the rules for creating those IDs as a ‘system’, but it should not be the *same* system.
- p.18, 14-15: Following up with the previous comment. The properties schema:author and dct:creator are normally used rather interchangeably afaik. Here you use them to make a very specific distinction: one refers to the author/creator of the system of rules, while the other to the author/creator of the IDs *using* this system of rules. In this context I see the stress on the maximal reuse of common properties as counter-productive. You should make clear what the system actually is (a collection of IDs, or the rules for coining them) and probably one new missing property should be introduced, with a lexical semantics clearly distinct from that of ‘creating’ or ‘authoring’. Either that of a subject that *applies* the system (as rules), or that of a subject that *sets up the rules for* creating the system (as collection of IDs).
- p.18, 44-45: “isPersistent: Whether identifiers can be removed from the register (e.g., when a company is dissolved)” To keep the same Boolean polarity, it should be changed to ‘cannot be removed from’, or ‘has to be kept in... (even if the company is dissolved)’. And, similarly, for ‘isImmutable’ in p.19, 1.
- p.19, 4-5: Same bullet repeated twice.
- p.19, 6-7: “isDumb” Why not rather call it ‘isOpaque’? This is much more technical than ‘dumb vs. intelligent’. Even a lexically meaningful ID does not have any intelligence per se, actually, merely some moderately ‘intelligent’ application can make use of it...
- p.19, 11: “isEnumerated: Whether the system has an issuer, and issued identifiers are kept in a database(register).” What does the opposite case look like? An example would help.
- p.19, 21-23: “replacementPattern: Pattern to use together with the validationRegex to normalize identifier values by removing optional decorations.” An example would be nice here, too.
- p.20, Fig. 7: Maybe OK, but just wondering – is “Issues company identifiers within the Atoka company database” the essence of the business activity of SpazioDati, so as to serve as its schema:description?
- p.20, 38-41: “An officer is a natural person (as opposed to a legal person) that has a high-level management role in a company... they typically serve at the will of the company directors, who can fire or replace them.” Does this mean that directors and shareholders are already beyond the ‘basic’ info on a company? This returns me to the general question raised in the very first comment on this list.
- p.21, Fig. 8: The caption is incomplete, it only refers to the OpenCorporates system and not to the official UK system.
- p.22, 46-43: “VOID describes RDF datasets in terms of entities (i.e., number of triples)” No, void:entities counts the entities and void:triples counts the triples.
- p.24, 26-28: “e.g., age and dateOfBirth attributes are connected by the following rule age=year(today) –year(date-OfBirth)” The example is a bit faulty. If someone is born on 31 Dec 2002 and today is 1 Jan 2020, s/he is barely 17, and not 18 as by the rule.
- p.33-35: The demonstration of the use of the ontology in the marketplace should rather be provided through some instructive diagram showing how data is integrated from different sources while solving a particular query, not as a screenshot of an app. (Provided you wished to keep some concrete use case example in the shortened paper. Most of the content of this section should however be removed, as I noted in the review intro part.)
- p.36: The cg: prefix is not even defined in the paper. The text on an extension of the ontology (though, possibly, interesting) is not an organic part of the paper. ONTO-CG is another ontology, after all.
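The off-by-one in the age rule quoted in the comment on p.24, 26-28 can be reproduced with a short sketch. The function names below are illustrative assumptions and do not come from the paper or the ontology; the first function mirrors the quoted rule, the second subtracts one year when the birthday has not yet occurred in the current year.

```python
from datetime import date


def naive_age(born: date, today: date) -> int:
    """Age as in the quoted rule: year(today) - year(dateOfBirth)."""
    return today.year - born.year


def age_in_full_years(born: date, today: date) -> int:
    """Age in completed years: subtract one if the birthday is still ahead this year."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


# The reviewer's example: born 31 Dec 2002, evaluated on 1 Jan 2020.
born = date(2002, 12, 31)
today = date(2020, 1, 1)
naive_result = naive_age(born, today)
full_years_result = age_in_full_years(born, today)
```

With these dates the quoted rule yields 18, while the person has completed only 17 years, which is the discrepancy the comment points out.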

Language issues:
- p.3, 31-32: “Several ontologies and data models were developed in the literature”. Either ‘developed’, or ‘described in the literature’, but not both. Models are not developed by writing papers.
- p.3, 34-35: “economical classifications of companies”. Rather, ‘economic’?
- p.4, 11: “public organizations, and criterion” Rather, ‘criteria’?
- p.15, 7-8: “that covers jurisdictions NO, GB, BG and statuses from data providers OpenCorporate, and SpazioDati and also from LEI.” Probably, ‘the jurisdictions’, and some connectives and commas should be fixed in this sentence, too, afaik.
- p.36, 30-31: “scholl” School?

Summarizing the evaluation along the standard dimensions for SWJ onto descriptions:
(1) Quality and relevance of the described ontology: solid artifact (with just a couple of likely, relatively minor, flaws), though no major research challenge addressed.
(2) Illustration, clarity and readability of the describing paper: solid, but contains additional parts that do not fit well to the scope of an onto description.

For me, the overall recommendation is obvious: major revision, consisting in removing the low-relevance parts (replacing the long ‘use case’ summaries with very short descriptions, plus possibly a table) and fixing most of the minor-to-medium severe issues in the remaining text (and, if adequate, in the ontology, too, in a few cases).

Review #2
Anonymous submitted on 04/Sep/2020
Minor Revision
Review Comment:

This manuscript was submitted as 'Ontology Description'.

(1) Quality and relevance of the described ontology (convincing evidence must be provided).

The relevance of interoperability in company information is high, and the paper does a good job in highlighting that importance. Particularly, the ontology scope addresses national registers but also other non-regulated potential sources of information, making the ontology useful across a broad range of applications. The scenarios and use cases are good examples of the relevance of the ontology.

The point of departure of the proposal of the ontology is the need to reconcile different existing data models. In that direction, it is properly proposed as a kind of “integration” ontology, with a very specific purpose, that I have used to judge its quality.

The ontology design process has incorporated significant inputs from several organizations that have served the authors as some kind of “test” data. This is in my opinion good practice, since it allows a contrast of the proposed ontology with real data, eventually discovering flaws or inadequacies.

Section 2 provides a good overview to key previous ontologies and schemas. However, in my opinion the state of the art should also in this case provide a more detailed and clear distinction of the limitations of those previous works that motivate the proposal of the new ontology. This might be included in section 2 as additional comments, or along section 4, highlighting the novelties w.r.t. the previous models and ontologies, so that the innovative aspects are clear. Without such analytical comment, it is difficult to identify the key points that the ontology is providing and that were not possible to model with previous ones.

Section 5 discusses relevant examples, but in some cases, it reports on architectural or interface issues that in my opinion are not very relevant to an ontology paper. These details not providing key insights on the use of the ontology itself might be removed if space is needed for more relevant content. Again, focusing the examples in a few, key highlighted novelties or salient features of the proposed ontology (w.r.t. the state of the art) would enhance the value of the paper and more clearly expose the quality of the work done.

(2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

The paper is clearly written and is free of jargon. English is very readable and the style adequate to a research journal.

An area of improvement is the conclusions section, which in its present form is more a generic summary of the paper. In my opinion, that conclusion section should highlight the main key benefits of the proposed ontology and, if needed, the most important ontology elements, structures or patterns that set apart the new ontology from the previous work that was surveyed.

Review #3
Anonymous submitted on 17/Sep/2020
Major Revision
Review Comment:

The paper presents the euBusinessGraph ontology, whose aim is to harmonise and integrate company information across different states. The ontology artefact and the paper describing it have merits, however they both suffer from a number of issues that would need to be addressed before the paper is ready for publication.

The ontology is described in detail together with the methodology used for building the model. The ontology is built in a bottom-up fashion, where four main data sources are analysed in order to define the scope and the requirements of the ontology, which mainly focus on the representation of information in four main modules: registered organisation, dataset, officer and identifier systems. It would have been useful to have more granular requirements and classify them into functional and non-functional (whilst not necessarily advocating the use of the NeOn methodology, such separation of concerns is useful to clarify the concepts to include in the different ontology modules).

The high-level ontology model makes sense and appears to capture the relevant information to satisfy the requirements identified for the ontology and answer the sample competency questions. However, there seems to be a disconnect between the requirements identified in section 3.1, the competency questions and the overall stated goal of harmonising and integrating data about business organisations.

When looking at the models for each of these modules identified, they seem to be very close to the original data representation and have a very legacy data centric representation. In particular, many of the classes are described by boolean flags (e.g. isStartup, isStateOwned etc). However, this type of modelling can potentially lead to long term problems in coupling the data with the specific application the data should serve. Boolean data does not offer enough semantics, but its meaning is linked to the application it describes.

Some redundancy is included in the ontology, e.g., the RegisteredOrganisation class has a number of object properties that support different classification types, but the classification is described through a specific object property and the corresponding free text. This could lead to potential integrity problems, and it is not clear how this is validated.

Domain and range for the various object properties are not described through rdfs:domain and rdfs:range but through schema:domainIncludes and schema:rangeIncludes. This choice is somewhat puzzling, especially given that these don’t seem to be fully specified. Stating that this is done to support polymorphism is not sufficient, and it would have been useful if there was a discussion of how these notions of domain and range are handled by GraphQL. Similarly, the choice of GraphQL over SPARQL is not discussed in detail in terms of pros and cons.
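The practical difference between the two styles of domain declaration can be illustrated with a minimal sketch. It applies the RDFS entailment rule rdfs2 (if p has rdfs:domain C and s p o holds, then s is of type C) to triples held as plain Python tuples. The ex: property, classes and instance are hypothetical and are not taken from the euBusinessGraph ontology, and the sketch assumes that a plain RDFS reasoner attaches no entailment to schema:domainIncludes.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
SCHEMA_DOMAIN_INCLUDES = "schema:domainIncludes"


def rdfs_domain_entailments(schema, data):
    """Apply rule rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    domains = {}
    for prop, predicate, cls in schema:
        # Only rdfs:domain carries entailment; schema:domainIncludes is skipped.
        if predicate == RDFS_DOMAIN:
            domains.setdefault(prop, set()).add(cls)
    inferred = set()
    for s, p, _o in data:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


# Hypothetical property shared by two classes.
data = [("ex:acme", "ex:name", "ACME Ltd")]

with_rdfs_domain = [
    ("ex:name", RDFS_DOMAIN, "ex:Company"),
    ("ex:name", RDFS_DOMAIN, "ex:Person"),
]
with_domain_includes = [
    ("ex:name", SCHEMA_DOMAIN_INCLUDES, "ex:Company"),
    ("ex:name", SCHEMA_DOMAIN_INCLUDES, "ex:Person"),
]

strict = rdfs_domain_entailments(with_rdfs_domain, data)
loose = rdfs_domain_entailments(with_domain_includes, data)
```

In this sketch, two rdfs:domain statements force ex:acme to be typed as both ex:Company and ex:Person, whereas the schema:domainIncludes variant produces no inferred types at all. That may be what the authors mean by "polymorphic", but the paper would need to say so explicitly.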

One potential problem with respect to the reusability of the given ontology is that all the requirements and the competency questions are determined bottom up, and hence are heavily influenced by the dataset used. It is not clear how reusable this ontology would be in other countries, e.g. USA and how extensible the ontology is to include new types of registered organisations.

The issues above are not necessarily the result of bad ontological design, but given that they could give rise to problems, this way of modelling the domain should be justified in more detail.

The paper is generally well written, but sometimes suffers from irrelevant information (e.g., the specification of the RDF syntax used as file format for the ontology is not really relevant when describing the ontology modelling choices).
The related work section is quite comprehensive; however, it does not mention the ontologies reused when building the euBusinessGraph ontology.

In summary, the ontology model presents some issues that will either need to be addressed or will need to be fully justified in the ontology development description, and the paper’s presentation can be tightened.