How Graph Data and Ontology May Add Value to Transactional Data

Tracking #: 3130-4344

Minchul Lee
Boonserm Kulvantunyou
Scott Nieman
Bongjun Ji
Nenad Ivezic
Hyunbo Cho

Responsible editor: 
Guest Editors SW for Industrial Engineering 2022

Submission type: 
Full Paper
In the era of the data-driven economy, businesses seek to make data smarter, that is, easier to analyze for insights, while dealing with numerous sources of data. One data architecture for integrating data from these sources is the data lake. A data lake captures all the data crisscrossing the enterprise in a single repository for easy and low-cost access in real time or near real time, without actively syncing data from the sources. In recent years, our industry partners, who currently use traditional structured data standards in XML or JSON, have posed questions about the value of graph data and ontology. This paper, primarily targeting industry practitioners such as enterprise architects and IT managers, therefore investigates how transactional data stored in the data lake may be integrated and queried for business insights using XML data versus graph and ontology data. The assumptions are that the raw data follow a common information exchange standard in XML syntax and that the storage behind the data lake is a NoSQL database. Three experiments were conducted on logistics data: 1) using only the NoSQL native API to get to the query of interest; 2) translating raw XML data into graph data without introducing additional formal semantics beyond what is already available in the corresponding XML schemas, and using SPARQL to get to the query of interest; and 3) introducing a reasoner and additional formal semantics via an OWL ontology into the architecture, and using OWL DL Query or SPARQL, which is based on the ontology, to get to the query of interest. Each experiment incurs increasing pre-processing effort; their differences and value are analyzed and discussed.

Solicited Reviews:
Review #1
By Davide Lanti submitted on 26/Jun/2022
Review Comment:

*** Overview ***

This paper investigates three different approaches of querying the data in a data lake: (1) by directly using the extracted raw XML data as is, (2) by transforming the raw data into a more abstract RDF/RDFS representation, and (3) by coupling (an adaptation of) the RDF/RDFS representation with an OWL/SWRL domain ontology.

*** Overall Impressions ***

The paper is essentially one big running example about a logistics use case, describing three different ways of dealing with the information stored in a data lake. I appreciated the fact that the three scenarios are described in full detail; however, I miss the general picture of the work. In other words, it is extremely unclear to me what the actual contribution of the paper is.

In fact, the work does not propose a framework, does not discover new formal results, does not propose a system or an architecture, does not propose a methodology, and although the authors named the three parts of the running example "experiments", they are really just examples and not really empirical evaluations.

The authors use the running example to substantiate some claims (e.g., that the XML data is more accessible to general programmers than RDF). I feel that this "example-driven" methodology is not adequate to substantiate the claims, and that a proper empirical evaluation grounded on real-world data is really needed here.

The example itself is not devoid of problems. In fact, the way RDF, RDFS, OWL and SWRL have been applied is sometimes not so direct (see Detailed Comments below), and the authors do not really justify their translation rules. I provide an incomplete list of the issues I found in the "Detailed Comments" paragraph below.

Summing up, I do not believe this submission falls in the scope of the SWJ, category "full paper", as it looks more similar to a "tutorial" on the use of SW technologies rather than a scientific contribution.

*** Originality ***

I do not see anything original. The literature is full of approaches for mapping XML to RDF, none of which is referred to in the manuscript.

*** Significance of the Results ***

This work contains some considerations about a running example, that I would not consider scientific results.

*** Quality of Writing ***

At the level of language, I would say the paper is not badly written. However, there are some layout choices which I found annoying as well as several typos in the technical content:

- The use of one-page sub-figures (without a caption saying what the main Figure actually is). E.g., Figure 7 or Figure 12.
- The plain invention of new terms used as if they were standard terminology. E.g., "data graph", "OWL Application Ontology", "OWL Schema", "OWL instance", "OWL Mapping", etc.
- The inclusion of figures that do not contribute to the discussion. E.g., Figure 11 is just a screenshot of a piece of code that does a System.out.println().
- The use of a \footnote right after a table reference. E.g., Table 2^2, Table 3^3 and Table 7^4.
- A general sloppiness when presenting technical content: just taking Figure 10 as an example, in the left part, which should describe the XML schema of Figure 7(a), I see that some attributes do not correspond to those in the XML schema (e.g., LineNumberID or the "ID" attributes), others are missing (e.g., ShipToPartyReferenceeID should be ShipToPartyReference with a nested attribute ID), and others are badly indented (e.g., the children ID and ShipmentRequestOrder of ShipmentRequest).

*** Supplementary Material ***

Not applicable.

*** Detailed Comments ***

Here I provide a non-exhaustive list of the issues I found.

- This is the first time I have seen an abbreviated name (the "Serm" in parentheses) appear in a list of authors; I wonder whether this is an accepted practice.
- Page 2, third paragraph: provide a citation for RDF schema.
- Page 2, third paragraph: the term "data graph" is not standard, nor introduced. Replace it with RDF graph.
- Capitalize all section references: section 3 -> Section 3, section 5 -> Section 5, etc.
- Page 4, first line: "object in a triple is a resource and has a Uniform Resource Identifier" -> technically incorrect, as an object might also be a literal or a blank node.
- Figure 7(a): Bad indentation for element "PartyReferenceType"
- Figure 7(b): "Carrir" -> "Carrier"
- All XML listings contain a lot of unnecessary content (e.g., all the "BaseType" types). I would simplify this, since they do not serve the example.
- Figure 12(a): this figure is neither a shipment request message nor a carrier request message. To avoid confusion, I would explicitly say that this is only a portion of a carrier request message.
- I find the translation from XML to RDF a bit unnatural, and I would have liked to see some justification for it. For instance, why does your translation always introduce URIs, even for values? E.g., consider the attribute "sequenceNumber=2" in Figure 12(a). According to your translation, shown in Figure 12(c), this XML information is rendered in RDF through an "mc:hasSequenceNumber" property which, instead of being just a "data property" having value "2" (an RDF literal), is an "object property" connecting to a special URI called "#sequenceNumber_2", an instance of a class "#sequenceNumber", which is in turn connected through the property "rdf:value" to the literal value "2". I am sure the authors had a good reason for proceeding the way they did; however, I fail to see the point, and the justification is not provided.
- Section 5.3: there is a massive use of "OWL xxx" terms (used as if they were standard nomenclature) which makes the first paragraph very hard to read.
- Table 3, header: remove the "or Property", because properties are handled in the lower part of the table, which has its own header.
- Table 3, Line 2: the "exactly 1" restriction appears to be wrong, since from reading the XML I understand that a route might have more than one shipping item.
- Table 3: you use the term "ShippingItem", and then also say that a "ShippingItem" can refer to *a number of* items. I therefore find the name extremely confusing. Also, why not keep the same name used for the XML and RDF experiments, which was "ShipmentRequestOrderLine"? Changing the names of things across the different experiments only adds confusion.
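
To make the comment on Figure 12(c) concrete, the two encodings of "sequenceNumber=2" can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' translation code: the XML fragment, the namespace URIs behind "mc:" and "#", the subject name "line_1", and both helper functions are hypothetical. Triples are modelled as plain Python tuples, with a small Literal class marking literal objects.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Marks an RDF literal, as opposed to a URI given as a plain string."""
    value: str


# Hypothetical namespaces; the manuscript's actual URIs are not known here.
BASE = "http://example.org/data#"
MC = "http://example.org/mc#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical fragment standing in for the line shown in Figure 12(a).
XML = '<ShipmentRequestOrderLine sequenceNumber="2"/>'


def property_uri(attr_name):
    # sequenceNumber -> mc:hasSequenceNumber
    return MC + "has" + attr_name[0].upper() + attr_name[1:]


def node_valued_encoding(subject, attr_name, value):
    """Encoding described in the comment: the value gets its own URI node."""
    node = BASE + attr_name + "_" + value
    return [
        (subject, property_uri(attr_name), node),   # object property
        (node, RDF + "type", BASE + attr_name),     # node is typed by a class
        (node, RDF + "value", Literal(value)),      # literal hangs off the node
    ]


def literal_valued_encoding(subject, attr_name, value):
    """Alternative: a single data property whose object is the literal."""
    return [(subject, property_uri(attr_name), Literal(value))]


line = ET.fromstring(XML)
subject = BASE + "line_1"  # hypothetical subject URI
seq = line.attrib["sequenceNumber"]

node_triples = node_valued_encoding(subject, "sequenceNumber", seq)
literal_triples = literal_valued_encoding(subject, "sequenceNumber", seq)

for triple in node_triples + literal_triples:
    print(triple)
```

The node-valued form needs three triples and one extra hop in every query that reads the value, whereas the literal-valued form needs a single triple; this is the cost for which a justification is requested.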

Review #2
By Luis Ramos submitted on 13/Jul/2022
Review Comment:

The paper is very well written and presents the procedure followed in some experiments based on very common use cases for semantic and linked data.
The cases of implementing NoSQL, RDF and OWL, and their comparison, are very clear. However, these cases, the treatment procedure and the expected results are also very well known; thus, although the paper might be very interesting for a certain audience, it is not suitable for publication in the SWJ because of its lack of novelty.
It presents lessons learned, which is very good, but the paper should be submitted to a more application-oriented journal, perhaps as a review paper. For that, a proper review of previous work is also necessary, as there is a very large body of related research.
If the authors want to present the paper again after a proper review, I would also recommend a detailed review of previously applied methodologies, so that they can show where their methodology is better rather than limiting themselves to an anecdotal narration.