Linked Data in Enterprise Information Integration

Paper Title: 
Linked Data in Enterprise Information Integration
Authors: 
Philipp Frischmuth, Jakub Klímek, Sören Auer, Sebastian Tramp, Jörg Unbehauen, Kai Holzweißig, Carl-Martin Marquardt
Abstract: 
Data integration in large enterprises is a crucial but at the same time costly, long lasting and challenging problem. While business-critical information is often already gathered in integrated information systems, such as ERP, CRM and SCM systems, the integration of these systems itself as well as the integration with the abundance of other information sources is still a major challenge. In the last decade, the prevalent data integration approaches were primarily based on XML, Web Services and Service Oriented Architectures (SOA). However, we become increasingly aware that these technologies are not sufficient to ultimately solve data integration challenges in large enterprises. In this article, we argue that classic SOA architectures may be well-suited for transaction processing, however more efficient technologies are available today that can be employed for enterprise data integration. In particular, the use of the Linked Data paradigm for integrating enterprise data appears to be a very promising approach. Similarly, as the data web emerged complementing the document web, data intranets can complement the intranets and SOA landscapes currently found in large enterprises. In this paper, we explore the challenges large enterprises are still facing with regard to data integration. These include, but are not limited to, the development, management and interlinking of enterprise taxonomies, domain databases, wikis and other enterprise information sources. We discuss Linked Data approaches in these areas and present some examples of successful applications of the Linked Data principles in that context.
Submission type: 
Survey Article
Responsible editor: 
Krzysztof Janowicz
Decision/Status: 
Reject and Resubmit
Reviews: 

Review by Dave Kolas

The paper "Linked Data in Enterprise Information Integration" presents an overview of the ways in which Linked Data principles can be applied to enterprise intranets. The paper covers six problem areas:

- enterprise taxonomies
- XML schema governance
- Wikis
- Web portal and intranet search
- Database integration
- Enterprise single sign-on

Overall this paper is well written and does a good job of describing applicable Linked Data technology to the various problem areas. Some of the problem area sections could be improved by an increased focus on the benefits of the linked data approach.

- Enterprise taxonomies

This is the strongest of the six areas in the paper. The paper does a good job of describing the benefits of using linked data, describes technologies that can be used to achieve this benefit, and touches on some of the processual challenges of implementing the technologies. The technology discussion could, however, benefit from covering alternatives to SKOS such as RDFS and OWL, since SKOS is not necessarily sufficiently expressive for all domains.

- XML schema governance

This section makes a less compelling case for the use of linked data technology. The focus is on using linked data to create a conceptual schema to which the enterprise's various XML schemas can be linked, for ease of maintenance. However, often the reason that so many XML schemas exist is that a common definition of the repeated concepts has not been declared. It is not clear, then, how creating a central conceptual schema is organizationally easier than forcing increased schema reuse.

Also, the paper describes the benefits of keeping the XML schemas in an RDF repository as being able to execute SPARQL queries on them, and collaboration via semantic wikis. It should be explicitly stated how the users benefit from being able to execute SPARQL queries, and why collaboration via semantic wiki is superior to collaboration in some other type of XML schema repository.

- Wikis

This section could also use enhancement in the benefit section. The description hints at how the semantic data could be used by other applications, but this should be more explicit. The section correctly identifies as a primary challenge getting users to enter semantic structured data.

- Web portal and intranet search

This section is focused on enhancing internal and external search using linked data semantics. A good example is given of a very structured type of question which generally cannot be answered by keyword intranet search. The existing technology cited in this section, however, is just a prototype built within a particular organization; this section could use information about, and citations of, existing semantic search systems.

- Database Integration

This section usefully contrasts a federation approach to database integration with an ETL approach. However, it is weak in showing the benefit of semantic federation versus the more established relational federation. There are compelling reasons for using semantics; however, they are not described in this paper.

- Enterprise sign-on

This section focuses on using WebID to handle authentication within an enterprise. Again, more clarity on the benefit to the enterprise of using such a system is needed. Also, if I understand correctly, a user's certificate is installed in their browser for authentication purposes, which may be problematic in enterprises where a user makes use of many physical computers. While WebID seems particularly useful on the internet, where a user will navigate to new websites for which no common authentication scheme is shared, in an enterprise there is a built-in trusted authority for assigning login information (the enterprise itself).

This paper fits the theme of the call very well and should be accepted. It could be made stronger by highlighting the benefits of using linked data in some of the areas, and a broader look at possible solutions in others.

Review by Tom Heath

This is a survey paper submitted in response to the "Special Call for Surveys on Application Areas of Semantic Technologies". As the title says, it deals specifically with the use of Linked Data to address information integration challenges in the enterprise.

The paper is overall fairly readable, with just a couple of syntactic issues/typos. It is also structurally sound, having laid out the various challenges the authors perceive in this domain, and then systematically working through these, giving examples of where a Linked Data approach may add value.

Despite these positives, however, I can't recommend acceptance of the paper to SWJ, due to issues of substance, balance and depth of analysis.

In places the paper reads more like a white paper or position paper than a scientific survey. There is a degree of rigour or substance lacking in the analysis that would be passable for an industry white paper but isn't acceptable in a scientific journal. The paper doesn't read like a balanced survey but more a manifesto for Linked Data within the enterprise. While I wholeheartedly agree with the sentiment, the arguments put forward and the evidence used to support them are wholly inadequate. The tone has more a flavour of a position paper, but without any radical proposition or compelling argument. The use of Figure 1 to depict the authors' "vision" reinforces the sense that this paper is misaligned to the call. Survey papers by definition should be analytic and relatively neutral, rather than setting out a vision and finding extensive references to support it.

To accept this as a survey paper it would need to show much greater rigour and balance in the analysis. There seems to be a significant over-representation of the authors' own tools in the discussion, which is not appropriate for a survey paper. On the other hand, at the bottom of page 4, the shortcomings of just one system (MS Sharepoint's Term Store) are used to dismiss an entire class of product.

With these issues in mind, I don't see that the paper delivers on criteria 1) and 2) stated in the CFP. Re 4), while the importance of Linked Data in the enterprise cannot be overstated, I'm not convinced that the material covered is of sufficient importance to the broader SW community for the paper to be published.

On some more specific points:

- Various acronyms are used without being defined on first usage (ERP etc).

- Figures 1 and 2 add little to the discussion.

- There is no reference given for the Linked Data principles in 2.1.1. This can't be treated as assumed knowledge -- practitioners coming across the paper without existing knowledge would likely want background material here.

- The inclusion of a Related Work section in a survey paper is odd. The entire paper should be about related work! As it stands, this section is so brief as to be meaningless.

Review by Sen Xu

Linked Data in Enterprise Information Integration, by Frischmuth, P. et al., 2012, targets a crucial challenge faced by large enterprises: data integration. The paper first proposes a new infrastructure, the Enterprise Data Web (EDW), which updates the traditional enterprise IT system landscape by adding a Linked Data-supported enterprise knowledge base. Then six crucial areas, identified as "Enterprise Taxonomies", "XML Schema Governance", "Wikis", "Web Portal and Intranet Search", "Database Integration" and "Enterprise Single Sign-On", are discussed in terms of current practice (see detailed comment 6), the Linked Data approach, and the challenges faced when applying the Linked Data approach. Experimental implementations of the proposed Linked Data approach at Daimler are presented. The authors demonstrate the considerable potential of the Linked Data approach, while recognizing legitimate challenges in shifting from the current XML, Web Services and SOA-based architecture to a Linked Data-based architecture.

The value I see from this paper comes from the following fields:
1. Survey of current practice. For the six areas, this paper provides a detailed overview of the state of the practice. It makes clear to future studies what is lacking in current practice, which should stir innovation to overcome these drawbacks.
2. Demonstrating a prototype of Linked Data. Concrete examples of real-world Linked Data implementations are very valuable, and are frequently missing from papers in this area.
3. Recognizing the challenges in shifting from a traditional infrastructure to a Linked Data infrastructure. The challenges, such as defragmentation and decentralization in taxonomies, the steep learning curve of using a Semantic Wiki, and performance concerns in shifting from RDBMS to Linked Data, are very valid concerns. Linked Data is not a panacea; it should not be regarded as a solution to everything, and this paper exemplifies the drawbacks with concrete examples. The paper provides pointers on which problems Linked Data development should focus on solving.

My reservation on this paper comes from the following fields:
1. As pointed out below, there are issues in word choice (incorrect use of the phrase "state of the art", see detailed comments below), writing style (see suggestion on writing) and readability of the Figures. After revision these problems should go away easily.
2. A minor issue, which applies to all industry-oriented research, is that some arguments are specific to certain techniques rather than to the general technology, e.g., the argument on WebID and OpenID. There are many reasons why OpenID is not widely used (many of which may stem from commercial concerns, and some even from user psychology: "why have one more thing to maintain?"). The fact that WebID uses X.509 certificates should not be the primary factor that distinguishes one from the other. However, this doesn't undermine the value of this paper in the short term.

In sum, this paper provides valuable insights on the path to introducing Linked Data into enterprise applications, specifically data integration. The topics are well segmented, well argued, and well supported with concrete evidence. It fits neatly with the journal's theme and I vote for acceptance with minor revisions.

==detailed comments==
1. Page 1, paragraph 1. "For example, it is estimated that at Volkswagon that..." citation needed for both examples (Volkswagen and Daimler).
2. Page 2, paragraph 1. "In particular, the overhead associated with SOA is still too high for rapid [...]" How do the proposed alternatives (such as ontology-based ones) solve the overhead issue?
3. What makes an SOA architecture well suited for "transaction processing" and not suited for "data integration"? Make this explicit.
4. Page 2, 4th paragraph. OpenCorporates now has over 40 million corporations; update the claim "more than 50,000 corporations". Double-check their website at the time of resubmission, as the number may have increased.
5. Figure 1's graphics look like two figures from two different decades stitched together. Some effort could be made to make the graphics more harmonious. The LOD cloud is not readable at all. Some generalization might help.
6. Page 3, Section 2, "Finally, we describe the challenges that need to be addressed to make the transition from the current state of the art to the Linked Data approach feasible.": "state of the art" means the most advanced, "cutting edge" technology, which in this case is Linked Data. Compare this with "state of the practice", meaning common/mainstream/current practice. I think you meant "transition from the current mainstream practice to the Linked Data approach feasible." The same applies to the title of Section 2.1.1, the table header in Table 1, and all the section titles of the form "2.*.1. State of the Art": "state of the art" should be changed to "state of the practice" or "current practice" throughout.
7. Figure 2. Bad choice of color. The colors are too bright and don't stand out from each other or from the (white) background. Refer to Bertin, J. (Ed.), Semiology of Graphics: Diagrams, Networks, Maps, University of Wisconsin Press, 1983.
ColorBrewer (http://colorbrewer2.org/), although targeted at cartography, is also a good resource for choosing readable color sets.
If the color doesn't have a clear meaning, or there is no clear reason for the Figure to be in color, use black and white.

Regarding the content of Figure 2: the four crucial data integration challenges out of the six identified challenges should be highlighted more explicitly. (From my understanding, this refers to the four squares among the six elements inside the circle.) There is no mention of the four challenges in the text. (Or are you referring to all six?)

8. Table 2. "Enterprise Single Sign-On", "consolidated user credentials, centralized SSO", the last SSO (which means Single Sign-On) could be deleted.
9. Page 4, last paragraph, the critique of "Microsoft SharePoint": "However, there are some strong limitations to this approach. There is very restricted multilingual support - separate SharePoint language packs need to be installed for each language to be used in the taxonomy."
Multilingual support is a challenge that exists in both the traditional taxonomy approach and the Linked Data approach. The fact that Linked Data makes it possible to generate a multilingual taxonomy with little additional effort only means that the challenge becomes easier to solve. Translating phrases correctly still takes time, and translating sentences or paragraphs is more difficult. This distinction should be made clearer.
10. Page 7. Section 2.2.1, paragraph 2: "Even though XML is available since 1998", citation needed.
11. Page 8. Section 2.3.1, another widely used wiki is MediaWiki: http://www.mediawiki.org/wiki/MediaWiki. Although not specifically targeted at enterprise use, it is still widely used in corporate environments. Functionality-wise, it is comparable to the ones mentioned (http://www.wikimatrix.org/compare/TWiki+MediaWiki+Confluence+TracWiki), and it is the foundation for Semantic MediaWiki, which is given as a Linked Data example. It should be mentioned together with the other wiki applications.
12. Page 12. Section 2.5.1, "This has the effect, that only key data sources and thus only a small fraction of the RDBMSes in a typical enterprise are integrated." awkward wording.
13. Page 13. The client certificate used in WebID authentication is also used in current enterprise SSO systems. This is hardly new, and it doesn't solve the "password lost" scenario you mention as a flaw in OpenID. If the certificate is lost, the user has to apply for or generate a new one, which is similar to recovering a forgotten password. The argument for the Linked Data approach should focus on safety and accessibility (same certificate for all, in case of changing job/temporary access for contractor, etc.), although the benefit, in my opinion, is not very clear.
15. Page 14. Section 3 seems like it could be integrated into Section 1.

==Suggestion on Writing==
1. Using acronyms is fine, but the first occurrence should usually be accompanied by the full name (if the acronym is not well known). OEM, XML, RDF, URI, HTTP can be used as is; ERP, CRM and SCM should have full names. Below is a list of acronyms frequently mentioned in this paper for which I suggest including the corresponding full name:
ERP
CRM
SCM
EKB
ETL

2. Be consistent in citing websites. Citations that include only the website URL exist in Section 2.3.1; a footnote consisting only of a website URL exists in Section 1, page 2. Decide on one way to cite website URLs and stick to it.

3. A general suggestion for lists: short titles make a list more readable. For example, in Section 2.1.2 Linked Data Approach, the list of six benefits could have short titles like:
1. Unique identifier that is highly accessible: since terms[...]
2. Enables cross-boundary collaboration: [...]
3. Hierarchical ordering comes for free: [...]
4. Easy to deduplicate: [...]
5. Multilingual support: [...]
6. Easy for re-use: [...]
Same writing technique could be applied to Page 9, OntoWiki's list of benefits.

==Future Research==
There are some further questions that I'd like to see discussed, or in future research:
1. On Page 5, the proposed solution for "enterprise-specific terms [...] not available via DBpedia" is to use the keyword extraction service FOX. FOX, short for "Federated knowledge extraction framework", is a simple tagging tool for plain text. It doesn't do well with regard to disambiguation. For examples of ambiguous terms, see the "may also refer to" pages on Wikipedia: http://toolserver.org/~dpl/disambig_links.php?limit=500&offset=0. I tested a paragraph with ambiguous terms in FOX and it didn't turn out well:

"China is a town in Kennebec County, Maine, United States. The population was 4,106 at the 2000 census. China is included in the Augusta, Maine micropolitan New England City and Town Area."
The annotated "China" refers to the country rather than the correct referent, the Town of China, Maine.
"
[] a ann:Annotation , scmsann:LOCATION ;
scms:beginIndex "0"^^xsd:int , "103"^^xsd:int ;
scms:endIndex "5"^^xsd:int , "108"^^xsd:int ;
scms:means ;
scms:source ;
ann:body "China"^^xsd:string
"

Another test on "Georgia" turned out even worse:
"Georgia is a state located in the southeastern United States. It was established in 1732, the last of the original Thirteen Colonies.[4] Named after King George II of Great Britain,"

"
[] a scmsann:PERSON , ann:Annotation ;
scms:beginIndex "154"^^xsd:int ;
scms:endIndex "160"^^xsd:int ;
scms:means ;
scms:source <http://ns.aksw.org/
"
Toponym "Georgia" is wrongly recognized as person "George".

The above testing was done on http://139.18.2.164:4444/demo/index.html, the demo page of http://aksw.org/projects/FOX

Of course, the performance of FOX could always be improved and shouldn't undermine the value of your paper; but for building a framework as you suggest in this paper, have you done some evaluation of the performance of the proposed tools? By performance I mean both quality (precision/recall) and speed (e.g., how fast they process a sample test set of 1000 documents). If drawbacks of some tools are recognized, can additional components be built into the framework to compensate for them? This would be an interesting direction for future research.

2. The above examples raise another concern. Linked Data is good at linking different terms to the same entity (e.g., by the same URI), which works out well for multilingual support; however, how does Linked Data deal with ambiguous terms?
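
To make the kind of quality evaluation I have in mind for both points concrete, here is a minimal sketch in Python. It is my own illustration, not part of FOX or of the paper: the `Annotation` type and the `precision_recall` function are hypothetical names, the predicted annotations are made-up tool output, and the DBpedia URIs are assumed for illustration. Only the character offsets are taken from the "China" test above. An annotation counts as correct only if both the span and the URI match the gold standard, so linking an ambiguous term to the wrong entity is penalized.

```python
from typing import NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """One entity mention: character span plus the URI it was linked to."""
    begin: int
    end: int
    uri: str


def precision_recall(gold: Set[Annotation],
                     predicted: Set[Annotation]) -> Tuple[float, float]:
    """Exact-match precision and recall over (span, URI) annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hand-made gold standard for the "China, Maine" paragraph (URIs assumed).
gold = {
    Annotation(0, 5, "http://dbpedia.org/resource/China,_Maine"),
    Annotation(103, 108, "http://dbpedia.org/resource/China,_Maine"),
}

# Hypothetical tool output: the first mention is linked to the country.
predicted = {
    Annotation(0, 5, "http://dbpedia.org/resource/China"),
    Annotation(103, 108, "http://dbpedia.org/resource/China,_Maine"),
}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With this made-up data, one of the two predicted annotations matches the gold standard, so both precision and recall come out at 0.50. A real evaluation would of course need a much larger annotated sample and probably a more lenient span-matching rule.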

The above are open-ended questions, not critiques; they aim at expanding on research areas that originate from or are mentioned in this paper.

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...
