Reengineering application architectures to expose data and logic for the web of data

Tracking #: 1681-2893

Authors: 
Juan Manuel Dodero
Ivan Ruiz-Rube
Manuel Palomo-Duarte

Responsible editor: 
Aidan Hogan

Submission type: 
Tool/System Report
Abstract: 
This paper presents a novel approach to reengineer legacy web applications into Linked Data applications, based on knowledge of the architecture and source code of the applications. Existing application architectures can be provided with Linked Data extensions that work at the model, view, or controller layer. Whereas black-box scraping and wrapping techniques are commonly used to add semantics to existing web applications without changing their source code, this paper presents a reverse engineering approach that enables the controlled disclosure of the internal functions and data model as Linked Data. The approach has been implemented for different web development frameworks. The reengineering tool is compared with existing Linked Data engineering solutions in terms of software reliability, maintainability and complexity.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Paul Groth submitted on 04/Aug/2017
Suggestion:
Major Revision
Review Comment:

This paper describes an approach for reengineering Model View Controller (MVC) applications such that they expose Linked Data. The approach, named EasyData, focuses primarily on web applications. Summarizing the approach, the idea is to modify the Model of a typical web application (typically implemented by an Object-Relational Mapper) so that it also outputs data according to web vocabularies, to modify the View so that data is presented with RDFa / Microdata, and to modify the Controller such that the APIs are Linked Data compatible.
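
To make this summary concrete, here is a minimal Django-style sketch of the Model-layer idea, assuming a hypothetical Project model mapped to schema.org terms; it illustrates the general pattern only and is not the actual EasyData API.

```python
# Hypothetical sketch: an ORM model that can also emit its data as JSON-LD
# using a schema.org mapping. Illustrative only; not the EasyData code.
import json

from django.db import models


class Project(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    homepage = models.URLField(blank=True)

    # Assumed mapping of model fields to vocabulary terms.
    JSONLD_CONTEXT = {
        "@vocab": "http://schema.org/",
        "homepage": "url",  # map the local field name to the schema.org "url" term
    }

    def to_jsonld(self, base_uri="http://example.org/projects/"):
        """Serialize this instance as a JSON-LD document."""
        return json.dumps({
            "@context": self.JSONLD_CONTEXT,
            "@id": f"{base_uri}{self.pk}",
            "@type": "CreativeWork",
            "name": self.name,
            "description": self.description,
            "homepage": self.homepage,
        })
```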

To help developers perform reengineering, two packages were developed for two popular web frameworks: Ruby on Rails (EasyData_Rails) and Django (EasyData_Django). In terms of evaluation, the Rails package was applied to Redmine (redmine.org), an open source project management application. Secondly, software metrics were calculated for EasyData_Django and compared to 5 other software packages (e.g. Stanbol, D2Rq) that are also designed to help create Linked Data exposing applications.

First, I think this is an important problem to address. It's not always straightforward to make applications Linked Data compatible, and providing packages that focus on common development environments is a good approach. The authors do a good job of defining the research methodology they are using from Oates. But I would have liked to see more details of what each of the steps actually requires. Adding a paragraph describing what each of these steps requires would be helpful.

There are two areas where the paper/tools need to be improved.

1) Evaluation
I liked the approach of using a complex case study that's open source, but the outcome of the application to Redmine was not shown. What did the resulting project management software do? Figure 3 shows a service API, but the namespace is published under example.org. It would be good to provide a place to download the updated version of the software, and to include screenshots in the paper showing what the results of applying EasyData look like.

Again, the authors provide a unique approach to evaluation with the application of software quality metrics. I really liked this approach, but it's unclear why it validates the EasyData reengineering approach; it just says something about the quality of the EasyData software. While an interesting finding, the link is not made clear. Also, because EasyData Django is newer code, one could argue that it will show less code complexity and code density than older software such as D2RQ.

What would have been of interest is a comparison of the quality of software constructed using the different approaches, essentially answering the question: does the EasyData approach lead to better software than other existing approaches?

2) Software usability / evidence impact
A key question for a Tools paper is the impact of the tool. Currently, no evidence is provided for large-scale uptake. Another way to measure the potential for use is the quality of the software itself. While the software metrics presented give some indication of that, I think a key part of tool uptake is software usability, which includes documentation. I would have liked to see a small tutorial at the GitHub repo that would allow someone to apply the approach to a small Django app.

Overall, I think there is more work to be done in providing evidence for the tool and methodology’s applicability but there’s definitely something here.

Minor comments:
- “Web of data” —> “Web of Data”
- define LD at first use.
- Look at combining footnotes 1 & 2 as footnote 2 relies on footnote 1’s definition.
- You should provide a link to the redmine application website.
- In the introduction, it is unclear the Linked Data (LD) principles that are being referred to. I assume it’s the classic Berners-Lee design principles https://www.w3.org/DesignIssues/LinkedData.html but Hausenblas is cited.

Review #2
By Christoph Lange submitted on 16/Oct/2017
Suggestion:
Major Revision
Review Comment:

This paper presents EasyData, an approach for reengineering legacy web applications to make them publish linked data. The model, view or controller components of existing applications can be extended to publish linked data. EasyData has been implemented for the popular Ruby on Rails and Django web application frameworks. A comparison with other state-of-the-art linked data publishing platforms w.r.t. several software metrics shows that EasyData performs comparatively well.

Let me first address the specific review criteria for a tool/system report:

(1) Quality, importance, and impact of the described tool or system: Recency is also a quality criterion. EasyData_Rails was last updated almost 5 years ago, EasyData_Django 3 years ago. Other than README files and a few comments in the source code, I don't see much documentation. Looking at the repositories, there are no signs of activity: few contributors, no issues, no forks (other than your own). I don't see evidence that anyone has used EasyData, except your own use of it to add linked data support to the Redmine project management tool. Also, this extension of Redmine just seems to exist as an example within the scope of this article; I don't even see where it can be downloaded.

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool: The paper itself is largely well-written. Section 1 clearly states the importance of adding linked data support to web applications. Section 2 gives a well-structured overview of existing reengineering approaches, with a focus on the model-view-controller (MVC) architecture. One issue here is that MVC is rarely applied in a pure way. Can you also address variants or partial implementations of MVC? In Section 3, the capabilities of EasyData are presented clearly. The evaluation by a proof-of-concept of adding linked data to Redmine, and by a comparison with other approaches, in Section 4 is comprehensible – but, and that's the weakest point of the paper, it could be a lot more convincing – that's basically what I require as "major revision". Except for one minor observation, no lesson learned from Redmine is presented. Also, I'm not convinced by Table 1. Linking instances of a web app's data to linked data resources would have a much larger impact than linking the limited terminology of the software to DBpedia. Regarding the comparative evaluation, it is not sufficiently clear whether the competing linked data reengineering approaches are actually comparable to EasyData w.r.t. the chosen software metrics. Competing tools might have a much broader functionality, and might thus have a larger code base while at the same time suffering from more vulnerabilities. Any observations about EasyData should therefore be interpreted in relation to its small code base.

For further details please find an annotated PDF with detailed comments at https://www.dropbox.com/s/ka2y2yvlpvd94jg/swj1681.pdf?dl=0.

Review #3
Anonymous submitted on 19/Nov/2017
Suggestion:
Major Revision
Review Comment:

# Summary and general comments
The authors present an approach to extend MVC-based web applications and expose their internals as LOD. Specifically, they provide two library implementations - EasyData/Rails and EasyData/Django - that can be used to (i) expose the application data model, (ii) expose the controller methods via a LOD API (described with SAREST). By deeply integrating the Linked Data-related code into existing applications rather than relying on black-box scraping or wrapping techniques, the approach aims to expose the internal data model and functions directly as Linked Data.
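
As a rough illustration of the controller-side idea (a hypothetical sketch, not the EasyData implementation), such an extension could add content negotiation so that an existing controller action also answers Linked Data requests; the view function, template name, and Project model with a to_jsonld() method are assumptions.

```python
# Hypothetical sketch of the Controller-layer extension: an existing Django view
# that also serves application/ld+json. Illustrative only; not the EasyData code.
from django.http import HttpResponse
from django.shortcuts import get_object_or_404, render

from .models import Project  # assumed model exposing a to_jsonld() serializer


def project_detail(request, pk):
    project = get_object_or_404(Project, pk=pk)
    if "application/ld+json" in request.META.get("HTTP_ACCEPT", ""):
        # Linked Data clients receive a JSON-LD representation of the resource.
        return HttpResponse(project.to_jsonld(),
                            content_type="application/ld+json")
    # All other clients keep the application's original HTML response.
    return render(request, "projects/detail.html", {"project": project})
```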

The paper contextualizes the approach well with a good overview of existing approaches towards exposing Linked Data in existing web applications and discusses some of the pros and cons.

Overall, the approach is not groundbreaking as individual developers have long been extending their application's internal code to expose data and API descriptions as LD (using, among others, the very same standards and techniques as those used by "EasyData"). Nevertheless, the two implementations could make it easier for developers to expose applications' internals with less custom code and a paper on the topic can make a valuable contribution as a "tools and systems report".

An obvious disadvantage of the approach is that this tight coupling eliminates the separation of concerns into a separate LD layer, which negatively impacts modularity, reusability, and maintainability. The authors acknowledge these limitations in their discussion. IMO, it will be difficult to convince general web developers to openly expose their application's internal data model and functional structure as LD. Nevertheless, the area on the architectural spectrum (i.e., between converting the source data to LD and scraping the views, neither of which is a particularly satisfying solution) positions the paper in a key area where progress is necessary.

A weak point is that the paper does not highlight the benefits and consequences of the proposed "invasive" approach thoroughly enough. The main argument put forth is that the "invasive" approach provides hooks to implement security and access control. While this argument is clearly important, maybe it could be embedded in a broader vision, i.e., by outlining what would become possible if web applications adopted this approach. Also, a good motivating illustrative example (focusing not just on example code, but also providing a motivation) would be useful.

# Quality, importance, and impact
The provided evidence of impact is somewhat limited. As is, the paper seems more like a proof-of-concept with a few code examples than a report on a mature tool that is in actual use (Redmine is used in the examples, but it was not clear to me whether a complete LD extension of Redmine based on the approach exists and is in active use). In terms of validation, IMO an implementation on a larger scale (and ideally deployment of the approach in production use) would instill more confidence in the quality and impact than the (still useful) comparison of software quality metrics with SonarQube included in the paper. Also, the GitHub pages of the two implementations do not seem particularly active (last commits 3 and 5 years ago, respectively). Overall, importance and impact are a bit unclear.

# Clarity, illustration, and readability
Capabilities and limitations of the tool are discussed. My main concern, however, is clarity, illustration, and readability of the paper; this is partly due to general problems w.r.t. grammar and style, but also partly due to terminological issues throughout the paper. Some of the more important ones are:

*) "reverse engineering": I'm not sure if extending the source code of an application should be considered "reverse engineering"; IMO, "reverse engineering" typically refers to a situation where the source code is not available. Also, the authors seem to use the terms "reverse engineering" and "reengineering" interchangably.. I suppose "reengineering" is the intended meaning.

*) I find the term "legacy web applications" as used in the paper somewhat vague. The authors do not provide a definition, but the phrase "legacy web applications that do not expose their data using the common formats and protocols of the Web of data" suggests that they consider any web application that does not provide Linked Data "legacy", which is a view that is probably not shared in the wider Web development community, where applications built on archaic web technology stacks or standards might be considered "legacy".

Apart from such terminological issues, the general wording should also be more precise and concise. I suggest removing unnecessary filler words that do not contribute to the meaning (see the detailed comments below) and thoroughly revising and rephrasing where appropriate (also see detailed comments below).

# Overall Assessment
Overall, I recommend a major revision of the paper that should strive to more clearly highlight the vision and benefits of the proposed approach, provide evidence of the proposed tool's impact (or at least of its applicability beyond simple examples), and significantly improve clarity and readability.

# Detailed comments

## Title, Abstract, Header
* Title: "expose data and logic for the web of data" → "on the web of data"?
* "Undefined 0 (0) 1" in the header should be replaced (at least journal name)
* Word "Abstract" formatted according to journal template?

## Introduction

* "The Web of data is largely concerned with procuring web applications that publicly display their information by means of metadata and explicit semantics, such that entities from heterogeneous web information systems can be interlinked [17]."

→ I cannot find this statement anywhere in the cited paper. The wording is a bit odd (the web of data "is concerned with" "procuring"?), so it's difficult to discern the intended meaning, but I wouldn't say that the web of data is about "procuring web applications".

* "Best practices of Linked Open Data (LOD) software engineering ... [18]"
→ the cited paper is not about software engineering, but about Linked Data publishing.

* I also don't think that it's fair to say that the Linked Data principles "provide a guide to reengineer legacy web apps" since they are not about reengineering, but about publishing data on the web.

* "providing an existing web application with LOD capabilities" - wording: do you mean "extending"?

par 2:
* "web scrappers" → "web scrapers"
* "diverse middleware" → remove "diverse"?
* "Lots of LD techniques" → "A lot of"

par 3:
* "not insignificant" → "significant"
* "diverse software quality features" → remove "diverse"?
* "particularly concerning with" → remove "with"

p.2, par 2:
* "The research methodology followed for this aim" → "to achieve this aim"?
* "articulated" → do you mean "described"?
* "based on the discernible software architecture of most web applications" → not sure what you mean with "discernible" here.
* "followed by a consolidated discussion" → remove "consolidated"

## Section 2

p. 2, par 1:

* "The architecture of LD applications are discussed" → "is discussed"
* "Alternative patterns have a very low query execution.." → something seems to be missing, "low" what, performance? Also, it would be useful to explicitly name these "alternative patterns".
* "is made up of" (twice in a sentence)→ "consists of"

## 2.1. Reengineering strategies

* "probably because the application" → "typically"
* "on one hand" → "on the one hand"
* A figure (e.g., table) that relates application characteristics (availability of source code, supply of a built-in information exposure facility, disclosure of DB contents) to applicable LDAA patterns would be useful.

p.4:
par 1:
* "The MVC architectural pattern is the most frequent" → "most frequently used"
* "web scrapping" → "web scraping"

item 3:
* "normally designed with" → "typically designed with"

# 3 The EasyData LOD extension strategy
p.4:
* "EasyData is the name of a new approach to LOD extension in order to reengineer legacy MVC-based web apps." - redundant? Introduce EasyData in the previous sentence and remove?

p.5:
* "Since access control to the generated LOD resources should not be granted for everyone"
→ replace "Since" with "If"?
→ do you mean "access ..should not be granted" rather than "access control .. should not be granted"?
* Figure 2 caption: remove "Applicable"?
* remove "therewith the"
* "The EasyData procedure is practicable as long as the application source code is available." → "The EasyData strategy is applicable if the application source code is available."
* "This is granted in open source" → "This is the case with open source software"
* "scrapping" → "scraping"

# 4 Evaluating..

par 1:
* "Two different prototypes" - remove "different"?
* "Each prototype serves the EasyData procedure to be applied" - rephrase wording?
* "such as Ruby-on-Rails and Django" → remove "such as" if you have implemented the approach for exactly these two frameworks.

par 2:
* "How the revealing,... is explained next." → "Next, we explain .."
* Figure 3 caption: not sure if "revealing" the application data model is the correct verb to use here, maybe "mapping", "annotating".. would be more appropriate?

par 3:
* "After revealing and mapping": Again, what is the difference between "revealing" and "mapping"?

p.7 Step 4
* "access control grant can be configured" → remove "grant"
* "generated the previous steps" → "generated in the previous steps"

## 4.1 Comparison..
* "The tools have been selected as long as they fulfill.." → "based on a number of conditions"

## Discussion
p. 10:
* "Therefore external browsers might have not the appropriate authorization privileges.." → "Therefore, external browsers may not have the required access priviliges"
* "an unified security access control layer" → "a unified security access control layer"


Comments

Informal feedback from a person who declined to review:

I have read it and I think the content is generally ok and the research solid, but it requires a significant re-organization. A different title would be a good start. The abstract and introduction should also focus more on the authors' work and what it is.

e.g.

EasyData: re-engineering existing apps for the semantic web

In the abstract I am missing a statement along the lines of "EasyData does ... and is compared to ...".

In the comparison to other tools with the SonarQube static analysis, it fails due to the language differences between the tools: these metrics are not comparable. I think that the explanation about security issues is incorrect here as well.

All in all, as a reader I would expect more explanation about the EasyData approach and how it actually works. Which kinds of MVC apps work? Which languages/toolings? Just Django and Rails?