A PROV-Compliant Approach for the Script-to-Workflow Process

Tracking #: 2047-3260

Lucas Carvalho
Khalid Belhajjame
Claudia Bauzer Medeiros

Responsible editor: 
Guest Editors Semantic E-Science 2018

Submission type: 
Full Paper
Scientific discovery and analysis are increasingly computational and data-driven. Scripting languages, such as Shell, Python and R, are the means of choice of the majority of scientists to encode and run their simulations and data analyses. Although widely used, scripts are hard to understand, adapt, reuse, and reproduce. To tackle the problems faced by scripts, several approaches have been proposed, such as YesWorkflow and noWorkflow. However, they neither allow the experiment to be fully documented nor help when third parties want to reuse just part of the code. Scientific Workflow Management Systems (SWfMSs) are increasingly recognized as a means to mitigate these problems. They help to document and reuse experiments by supporting scientists in the design and execution of their experiments, which are specified and run as interconnected (reusable) workflow components (a.k.a. building blocks). Taking this into account, we designed W2Share, a novel approach for the management, reuse, and reproducibility of script-based experiments. W2Share transforms a script into an executable workflow that is accompanied by annotations, example datasets and provenance traces of its execution, all of which are encapsulated into a workflow research object. This allows third-party users to understand the data analysis encoded by the original script, run the associated workflow using the same or different datasets, or even repurpose it for a different analysis. W2Share also enables traceability of the script-to-workflow process, thereby establishing trust in this process. All processes in W2Share follow a methodology that is based on requirements that we elicited for this purpose. The methodology exploits tools and standards that have been developed by the scientific community, in particular YesWorkflow, Research Objects and the W3C PROV. This paper highlights the main components of W2Share, which is showcased through a real-world use case from Molecular Dynamics. 
We furthermore validate our approach by testing the ability to answer competency questions that address the script-to-workflow process.

Major Revision

Solicited Reviews:
Review #1
By Paul Groth submitted on 21/Jan/2019
Major Revision
Review Comment:

This paper describes a methodology for the conversion of scripts (e.g. python, shell scripts) into a combination of an abstract and concrete workflow with associated data in order to facilitate reproducibility and reusability of a scientific experiment. This is an extension of an existing work published at the e-Science conference.

Overall, I like the direction of supplying assistance to users with the transition from "creative" scripts to more reusable artefacts. The paper puts together a number of existing technologies into the methodology. I think this is a good use of existing work and I like how it builds on top of what's already out there.

I do think the paper needs to address a few things before it's published as a journal paper.

1) Evaluation

The entire evaluation is based on competency questions.

A side note: I don't know why you use "MatWare: Constructing and Exploiting Domain Specific Warehouses by Aggregating Semantic Data" for the definition of competency query. Is there a difference between "competency questions" and "competency queries"? Competency questions are the most commonly used approach for ontologies. The canonical reference is http://stl.mie.utoronto.ca/publications/gruninger-ijcai95.pdf

I'm not opposed to this kind of evaluation technique but I find the questions to be entirely too tailored to the system design. Good competency questions are generic natural language questions that in general a system should be able to answer. My question I guess is whether or not you could reuse these competency questions for another provenance/reproducibility system. I think you need to make an argument for this and maybe re-characterize the questions. Another option would be to check the questions against another system entirely. I'm not sure if that's required but it would make for a stronger piece of work.

Also, the use of competency questions as an evaluation technique has been used for other provenance systems. See:

- Ram, Sudha, and Jun Liu. "A new perspective on semantics of data provenance." Proceedings of the First International Conference on Semantic Web in Provenance Management, Volume 526. CEUR-WS.org, 2009.

- Samuel, S., and König-Ries, B. "Provenance Oriented Reproducibility of Scripts using the REPRODUCE-ME Ontology."

I would expect at least a comparison to those. You might use those as rationale for the questions you introduce?

Note that the REPRODUCE-ME system also uses noWorkflow and an ontology. Can you compare to that?

2) Publication of model in an open way.

The model you develop as a UML diagram seems like a good contribution. Why not publish it on the web as linked data?

3) Too strong claims

- On page 9, line 30, you claim generality. I'm not sure you can take Taverna to be representative of all workflow systems. I think it would be good to argue, or at least refer to some related work that surveys workflow systems, to make this claim.

- On page 10, I would claim that your introduction of machine-readable abstract workflows is too strong. The Wings workflow system, for example, has a strong semantics of abstract workflows. It's tied to a comprehensive catalog of data and software described in an ontology. I think it would be best to dial back this argument and instead make the argument that you provide novelty with respect to scripts.

4) When to use workflows and when to use scripts

One of my questions when reading the paper is: when does the scientist or script writer flip to the workflow environment? It's not a standard one-way conversion where you sit in the script environment and hand off once. The approach implies that you should be editing workflows directly (Step 3). But as we know, scientists don't do that. I think this should be discussed in the paper.

Minor notes
I don't think it's necessary to always refer to the prior paper. I think it's good enough to mark this in the introduction and where it's necessary in the text.
- I did some work on converting Jupyter notebooks to provenance and back that predated NiW and tried to address this script-to-provenance-script reproducibility case: Adianto Wibisono, Peter Bloem, Gerben Klaas Dirk de Vries, Paul T. Groth, Adam Belloum, Marian Bubak: Generating Scientific Documentation for Computational Experiments Using Provenance. IPAW 2014: 168-179

Review #2
Anonymous submitted on 10/Mar/2019
Major Revision
Review Comment:

This paper describes a methodology and its implementation (called W2Share) to support the transformation of scripts into executable workflows, and the aggregation of these resources along with other related resources (e.g., execution traces and annotations) into research objects. The authors list five contributions: i) the methodology (described previously in another paper); ii) the data model identifying the main elements and relations of the methodology; iii) the implementation in W2Share; iv) a case study to showcase the solution; and v) an evaluation via competency questions.
The work presented is generally interesting and relevant, well written and easy to follow, and it could in theory provide benefits to researchers and to science in general. However, I have many concerns regarding the practicality of this approach; the implementation, which seems like a very early prototype at best that is not working as expected and does not cover all the mentioned aspects; the evaluation, which is rather weak; and many inconsistencies in the paper with the models and the data used. Very disappointing is that the final output (i.e., the research object) is not even created correctly, which fails to demonstrate the approach even with the example presented (see comments below).
Reproducibility and reusability will also require access to the resources on which these workflows depend, i.e., web services, datasets, etc., addressing workflow decay issues, an aspect that is not even mentioned. The methodology is said to contemplate version control systems to track changes to scripts, but this is not further elaborated and is only mentioned in the future work.
Related work is discussed; however, regarding tools like Jupyter notebooks, it may be worth mentioning some extensions and tools that enable building notebooks as reusable modules or that allow code reuse, especially for Python (e.g., [1]).
More detailed comments are:

**** data model

The benefit and role of the data model are not clear, i.e., although the authors claim it supports the (semi-)automatic script-to-workflow conversion, it is not clear how it is practically used and/or how it relates to the implementation, which makes it slightly confusing.

Why does the data model not use the same approach for annotation as research objects (which are actually the final product of the methodology)? What is the relation between the data model elements and the terms in wfdesc, wfprov and other RO ontologies? The data model does not consider an element for workflow runs and/or a way to identify and collect information about multiple runs of the same workflow.

**** evaluation
The evaluation is rather weak. It only accounts for the technical aspects (e.g., once the information has been manually represented using the underlying ontologies, the ability to answer queries that are potentially interesting for scientists).

The questions address the requirements used to design the methodology, which in turn are derived from the authors' experiences during their collaboration with scientists. It is not clear, though, whether these requirements were derived using some formal approach, e.g., whether (and how many) scientists were evaluated or surveyed. The real applicability of the proposed approach is also not evaluated, i.e., given the tool, were real scientists actually able to use it and benefit from it?

One of the requirements deals with quality assessment. Is quality only assessed on the basis of whether the workflow reproduces the script's results (within some tolerance threshold)? Are there other workflow quality parameters considered by the methodology? This is an important task, but it is not further elaborated in the methodology or tools. Can quality change over time? Can this assessment be supported by the tools? Provenance capture during transformation is also mentioned in Section 5, but is not demonstrated (see comments below). Similarly, quality annotations on the W2Share website do not seem to work (see comments below).

*** implementation

One disappointing issue is that the output research object of the use case (created in 2016?) is not correctly created. The original script is not aggregated (and is missing - nowhere to be found), some aggregated resources (from the manifest) are missing, e.g., provenance/script-workflow.prov.ttl, aggregated resources are not defined in the manifest (e.g., type ro:Resource) and their provenance information is missing (e.g., creator & creation date), relevant annotations are missing (e.g., the type of resource, such as workflow or dataset), etc. More importantly, all the provenance information generated during the conversion process is missing. Execution provenance could also be aggregated as nested ROs, as the bundles generated by Taverna are ROs themselves. The paper aims to demonstrate the benefit of the methodology by creating, from a script, a final research object that can be easily shared and reused, but the final product of the example is not even correct.

Another disappointment is that W2Share was not working as expected at the time of reviewing. First, the script converter step was not working properly, so the first step (extract the workflow topology, transform it into an ontology-based structure, add provenance info) could not be replicated/checked. It was impossible to get the abstract workflow (and the annotations linking the workflow to the script).
Some of the scripts available on the W2Share website have an incorrect graph or a missing abstract workflow.
The script that seems to be the excerpt from the use case example [2] does not properly display the graph image. It has in theory an abstract workflow, but it is in fact a t2flow implementation, i.e., there is no file with a wfdesc description, and no provenance information that links it back to S.
The W2Share website also includes a Quality Flow to add quality information, but it does not seem to work. It is possible to add quality metrics, but not to add quality annotations.

Regarding the transformation of the abstract workflow into an executable workflow, it is said that W2Share enables this process to be done (semi-)automatically. Again, this cannot be validated. Even if step one works, if the abstract workflow is a wfdesc description, how can this be used to create the executable workflow semi-automatically? If the abstract workflow is a t2flow, it may be easier, but then how is it created automatically in step one? And when is the wfdesc description created?

***** inconsistencies

The full original script is not available, but many inconsistencies between the appendixes, the figures and text in the paper can be spotted.
* For example, in Figure 5 the process split has input structure_pdb, while in Appendix A.1 the inputs are initial_structure and directory_path (directory); similarly, the process psfgen has some inputs in the figure that do not appear in the appendix, and output variables are named differently.
* Figure 8 also has many differences with its corresponding appendixes (A.2, A.3), e.g., names of parameters, titles/labels, types, etc.
* Subprocess provenance information is not clear, e.g., wasDerivedFrom an Entity, but how to know which entity? It is possible to obtain it because the process is a subprocess of which was derived from , but there are no explicit semantics to infer it automatically.
* According to the annotations in your example, Query 2 would give you both the abstract workflow and the executable workflow as a result, i.e., and are both prov:wasDerivedFrom
* Query 4 is said to identify, given an executable workflow We, which script blocks originated each workflow activity. However, the query only uses abstract workflow information, not the executable workflow. In fact there is no information about the workflow process in the executable workflow, besides the fact that it was derived from the process in the abstract workflow.
* Query 5 retrieves the agent responsible for the abstract workflow creation, but it is not known if this agent annotated or curated the script.
* Query 6 will not work with variants, unless prov:wasDerivedFrom were transitive.
* The results in Table 7 do not match the data in the appendixes; besides, in order to get the real inputs/outputs, the query should also include tavernaprov:content.
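The transitivity point about Query 6 can be made concrete. In the hypothetical chain below (entity names are invented for illustration, not taken from the paper), a query matching a single prov:wasDerivedFrom edge finds only the first variant, whereas the SPARQL 1.1 property path prov:wasDerivedFrom+ would also reach variants derived from other variants. A minimal Python sketch of that transitive closure:

```python
# Hypothetical derivation chain (names invented for illustration):
# variant2 prov:wasDerivedFrom variant1 prov:wasDerivedFrom abstract_wf
edges = {
    "variant1": {"abstract_wf"},
    "variant2": {"variant1"},
}

def derived_from(entity, edges):
    """All ancestors of `entity`, i.e. what prov:wasDerivedFrom+ would match."""
    seen, stack = set(), [entity]
    while stack:
        for parent in edges.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# A single-edge match (plain prov:wasDerivedFrom) finds only variant1:
direct = {e for e, parents in edges.items() if "abstract_wf" in parents}

# The transitive version (prov:wasDerivedFrom+) also reaches variant2:
transitive = {e for e in edges if "abstract_wf" in derived_from(e, edges)}
```

In SPARQL terms, this corresponds to rewriting a pattern like `?v prov:wasDerivedFrom <abstract>` as `?v prov:wasDerivedFrom+ <abstract>`; without such a rewrite (or an inference rule declaring the property transitive), Query 6 would indeed miss variants of variants.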

**** RO ontologies
* The wfdesc ontology makes a separation between abstract workflows and workflow implementations that is not taken into account in the example. The executable workflows and their variants are declared in the appendixes as wfdesc:Workflow. Why are they not (also) declared as wfdesc:WorkflowDefinition, which is the class used to represent an actual workflow implementation?
* There are also errors in the annotations of the example: the wf4ever ontology does not define any workflow class. Workflow classes are defined in wfdesc.

* W2Share, a computational framework or W2Share, a computation framework
* First line of page 22, “…and the latter may be derived from script Code Blocks”. It seems “latter” should be replaced by executable/refined workflows.
* Query 1, Table 2: the variable is process, not processor.

[1] https://post2web.github.io/posts/reuse-jupyter-notebooks/
[2] http://w2share.lis.ic.unicamp.br/script-converter/details/2c840bd2943ca9...

Review #3
By Idafen Santana submitted on 12/Mar/2019
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality: the main contribution of the paper is the assembly of existing tools into a new framework, including a methodology, to support experiment shareability.

(2) significance of the results: an evaluation is provided; the main problem is that it only covers one experiment, thus lacking generalisation.

(3) quality of writing: the paper is well written and is easy to read and follow.

This paper introduces a novel approach for transforming script-based computational experiments into scientific workflows. For that, it relies on combining existing tools and approaches, extending previous contributions, to set up a framework for increasing the understandability and reproducibility of the experiment. The authors provide an evaluation study based on a real-world scenario from the Molecular Dynamics field. The resulting workflows consist not only of the workflow itself, but of an aggregation of resources, producing a shareable research object.

The paper is generally well written and structured.

Whereas the framework itself is complete and useful, and the contribution is sound, my main concern with this approach is its usability for developing scientific experiments, and whether the very idea of transforming scripts into workflows is worthwhile given the amount of effort (including manual intervention) that is required.

In the second paragraph on page 2 the authors state that "workflows are better than scripts for understandability and reuse", which has been demonstrated to be true in computational science. However, it is not clear that scientists, especially those used to developing script-based experiments, are knowledgeable about them. Thus, asking them to transform their experiments into workflows might be a challenging task. Even when a methodology is provided as part of the contribution, the manual annotation of the scripts required for transforming them into workflows seems a bit complicated. Identifying "units and dependencies within the script" (requirement 5) is not a straightforward task for a non-expert in scientific workflows.

In step 5, the bundling process encapsulates the auxiliary resources required, namely "annotations, provenance traces, datasets, among others". This generates the WRO, which according to the authors is meant to be a self-contained bundle that enables sharing and reproducing the experiment. However, the underlying software/hardware seems to be missing. It is hard to support the reproducibility (or replicability, reusability, etc.) of a computational experiment without considering the execution environment and dependencies.

The concept of a Curator is mentioned in Section 3, which I consider a highly interesting idea. As stated before in this review, there is a clear need for an expert in the area of workflow-based experiments in order to produce bundles that are reproducible. This concept could be further explained and discussed, elaborating on the curator's role and activities.

In Section 4, when discussing the use case scenario from Molecular Dynamics, the authors state that Taverna was chosen as "it supports the execution of shell scripts, the script language adopted in our case study". It is not clear to me to what extent the design of the experiment restricts the selection of the WMS, and what in general the coverage of WMSs and experiments would be.

Figure 8 gives a clear overview of the amount of annotation required from the user in order for the system to be able to generate the semantic representation of the experiment. This is an interesting approach, and as stated by the authors, it has been proven useful. However, this still poses a challenge to the final user (i.e., the scientists who developed the script).

The evaluation of this contribution is sound, and the competency questions and related queries show the potential of this approach. These queries would enable a deep and comprehensive analysis of the experiment from different perspectives. This certainly is a "step towards fully reproducible research", but as mentioned, it is still challenging and fails to consider relevant aspects (e.g., the execution environment) needed to be a complete approach. My main concern in this regard is that the evaluation is limited to only one experiment in a given WMS. Thus, it is hard to state how general the approach is.

Besides the issues raised above, the contribution of this paper is sound and able to produce highly useful and interesting results, which are worthy for the scientific community. Overall, the paper should work on describing how the proposed contribution could be deployed (or stress success stories of its usage) among scientific communities.