Automatizing experiment reproducibility using semantic models and container virtualization

Tracking #: 2128-3341

Carlos Buil Aranda
Idafen Santana
Maximiliano Osorio

Responsible editor: 
Guest Editors Semantic E-Science 2018

Submission type: 
Full Paper
Experimental reproducibility is a major cornerstone of the Scientific Method, allowing researchers to run an experiment to verify its validity and to advance science by building on top of previous results and introducing changes to them. In order to achieve this goal, in the context of current in-silico experiments, it is mandatory to address the conservation of the underlying infrastructure (i.e., computational resources and software components) in which the experiment is executed. This represents a major challenge, since the execution of the same experiment on different execution environments may lead to significant differences in results, assuming the scientist manages to actually run that experiment. In this work, we propose a method that extends existing semantic models and systems to automatically describe the execution environment of scientific workflows. Our approach allows us to identify issues between different execution environments, easing experimental reproducibility. We also propose the use of container virtualization to allow the distribution and dissemination of experiments. We have evaluated our approach using three different workflow management systems for a total of five different experiments, showcasing the feasibility of our approach both to reproduce the experiments and to identify potential execution issues.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 16/Apr/2019
Major Revision
Review Comment:

The paper describes DockerPedia, a Docker-based experimental workflow engine that aims at fostering reproducibility of scientific experiments using semantic technologies.

The work touches a critical problem: automation in the context of reproducibility of scientific experiments. The authors' contributions include:

- The DockerPedia annotation API (extending Clair and the Docker engine)
- An ontology for annotating the workflow data
- A replication study, which includes four scientific workflows used for validation.

The paper is well written, although sometimes over-complicated.
On the other hand, there are many repetitions, and the overall organisation can be improved.

## General Comment

Although the idea is exciting and promising, some significant issues should be solved.

1) The authors based their conceptual framework on [3], which distinguishes between a) logical conservation achieved by sharing code, data, and metadata, and b) physical conservation achieved using virtual machines.

The authors present containerization as a way to achieve physical conservation, preferable to virtual machines because it requires less disk space.

Nonetheless, virtual machines are the **only** way to achieve physical conservation as intended in [3]. Indeed, the presence of the hypervisor is compulsory to have full portability and make the execution architecture-agnostic.

Although containers are referred to as lightweight virtualisation techniques, no actual virtualisation is provided. Instead, isolation is achieved using cgroups and namespaces. Therefore, containers are architecture-dependent: in general, it is not possible to run a container on an architecture different from the one it was built on, e.g., Intel and ARM. There are some exceptions, like JVM-based applications.

Moreover, containers are no more than data, code, and metadata. Images are read-only file systems that contain all the packages, plus the metadata generated during the build process. Therefore, in the proposed framework containers are an advanced method for logical conservation.

In summary, the contributions should be rephrased according to the fact that containers are not virtualisation techniques and, thus, cannot achieve the same results as VMs in terms of portability.

2) The paper mixes two messages, making it a bit hard to follow in some of its parts.
Indeed, it seems the authors are both
(a) proposing a new method for fostering reproducibility of containerised workflows
(b) advocating for the choice of containers against virtual machines.

The authors do not need to sustain the latter argument, as they provided enough evidence of it in the state-of-the-art section.
Indeed, as they clearly pointed out, there is an active research community that already adopted containers to this extent.

An additional third message that is not clearly elaborated regards the role of metadata enrichment with vulnerability databases.
Can the authors please clarify why it is relevant to include such metadata with the build process? How does this contribute to fostering reproducibility?

3) The evaluation methodology is not convincing.
It is not clear why reproducing the same results as an existing experimental workflow would prove the approach's effectiveness. It proves the effectiveness of Docker.

The authors should clarify what the Key Performance Indicators of their approach are, and maybe choose a different kind of evaluation.
A user study comparing DockerPedia with existing tools using Technical Action Research would, for instance, clarify whether automated annotation is effective.

Alternatively, measuring the overhead of adding metadata to a standard docker build might be an initial step.

## Section


This section starts with the concept of reproducibility then immediately discusses the existing conceptual frameworks to achieve it.
However, an introductory section should focus on positioning and motivating the work as done later on in Section 3.

A related work section is usually comparative and presented at the end of the paper to discuss similarities and differences of the proposed approach with
state-of-the-art ones.

A background section would fit more naturally in this position. Notably, the relevant information about Docker,
DockerHub and the Open Container Initiative should be introduced here. Indeed, this is all information necessary to understand the content of the paper.

Moreover, although I understand that vulnerability analyses strongly rely on metadata, I miss the connection with reproducibility.
Potentially a bug can explain some experimental results. However, I find this more a potential use-case of semantically enriched build
metadata, rather than part of the pipeline.

The authors mentioned two ontologies for describing docker files and docker metadata.
However, they opted for developing their own vocabulary. What are the arguments against reusing existing resources?


This section mixes background information about the Docker ecosystem with a set of representative problems
for container-based scientific workflows.

These problems are later on renamed as requirements. However, no argument is presented for how the contribution satisfies these requirements.

I suggest restructuring this section as part of the background.


This section presents the architecture of DockerPedia, consisting of:

- An ontology extending WICUS from [3]
- An Annotator REST service
- A triple store with a SPARQL endpoint to query the collected metadata.

The annotation process makes use of the ontology as well as Clair, a tool for image vulnerability analysis that produces detailed reports about the image building process and the packages installed.

Why did the authors not extend the build engine to make use of their metadata?

- Repeated sentence: "Clair downloads all layers of an image, mounts and analyzes them, determining the operating system of the layer and the packages added and removed from it"

Finally, the authors present some examples of SPARQL queries that check for differences across images.
The utility of the examples is clear, but wouldn't a validation language like SHACL be more appropriate for the purpose?
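Should the authors explore that route, a SHACL shape could state the expected environment constraints declaratively instead of querying for differences ad hoc. The sketch below is purely illustrative: the namespaces and the hasVersion property are hypothetical, merely echoing the paper's SoftwarePackage class.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Illustrative namespaces only, not the paper's actual IRIs.
@prefix dockerpedia: <https://example.org/dockerpedia#> .
@prefix ex:  <https://example.org/shapes#> .

# Hypothetical shape: every annotated software package must carry
# exactly one version string, so two images can be validated for
# "same package, same version" automatically.
ex:SoftwarePackageShape
    a sh:NodeShape ;
    sh:targetClass dockerpedia:SoftwarePackage ;
    sh:property [
        sh:path dockerpedia:hasVersion ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] .
```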


The evaluation method proposed by the authors follows five steps:

- create docker images for each of the proposed workflows
- annotate the images using DockerPedia
- reproduce the environment using DockerPedia
- compare the environment with one set up in a virtual machine
- run the experiments and compare the results

The authors claim that the equality of the results will confirm the effectiveness of their approach.
However, it just proves that

- DockerPedia is bug-free (for the cases exercised by those experiments)
- the containers produce the same results as virtual machines (which we know is the case, given the vast industrial adoption)

Instead, it is not clear what a researcher would obtain by using DockerPedia rather than one of the proposed tests.

The authors provide an example of a dependency change in the SoyKB image.
They claim they were able to spot this dependency change thanks to the annotations.
I do not doubt this was the case, but the reason for this problem is that they used the "latest" tag as a dependency,
which is not recommended.
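For illustration, the difference is a one-line change in the Dockerfile; the image names below are hypothetical, not taken from the paper:

```dockerfile
# Fragile: "latest" is a moving target, so a rebuild months later may
# silently pull a different parent image (the problem observed for SoyKB).
FROM ubuntu:latest

# Reproducible alternative: pin a fixed tag, or better a content digest,
# so every rebuild resolves to the same parent image:
#   FROM ubuntu:18.04
#   FROM ubuntu@sha256:<digest>
```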

Review #2
Anonymous submitted on 18/Apr/2019
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors propose in this paper a solution for describing virtualization of scientific workflows with the view to redeploying them and ultimately checking the reproducibility of their results. In doing so, they assume a context in which docker is used for virtualization and focus on the problem of "annotating" such virtualization. They assessed their solution using 5 workflows.

Strong points:
C1. The solution proposed is simple and practical, and can be used for describing environments used for running workflows.
C2. The method proposed is assessed empirically.
C3. The paper is well written, modulo a few typos.

Weak points:
C4. The authors are in my opinion over-selling their solution.
C5. The solution proposed is not particular to workflows.
C6. The security aspect is not dealt with properly.

Regarding C4. The authors are overselling their solution. The introduction and related work lead the reader to think that the solution proposed by the authors allows for the physical and logical conservation of workflows. To a certain extent, the reader is misled in the beginning to think that the workflow will be semantically annotated, together with the resources that are used and the datasets. In essence, however, the authors are dealing with purely physical conservation aspects, in the sense that the solution is mainly targeted at the redeployment of the environment on top of which the workflow is executed. To actually know the gist of the paper, I had to wait until the introduction of Section 4 (the last paragraph of the introduction), which states that given a Docker image, what the authors are doing is extracting information about the steps that are used to redeploy the environment, with information about each (deployment step). I think that the authors need to focus on this from the beginning of the paper.

Regarding C5. IMHO, the solution proposed applies to any Docker image. There is little in the output that has to do with the conservation of the in-silico experiment. Actually, the paper should be clearly redirected to focus on the physical conservation of the virtualization, as opposed to the reproducibility of the in-silico experiments.

Regarding C6. I think that the authors opened up a can of worms. First, the motivation connecting reproducibility to security was not compelling enough, and second, the solution proposed is not a breakthrough. Therefore, I once again suggest that the authors focus mainly on the annotation and conservation of virtualization built on Docker.

Review #3
By Daniel Nüst submitted on 18/Apr/2019
Major Revision
Review Comment:

## Page 1

Comments are referenced as [line number] (column).

37 (left): First sentence is confusing: how is introducing changes necessary for reproduction? I also recommend to clarify the usage of reproducibility early on, since "consistent with the original one" does not mean "equal to the original result", does it? The term is inconsistently used across disciplines, and for readers a clear statement on how the authors understand the term is valuable, cf.

I applaud the stressing of re-use and extension as the ultimate goal of reproducibility.

48 (left): The connection between workflows and large-scale computations is unclear. Why is reproducibility not relevant for small workflows, running on a scientist's laptop?

39 (right): Do you include hardware in "computational resources"? It becomes clear later; I suggest to explain "computational resources" early on or use simpler words.

47 (right): Use of term "replication" unclear (see above) - it is commonly understood as coming to the same conclusions without having any information from the author.

47 (right): "additional information" - additional compared to what? Probably explained later.

## Page 2

9 (left): Does this refer to "the" Research Objects?

21 (left): How do _virtual_ machines help with _physical_ conservation? Storage demands of virtual machines are not a strong argument when datasets grow larger and larger every year. The used arguments in favour of containerisation (lines 36 ff.) are not specific to containers but also hold for VMs. I suggest to clarify the advantages for containers over VMs (challenges of UI-based workflows? Dockerfile as recipe?).

1 (right): Others have proposed Docker images for reproducibility, as also detailed in section 2. Please clarify the new contributions of the presented work, which is afaict at this point the annotation of images.

6 (right): "Containers are lightweight" is an often used argument, but needs clarification. Light on what scale? Is a few GB saved storage and quicker boot duration really relevant in scientific reproduction of workflows done only by a handful of readers?

## Page 3

19 (left): Data storage costs are high - please provide support for this argument. Docker Hub, used for storing images, is for example completely free.

31ff. (left): While I support the conclusion of VMs not being suitable, I think the discussion takes some shortcuts here. Why can I not know what is inside a VM? Why do I not know what is in a container, when it was created from a Dockerfile (line 46)? Please take a look at ReproZip as a packaging format that abstracts from VMs and containers.

49 (left): Please re-check your statement "these works ... only express desiderata". At least Marwick provides a case study (= with a solution).

10 (right): I cannot follow the argument leading to scalability and security issues.

12 (right): It is unclear what role software vulnerabilities play for the presented work. The authorship of containers in a scientific setting should be clear, thus trust is usually not an issue. Please clarify the relevance of security for reproductions in scholarly settings.

30 (right): Please clarify the differences between the set of ontologies presented by the authors and the existing ones. What are their shortcomings?

46ff. (right): (and also other places) Containers can also be black boxes, right? Why is a VM "fixed" - a user can start it and make changes to it. Why do manual annotations reduce quality and trust?

Also, I suggest not to hide the two main problems tackled at the end of the related work section (see comment above about main contribution).

## Page 4

44 (left): Introduce abbreviation "WMS"

50 (left): "they archival" - "the" or "their" ?

3 (right): I still have problems seeing containers as a solution for _physical_ preservation. Physical preservation of software or data in my understanding would be a (brick and mortar) library that has several hard disks of data stored in different locations.

6 (right): "audit features" mentioned here, and only here. Please expand on that use case or stick to "annotation".

13 (right): "images, containing the Operating System and dependencies" - it is my understanding that containers do not include the "operating system", namely not the kernel. That is the whole point. Please recheck. Also, if size should remain an important argument (questionable), what about Docker's image layers role for storage size?

41 (right): How does the system identify layers (to be reused) - by the layer ID I assume? Could be clarified for non-Docker experts.
This section might also be a good place to introduce the files that store the image/layer metadata, or mention the Docker API?

## Page 5

12 (left): Extend an existing Dockerfile - this can be confusing, I think it is worth introducing the concept of parent images here (FROM ...). Also, the Dockerfile does not "finally uploads it to Docker Hub", please clarify (Docker CLI vs. Dockerfile)
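For readers without Docker expertise, a minimal sketch (with hypothetical image names) may help separate the two concerns conflated here: the Dockerfile only declares a parent image and build steps, while uploading is a separate CLI action.

```dockerfile
# FROM makes this image a child of (i.e. it "extends") ubuntu:18.04,
# the parent image.
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y python3

# Uploading is not done by the Dockerfile but by separate Docker CLI
# commands, e.g.:
#   docker build -t myuser/myworkflow:1.0 .
#   docker push myuser/myworkflow:1.0
```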

13 (left): "the software packages needed to build the Docker image" - the only software needed to build the image is Docker; you probably mean all software packages installed into the container?

27 (left): "which components are installed..." - please consider that the statement about intuitiveness of Dockerfiles strongly depends on a person's background. You might make the argument here that that is the case for scientists, but then please connect to other parts of your manuscript how you solve that problem.

30 (left): "Also, some components might exist in the container that are not specified by the Dockerfile itself" - which ones? Do you mean dependencies of the installed software?

52 (left): What does "light-weighted" mean in this context? Suggest to rephrase.

1 (right): Please elaborate on the expected process of reproduction - why does it affect "production infrastructures"? Because you only cover HPC workflows?

5ff (right): versions and rollback: This is quite short, and I can imagine what you mean, but it would be better to strike this or explain thoroughly, or find a reference: Do you mean Dockerfiles under version control, or tagged images (with time or releases) ?

47 (right): Are DeploymentPlan, DeploymentStep etc. not also classes and should be typeset in the same font as SoftwarePackage on the following page?

## Page 6

4 (left): Suggest to mention Singularity earlier, when you introduce containerisation.

9 (left): First mentioning of "dockerpedia" - please explain! Also, what is the relation to the "docker:" namespace in Figure 1? Shouldn't it be the same?

14 (left): "In summary, we annotate every installed software package on the container file system." Please clarify if you annotate the packages that are mentioned in the Dockerfile, or also that package's dependencies. I assume not the latter. Have you considered running a command like `dpkg --list` to get a list of all installed software?
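The suggestion above is cheap to implement: `dpkg -l` output is easy to parse into a complete package inventory, including dependencies installed implicitly. The sketch below assumes Debian-style output; the sample text is abbreviated and hypothetical.

```python
def parse_dpkg_list(output):
    """Extract (name, version) pairs from `dpkg -l` style output.

    Installed packages are the lines whose state flags are "ii";
    the next whitespace-separated fields are name and version.
    """
    packages = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            packages.append((fields[1], fields[2]))
    return packages


# Abbreviated, hypothetical dpkg output for illustration.
sample = """\
Desired=Unknown/Install/Remove/Purge/Hold
ii  python3  3.6.9-1~18.04    amd64  interactive high-level language
ii  wget     1.19.4-1ubuntu2  amd64  retrieves files from the web
rc  old-pkg  0.1-1            amd64  removed, config files remain
"""

print(parse_dpkg_list(sample))
```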

27 (left): "annotation service implements a REST interface" - please provide a link to the specification of said interface.

10 (right): Please clarify how you handle base images (FROM ubuntu) and image stacks - or is the metadata for all base images given?

39 (right): How can you model source-based installation, e.g. wget-ting an archive and installing with make? I think that is a common approach, especially since Docker is often used to provide a software stack that is not easy to install on all platforms (i.e. software that is not available via package managers).

36 (left): Right - you use Clair to capture all dependencies. Very good; please consider my above comments as requests to mention that earlier. Can you provide an example link to what the result of a Clair analysis looks like? I think it could illustrate your integration of the different used tools well.

41 (left): Can you provide a citation for Clair, or just the GitHub link? In general, I would kindly ask you to double check if you cite each software the way they want to be cited, to give proper credit (which does not work with just a URL).

22 (right): "we extend Clair in our system" - can you provide a link to a pull request or commit with your extensions? Are your changes in ?

## Page 7

Listing 1: I don't see a tensor-flow package installed - maybe these are just the dependencies needed by tensorflow?

33 (right): "To do that, we create the Docker image again using the previous annotations." Can you clarify or give an example for a whole "round trip": Dockerfile > Docker image > annotations parsed from Docker image > Docker image (or do you generate a Dockerfile from the annotations?)

37 (right): "just repeating the package manager install command": I think the "just" in that sentence does not do justice to the complex system you built. Can you provide an illustrative example, e.g. what information does Clair extract, and what install command do you create for a specific package and version? Or does it not differ at all? Do you use the specific version in the apt commands?

Listing 2: It is unclear why you would query for Pegasus software packages. Please better introduce the example, considering that Pegasus is properly introduced only on page 10.

## Page 8

Fig. 2: Please provide links to the source code of the Annotator (d3.js-based, Go). Also, consider adding numbers to structure the interaction - the order in which requests are made is not obvious.

49 (left): You say "five different experiments" but only have 4 names in the brackets after that. Please clarify.

24 (right): "guarantee that our approach is platform independent" - Please clarify if you actually run 45 workflows (five experiments * three workflow systems * three execution environments). Later it does not become clear which platform was used to generate which output (e.g. extractBudged-reproduced.csv - which platform does it come from?)

48 (right): I don't think "imports" is a proper term to use with one image being based on another one. As said above, properly explaining the FROM command is probably worth it for readers without Docker expertise.

## Page 9

17 (left): "We rely on .. to" - this seems odd, suggest to rephrase. Maybe "We rely on Docker Images stored on DockerHub for the physical conservation." ? Still suggest to rethink this, as DockerHub might disappear any day, while a proper data repository (Zenodo, OSF, figshare, b2share) is more likely to actually "conserve" data.

24 (left): "so that any user inspect and improve them" - add "can" ?

20 (right): "similar enough": Please clarify your criteria! What is "enough" - should the workflow be executed, or be executed and have the same results? You do say that later, so I suggest to consider striking this unfortunate phrasing.

25ff (left): Please reconsider "storable" as an evaluation criterion, same as lightweight. If you compare the disk usage: where is that data?

29 (right): "With the SPARQL query in Listing 4 is easy to spot the differences between both execution environments." - I disagree, with the _result_ of that query a user could spot the differences. I suggest to also link to or include the result.

## Page 10

21ff (left): The relevance of the software requirements for Pegasus are unclear. You do not mention that for the other workflow software. Also, I suggest to make clear that you created the Docker Images for Pegasus while you could use existing ones for example for dispel4py.

31ff (left): "The workflow ..." - wow, that's a tough sentence to digest for a non-genomics expert. Please consider either explaining/adding references (Wikipedia?) to what SNP, GATK and haplotype are, or rephrasing in more general words. When a reader wants the details, there is [22, 23]. Please clarify why you use that workflow (was it readily available? published under an open license? typical?)

22 (right): "in yellow" "in green" - sentence is probably missing a reference to Fig. 3 ?

24 (right): It is unclear what "baw, gatk and picard" are. They are not in Fig 3.

28 (right): "some of its steps being probabilistic" - can you clarify why you are not able to set a seed and thus come to the same results? This is common for reproduction of randomised workflows.
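On seeding: many probabilistic tools expose a seed parameter, and fixing it makes repeated runs bitwise identical, as the minimal sketch below illustrates (whether the paper's GATK steps expose such an option is for the authors to check; the function here is purely illustrative).

```python
import random

def noisy_measurement(seed, n=5):
    # Fixing the seed makes the "random" draws fully deterministic,
    # so two runs with the same seed agree bit for bit.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

run1 = noisy_measurement(seed=42)
run2 = noisy_measurement(seed=42)
print(run1 == run2)  # True: seeded runs are identical
```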

34 (right): I think it is a bit risky to go from "similar outputs" to "successful reproduction". Please transparently define your criteria (which might not require bitwise equality), don't just use "similar".

## Page 11

34 (left): Cool that you use perceptual hashes for the comparison! The Zenodo links in the footnote do not seem to contain any images though.
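For reference, a perceptual (average) hash of the kind presumably used reduces an image to a short bit string whose Hamming distance tolerates re-encoding noise. The pure-Python sketch below works on a tiny grayscale pixel matrix for illustration; real pipelines downscale the image (e.g. to 8x8) and typically use a library such as imagehash.

```python
def average_hash(pixels):
    """Average hash of a grayscale image given as a 2D list of ints:
    one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    # Number of differing bits; a small distance means the images
    # are perceptually "the same" despite pixel-level noise.
    return sum(a != b for a, b in zip(h1, h2))

original = [[10, 200], [220, 30]]
reencoded = [[12, 198], [221, 29]]  # slight compression noise

print(hamming(average_hash(original), average_hash(reencoded)))  # 0
```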

39 (left): Footnote 25 is for "DockerHub" but the links in the footnote actually point to Zenodo. Please fix/clarify. It is very good that you put snapshots of your code on Zenodo! Please consider doing that for the other software you developed, too (like the annotator).

## Page 12

Footnotes 30 and 31 are supposed to go to DockerHub and GitHub judging by the text, but actually are Zenodo DOIs. Please clarify/fix.

25 (right): "obtained the same results" - please clarify how you checked that. It would be great if you, for example in the README of the results repository, could provide the commands for a reader to re-run the experiments, i.e. how you generated the files in results repository.

45 (right): "include the complete list of installed packages on the Dockerpedia GitHub" - which repository precisely?

31 (right): WINGS section does not report on results of a reproduction. Is WINGS used for MODFLOW-NWT (which is a subsection)?

## Page 13

17 (left): Why do you not run a line-by-line comparison of the CSV files for modflow (extractBudged.csv)? The images are just a visualisation, you can check the actual data. As such Fig. 9 does not add real value to the manuscript.
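The suggested line-by-line check is trivial to script; the file contents below are hypothetical stand-ins for the original and reproduced CSV outputs.

```python
import csv
import io

def csv_rows(text):
    # Parse CSV text into a list of rows (lists of string fields).
    return list(csv.reader(io.StringIO(text)))

# Hypothetical stand-ins for the two CSV files being compared.
original = "zone,budget\n1,0.50\n2,0.75\n"
reproduced = "zone,budget\n1,0.50\n2,0.75\n"

# Report every row index where the two files disagree; comparing the
# data directly is more precise than eyeballing rendered histograms.
diffs = [(i, a, b)
         for i, (a, b) in enumerate(zip(csv_rows(original),
                                        csv_rows(reproduced)))
         if a != b]
print(diffs)  # an empty list means the data match line by line
```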

28 (left): "different Docker version" - conflicts with page 9 "The Docker version (37) tested for this experimentation", doesn't it? If Docker versions differ, you should extend Table 1 to include that, and to include the precise architecture ("64" probably means bits, right? Could still be ARMv8-A or RISC-V, but probably isn't either.)

32 (left): "predefined VM image": It is unclear how you executed the workflow in the VM (I can guess that you started it, logged in, then executed it), please describe that in the previous sections though for transparency.

48 (left): Please re-consider using other means for comparing probabilistic outputs. Could you not introduce error margins? I guess the results might be a little bit different but should not contradict each other. Also this does not fit the later sentence "equivalent in terms of size and content". If the content is equivalent, then why not compare it directly?
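An error margin of the kind suggested here can be as simple as a relative-tolerance comparison; the numbers below are invented for illustration.

```python
import math

# Hypothetical outputs of two runs of a probabilistic workflow step.
run_a = [0.1031, 2.5540, 7.0012]
run_b = [0.1030, 2.5542, 7.0011]

# Instead of requiring bitwise equality, accept results that agree
# within a stated relative error margin (0.1% here).
equivalent = all(math.isclose(a, b, rel_tol=1e-3)
                 for a, b in zip(run_a, run_b))
print(equivalent)  # True: equivalent within the margin, not identical
```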

34 (right): "produces a histogram by zone": As stated above, I suggest to compare the data underlying the histogram, it should be more precise (and possible with a diff tool)

## Page 14

38 (left): "the graph does not have a conflict" - please explain: to me Java == 1.7 and Java >= 1.8 are a conflict, you can only have one of those.
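To the reviewer's point, whether two such constraints conflict is mechanically checkable: a conflict means no single version can satisfy both. A toy checker over (operator, version) pairs, not taken from the paper:

```python
def satisfies(version, op, bound):
    # Versions are compared as tuples, e.g. (1, 8) for "1.8".
    if op == "==":
        return version == bound
    if op == ">=":
        return version >= bound
    raise ValueError("unsupported operator: " + op)

def conflicting(constraints, candidates):
    """True if no candidate version satisfies every constraint at once."""
    return not any(all(satisfies(v, op, bound) for op, bound in constraints)
                   for v in candidates)

# The reviewer's example: Java == 1.7 and Java >= 1.8 together.
java_constraints = [("==", (1, 7)), (">=", (1, 8))]
known_versions = [(1, 7), (1, 8), (1, 9)]
print(conflicting(java_constraints, known_versions))  # True: a real conflict
```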

35 (right): "noiseless." - suggest to re-word. What does noise have to do with containers and annotations? Why does a correct execution of an experiment lead to security?

39 (right): "We can detect the similarities and differences between two versions of a image" - just to be clear: for your reproductions, did you perfectly recreate the environments, i.e. did you not have any differences, in versions of underlying libraries etc. ? I strongly suggest to provide the output of the comparison queries in the results repositories on GitHub.

42 (right): "The Docker images takes less disk space compare to Virtual Machine images" - while this is true in absolute numbers, for the argument to hold I would like to see the size of the data of the workflows. Or is the full data included in each image?
Also, the container for dispel4py is larger than the virtual machine for Pegasus - if you can store the dispel4py container, why can you not store the Pegasus VM?

Table 2: VM and image sizes for MODFLOW-NWT, WINGS are missing.

## Page 15

Fig 10.: Not all orange nodes are actually different, at least not in the figure. Only the text of the "Pegasus 4.8/4.9" node changes. If the SoyKB version changed too, I suggest to add the version in the node. Also, the "Java == 1.7" node did not change. I might be missing the point here, so a more extensive figure caption might help.

22 (right): Please check ReproZip and other tracing-based tools built on perf events, such as Parrot, to see if there might be overlap with this idea.

## Page 16

Reference 25: Please use one reference for each of the Zenodo repositories. It is really great (!) that you put your results on Zenodo, but the metadata there is really insufficient. "Commit cited in the master's thesis" is not helpful.

## git repositories

- I suggest to add a short introductory paragraph to each README so that readers understand _what_ the included analysis is about, ideally with a reference to the source/original paper.

- Please make sure all your repositories have a useful README, and include a LICENSE (at least one repository, for example, does not)


- Is it possible that I run the annotator myself? Could you add instructions to the README of ?

- Consider turning your GitHub projects into binders. It will allow readers (and reviewers..) to easily follow your steps! (Applies also to montage_results and internal_extinction_results)

## Final comments

Overall I found the manuscript well written and understandable, referencing relevant literature and related work. The results are original, but they could be reported more thoroughly and clearly.
The manuscript needs some edits, not least because I apply high standards for reproducibility, which I assume are in line with the authors' intentions, as their topic is strongly connected with computational reproducibility. I do think that all information that might be missing exists, and there is no need to re-run any experiments to fulfill my suggestions.


- Some arguments need fleshing out and critical review, especially in the introduction and related work. I understand some things seem obvious to developers, but I think (as a researcher also relying heavily on containers for reproducibility!) the paper should be clearer/more realistic on the problems solved and unsolved, and also about the relevance of some challenges.
- I am not an expert in semantic modelling, and since this topic and the presented solutions are surely relevant for other communities, I suggest to accommodate readers from other domains where possible.
- The possibility to detect changes between two environments seems like a "hidden gem" in the article and could be expanded upon.
- The results section reports on the results of the reproductions, but not on the results of the recreation of the environment. It would be helpful to see a Dockerfile example generated from the annotations (if that is how it works), and if the READMEs for the results repositories (or any repository where you see fit) contained the `docker run` commands to execute the workflow.
- The diverse workflows you use support the stability of your approach, but it is really hard for readers to understand what happens in a workflow specifically, because you need a lot of domain expertise. I suggest to rephrase workflow descriptions in more generic terms to provide a glimpse, and then reference the literature for details.
- "dockerpedia:" and "docker:" as namespaces are mixed, but probably only one is actually used?
- The extension of the Clair tool: Do you consider contributing them back to the original codebase?
- The conclusions could be balanced a bit with identified shortcomings of the approach.
- WINGS vs. MODFLOW-NWT relation is unclear (the latter is a subsection of the former, whose section does not report on any reproduction)
- Provide more examples: For me browsing on helped a lot.
- Source code of Annotator is missing in the text AFAICT, but it is a core part of the work. Should be more prominent, and ideally published in a repository with a DOI.
- One further comment on formatting: I am unfamiliar with the used template, but it would be helpful if all references contain a DOI, currently not all do. Suggest to use the prefix (not http://dx.doi...).