Review Comment:
The paper proposes the notion of "data journeys" in support of open science, offering a way to explain data science workflows at different levels of abstraction.
Developing more sophisticated abstraction models to represent the code with graph representations is a potentially impactful and promising area of research, especially for the transparency and interoperability of machine learning.
The paper touches on several topics, such as ontology design, graph extraction, and machine learning for classifying activity types.
However, the overall impression is that the work does not treat any of these topics in depth.
Notably, the paper neglects some previous related works.
This mostly concerns:
a) data science ontologies
b) previous efforts for extracting knowledge graphs from source code
The current paper should clearly state how it differs from related efforts and be more precise in terms of envisaged usage scenarios and the particular focus of the paper regarding such scenarios.
In particular,
1) regarding the formulation of the task:
What is a data journey, and what is not? How does a data journey differ from prior efforts to model data science workflows, such as Research Objects [21]? Is the difference mainly in the schemas/ontologies used to represent a data flow?
What problems can data journeys solve that are not solved by related past efforts? Conversely, which problems can they solve better? The paper should focus more on explainability issues than on generic data science workflow representation, which was already covered in previous works.
There is a need for better motivation on how the proposed method helps to explain the data flow or activity workflow.
2) Regarding ontologies:
Similar abstract representations of a data flow have been used for various tasks [6-10].
Could the ontologies associated with those efforts not be used to represent data journeys?
Activities include: Analysis, Cleaning, Movement, Preparation, Retrieval, Reuse, and Visualization.
Is the above list exhaustive? Can there be activities beyond those from the Workflow Motifs Ontology (reference needed)?
Some categories seem to overlap; for instance, Cleaning can be regarded as a type of Preparation.
There is a body of works on ontologies for representing data science experiments, such as:
Ilin Tolovski, Saso Dzeroski, Pance Panov: Semantic Annotation of Predictive Modelling Experiments. DS 2020: 124-139
Gustavo Correa Publio, Diego Esteves, Agnieszka Lawrynowicz, Pance Panov, Larisa N. Soldatova, Tommaso Soru, Joaquin Vanschoren, Hamid Zafar: ML-Schema: Exposing the Semantics of Machine Learning with Schemas and Ontologies. CoRR abs/1807.05351 (2018)
Pance Panov, Larisa N. Soldatova, Saso Dzeroski: Ontology of core data mining entities. Data Min. Knowl. Discov. 28(5-6): 1222-1265 (2014)
C. Maria Keet, Agnieszka Lawrynowicz, Claudia d'Amato, Alexandros Kalousis, Phong Nguyen, Raúl Palma, Robert Stevens, Melanie Hilario: The Data Mining OPtimization Ontology. J. Web Semant. 32: 43-53 (2015)
Here is even some overview:
Larisa N. Soldatova, Pance Panov, Saso Dzeroski: Ontology Engineering: From an Art to a Craft - The Case of the Data Mining Ontologies. OWLED 2015: 174-181
There exists a code ontology:
Mattia Atzeni and Maurizio Atzori. 2017. CodeOntology: RDF-ization of source code. In International Semantic Web Conference. Springer, 20–28.
3) Regarding graph extraction:
The evaluation objective "the feasibility of automatically generating a graph representation, anchored to the source code" addresses a task whose feasibility has already been demonstrated by earlier work, e.g.:
Kun Cao and James Fairbanks. 2019. Unsupervised Construction of Knowledge Graphs From Text and Code. arXiv preprint arXiv:1908.09354 (2019).
Azanzi Jiomekong, Gaoussou Camara, and Maurice Tchuente. 2019. Extracting ontological knowledge from Java source code using Hidden Markov Models. Open Computer Science 9, 1 (2019), 181–199.
and similar recent work:
Ibrahim Abdelaziz, Julian Dolby, Jamie P. McCusker, Kavitha Srinivas: A Toolkit for Generating Code Knowledge Graphs. K-CAP 2021: 137-144
There are machine learning approaches to summarize code:
Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. arXiv preprint arXiv:2005.00653 (2020).
Therefore, the research question or hypothesis might be: how does the proposed approach for extracting a graph from code differ from previous efforts, and in what way is it better? Or perhaps: how does its purpose (explainability, transparency) differ?
4) Also, from a methodological point of view, the paper re-uses the generic Workflow Motifs Ontology, while the authors specifically apply it to data science workflows.
Data science workflows have common structures, such as evaluation protocols (cross-validation, leave-one-out, etc.), and comprise distinct ML phases.
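To illustrate the point (a sketch using scikit-learn, not taken from the paper under review): two standard evaluation protocols share the same workflow shape and could be represented explicitly by a data-science-specific activity model, rather than subsumed under generic motifs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: the dataset is split into 5 folds,
# each fold serving once as the held-out test set.
cv_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out: the same protocol structure with fold size 1,
# i.e. one fit-and-evaluate cycle per sample.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(cv_scores.mean(), loo_scores.mean())
```

Both protocols instantiate the same split/train/evaluate/aggregate pattern, which is exactly the kind of recurring structure a data science workflow ontology could capture.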
The chosen activity model may be too abstract. For instance, considering Fig. 3: how can the proposed graph be used? In what scenarios does such a graph add value? It looks like a generic data science workflow: data loading, then pre-processing, analysis, and finally visualization. That is correct, but what is the added value in this particular example?
Additionally, the paper (and its title) speaks of a data journey (which I imagine as a semantically annotated data flow), but in the end what we obtain is an activity flow, as in Fig. 3?
5) Other remarks:
a) Missing references when names of artefacts or methods are first mentioned:
Workflow Motifs Ontology
CodeBERTa
BERTcode
b) Question regarding activity categorization:
Why is computing tanh in :Analysis, while computing the mean is in :Preparation? What was the exact criterion here?
c) Fig. 1 is unreadable.
d)
"Parameter: any data node which is not supposed to be modified by the program but is needed to tune the behaviour of the process. For example, the process splits the data source into two parts, 20% for the test set and 80% for the training set. 2, 20%, and 80% are all parameters."
This naming convention may be misleading in the cited domain of machine learning. In particular, this is the definition of a hyper-parameter in machine learning. Parameters, in contrast, are the values that are actually changed, i.e. optimized while training on the training set.
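A minimal sketch of the distinction (using scikit-learn purely for illustration; the data and model are invented, not from the paper): the split ratio is a hyper-parameter fixed by the user before training, whereas the model coefficients are parameters whose values are optimized during training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# test_size=0.2 is a HYPER-parameter: chosen by the user, never optimized.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# model.coef_ holds the PARAMETERS: values changed (optimized) by training.
print(model.coef_)
```

Under the paper's definition, both 20% and the learned coefficients would be "parameters", which conflicts with standard ML terminology.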
e) Typo:
modesl
f) Overclaim: the proposed ontology is described as rich; compared to pre-existing data science ontologies, it is relatively not.