Dataspecer: Development of Consistent Semantic Data Specification Ecosystems

Tracking #: 3954-5168

Authors: 
Stepan Stenchlak
Jakub Klimek
Petr Škoda
Martin Necasky

Responsible editor: 
Oscar Corcho

Submission type: 
Tool/System Report
Abstract: 
To achieve interoperability for effective data exchange on the web, we need a contract. Depending on the field, we may refer to technical data schemas, web vocabularies, or, more generally, to data specifications (DS). The development and management of these DSes can become difficult in complex domains with multiple stakeholders and related DSes involved. In this paper, we present Dataspecer, an open-source, modular web application for the development of semantic data specifications (SDSes), DSes that target the semantic and technical layers of data exchange. Dataspecer allows users to design web vocabularies and their application profiles, maintaining relations between reused concepts and their original SDSes. Furthermore, Dataspecer assists users in the creation of technical artifacts such as schemas for JSON or XML, while maintaining consistency of the artifacts with the application profiles. We motivate the need for SDSes and derive requirements for such a tool. In case studies based on the ecosystem of DCAT-based specifications, we demonstrate that SDSes created in Dataspecer meet these requirements and are of higher quality. We show SDSes that were created directly in Dataspecer, and in the evaluation section, we argue that using our tool is more efficient than creating them manually, even for smaller domains.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 02/Jan/2026
Suggestion:
Major Revision
Review Comment:

This paper introduces Dataspecer, an open-source, modular web application that supports the end-to-end authoring and maintenance of Semantic Data Specifications (SDSes). The authors describe them as specifications that preserve consistency across artifacts and reuse relationships, while jointly addressing (i) semantic interoperability (vocabularies + application profiles) and (ii) technical interoperability (schemas/artifacts for concrete serializations such as JSON and XML).

Regarding [(1) Quality, importance, and impact of the described tool or system], main contributions are as follows:
1. Problem motivation and user requirements for SDS tooling. The authors provide a clear motivation for why complex interoperability ecosystems require tooling beyond ad-hoc documentation. They also systematically derive 12 requirements for a tool supporting SDS ecosystems (based on real-world case studies).
2. A unified tool for vocabularies, application profiles, and technical schemas (named Dataspecer) that supports the development of vocabularies and application profiles with explicit handling of reuse and contextual refinement, including the generation of mappings between technical artifacts and semantic concepts.
3. An architecture with a flexible/extensible model. The paper presents a modular architecture with a core that can be adapted to different envisioned scenarios.
4. A use-case-driven evaluation (the DCAT standard and application profiles at the European and national levels), complemented with a usability evaluation comparing the manual approach and Dataspecer.

Also, the strengths of the paper are described next:
1. Timely research addressing a practical problem. The paper addresses a real pain point: maintaining consistency, reuse provenance, and evolvability in complex SDS ecosystems (e.g., application profile hierarchies).
2. End-to-end scope across semantic and technical layers. Unlike many tools that stop at vocabulary editing or validation, Dataspecer explicitly bridges the gap between conceptual models and technical schemas (including mapping artifacts that reconnect the layers).
3. Explicit treatment of application profiles and reuse-with-modification. The authors convincingly argue that common practices around application profiles (context-specific labels/definitions, domain/range constraints, cardinality constraints, or multi-term profiling) are not naturally represented in plain RDF/OWL and are often relegated to general documentation. Dataspecer’s explicit profiling is a strong conceptual contribution.
4. A modular tool that can be extended as required.
5. Grounding in real ecosystems and deployments. The DCAT case study on Czech public administration deployments strengthens the paper’s relevance and maturity. The EOSC scenario is also a very interesting addition.

Finally, I am concerned about the following weak points of the paper and related recommendations:

1. Evaluation. The paper states that SDSes created in Dataspecer are of higher quality and, intuitively, this seems to be the case. However, quality is not fully defined with measurable criteria (e.g., fewer inconsistencies, fewer ambiguities, completeness of reuse metadata, reduced duplication, fewer downstream implementation errors). I recommend including in the paper at least one measurable quality metric for the case studies (e.g., the number of explicit reuse links captured, the number of context-specific profiles represented machine-readably, or detected constraint conflicts across profile hierarchies).
2. The productivity comparison is not at the same level. The authors compare the manual creation of only a JSON Schema with the Dataspecer creation of a full SDS (conceptual model + schema + additional artifacts). While the authors acknowledge the mismatch, it still complicates interpretation: a reader may struggle to conclude how much of the reported advantage comes from the tooling and how much comes from comparing different outputs. I recommend that the authors explain this further.
3. FAIR support remains partial. Requirement 12 is acknowledged as only partially met. The paper would benefit from a clearer, more concrete plan (or a current partial checklist) for closing this gap and from stating what users can realistically achieve today.
4. There are other fields in which profiles are used, for example, UML profiles in software engineering. It could be very interesting to include UML profiles in the related work section and discuss how they relate to application profiles.

Regarding [(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool]:

1. The paper has a well-structured narrative flow: motivation, requirements, tool, related work, evaluation(s) and conclusions.
2. Figures are illustrative, including the architecture diagram and UI screenshots.
3. Limitations are acknowledged: especially around FAIR support and usability challenges, which helps convey limitations rather than hiding them. This increases trustworthiness.
4. Some critical mechanisms are underspecified, e.g., structural modeling language and its transformations are essential for understanding technical interoperability support, but are not described in sufficient detail for readers to judge generality and limitations.

Regarding the [Long-term stable URL for resources]:

(A) Organization and README quality. The repository appears well structured and includes a top-level README.md with Docker run instructions (including port mapping and notes about persistent storage) and a description of local build expectations. Therefore, it has an overall good organization, and the repository is understandable for developers.

(B) Completeness for replication of experiments. For software replication, the Docker image instructions are a strong point and likely sufficient for starting a local instance. However, for replication of the paper’s evaluations, I did not find (from the repository landing page) any resource containing materials or data that would allow a third party to reproduce the paper’s evaluation. Therefore, replication is partial: strong for running the tool, weaker for reproducing evaluation results.

(C) Repository appropriateness for long-term discoverability. GitHub is widely used and discoverable, and it is a reasonable place to host open-source software.

(D) Completeness of provided data artifacts. Artifacts for tool use are largely present (source code, Docker image reference, and documentation).

Review #2
By Mario Scrocca submitted on 19/Jan/2026
Suggestion:
Minor Revision
Review Comment:

This paper presents Dataspecer, a modular, open‑source tool for the authoring, management, and publication of Semantic Data Specifications (SDSs). The system covers vocabulary creation, application profile (AP) design, and the definition of technical schemas in other formats (like JSON and XML).

1. Quality, Importance, and Impact
The paper addresses a highly relevant and timely problem. The number of application profiles built atop standard RDF vocabularies (e.g., DCAT, DCAT‑AP and its national/domain-specific variants) continues to grow rapidly, and maintaining consistency among these interconnected specifications remains extremely challenging. As someone with practical experience working on such specifications, I found the motivations compelling and the tool useful. In particular, the explicit management of reuse, profiling hierarchies, and cross‑context refinements responds directly to recurring difficulties in such projects.
The tool offers clear and practical value, especially through: (i) the structured support for semantic reuse and its machine‑readable serialisation using the DSV vocabulary, (ii) the hierarchical views of specifications also in the automatically generated artefacts (cf. Figure 13), (iii) the support for technical schemas semantically bound to the SDS. The authors also provide comparative tables and a rich related work section that clearly position Dataspecer in the ecosystem, highlighting similarities and differences w.r.t. existing solutions.

The impact is convincingly demonstrated through multiple real-world usages of the tool across different use cases (Czech FOS specifications, DCAT‑AP‑CZ, and EOSC‑CZ research data repositories) and the tool’s online presence. Having over 30 GitHub stars is a positive sign for a specialized tool, and the availability of a website, documentation, and a demo instance supports broader adoption.
Regarding limitations, I am not fully convinced by the proposed approach for transforming technical schemas to RDF (lifting mappings), and it is unclear why the system does not consider declarative mapping languages such as RML. In my experience, mapping JSON files only via a JSON‑LD context may fall short (e.g., when transforming deeply nested or recursive structures). Also, lowering mappings (from RDF to other formats) are mentioned, but it is not clear whether they are a feature of the tool (e.g., Figure 3 mentions only mappings FROM technical schemas TO the semantic level).
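To illustrate the concern (a hypothetical sketch, not taken from the paper or the tool; the terms and URL are invented for the example): a JSON‑LD context works well when the JSON shape already mirrors the target RDF graph, as in:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "title": "dct:title",
    "distribution": "dcat:distribution",
    "downloadURL": { "@id": "dcat:downloadURL", "@type": "@id" }
  },
  "title": "Example dataset",
  "distribution": { "downloadURL": "https://example.org/data.csv" }
}
```

But when the source JSON wraps values in intermediate objects that have no semantic counterpart, reuses a key with different meanings in different branches, or requires IRIs computed from several fields, such restructuring is hard or impossible to express with a context alone (even with JSON‑LD 1.1 features such as scoped contexts and @nest), whereas declarative mapping languages such as RML handle these cases via explicit iterators and term maps.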

2. Clarity, Illustration, and Readability
The paper is generally clearly written, well‑structured, and supported by informative figures and examples. However, terminology consistency needs improvement and I would suggest better stating some "definitions" to avoid confusion in the reading of the paper:
- (web) vocabulary vs. semantic data specification vs. application profile vs. profile: all these terms are used without clarifying the differences among them. I found some clarifications by looking at the DSV vocabulary, but I believe it is important to state them in this paper as well to make it self-contained.
- Referring to DCAT as a default application profile (DAP) is confusing in my opinion (and, as the authors acknowledge, also for users in the evaluation). I would stick to the SEMIC definitions classifying DCAT as a Core Vocabulary (https://semiceu.github.io/style-guide/1.0.0/terminological-clarification...). If the authors chose a different terminology in the context of the tool, this should be explicitly justified, and the relationship to existing definitions clearly explained.

The evaluation section should be improved, especially the section on "productivity and usability".
In the DCAT‑AP evaluation, the paper effectively highlights shortcomings of current AP representations (e.g., the limitations of relying solely on SHACL to represent certain intended usage), but does not always clearly explain how Dataspecer addresses these issues.
Regarding productivity and usability, I appreciate the effort to provide such an evaluation, but I find the presentation confusing. The authors present five use cases but then state they cannot be addressed in the evaluation; only after several paragraphs is it explained that the idea is to use a "cost" model based on smaller tasks to make an estimate. Furthermore, the cost computation is not sufficiently explained, making it difficult to interpret the values in Tables 2, 3, and 4. The statement that participants were asked to "assume expertise in relevant technologies" is also ambiguous: it is unclear whether participants were actual experts, or if the evaluation disregarded the time needed to gain expertise in Semantic Web technologies, the domain, or the tool itself. The paper mentions that many users were already Dataspecer users, but does not specify how many, nor does it clarify what users had to learn, especially in light of the later comment on Q10. These points should be clarified to strengthen the evaluation.

Minor issues:
- In the introduction, the long sentence starting with “The reuse increases the likelihood…” is difficult to understand.
- The bulleted list of Dataspecer features in the Introduction is not so easy to understand at that point of the paper (it becomes clearer later, after introducing the three use cases addressed by the tool).
- Figures 2–3 could better distinguish artefacts directly used by the tool (e.g., based on DSV or a custom format interpreted by the tool) from exportable artefacts.
- Page 7: it is stated that the result can include RDFS+OWL. It should be clarified whether Dataspecer allows editing OWL axioms.
- Page 10: when stating that “dct:title has a different title when reused in DCAT,” consider using “label” instead of “title”
- Page 13: clarify the sentence "(ii) evaluating the creation of APs would be uninformative because our premise is..."
- For long-term reference and citation, I recommend providing DOI-based versioned releases via Zenodo in addition to the GitHub codebase URL.

To summarise:
+ Reusable tool addressing very relevant use cases and requirements
+ The related work section provides a very complete comparison and cross-tool assessment of functionalities
+ Good cross-referencing of requirements and tools features with concrete implementations
- Terminology used can be better explained and aligned with SEMIC definitions
- The evaluation section, especially the part about usability, is not clear and should be improved
- Lifting/lowering mappings should be better explained and positioned w.r.t. the literature