XMLSchema2ShEx: Converting XML validation to RDF validation

Tracking #: 1680-2892

Herminio Garcia-Gonzalez
Jose Emilio Labra Gayo

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
RDF validation is a new field where researchers among the Semantic Web are putting their effort. However, migration to new formats and standards comes at a price. In order to facilitate and alleviate this transformation, this paper proposes a set of mappings that can be used to convert between XML Schema and ShEx—one of the new RDF validation languages—. Moreover, this paper presents a prototype that supports a small subset of the mappings proposed in it and a example of a XML Schema converted to ShEx with this prototype. This work and the development of other formats mappings could drive to a new era of semantic-aware and interoperable data.

Major Revision

Solicited Reviews:
Review #1
By Felix Sasaki submitted on 15/Aug/2017
Minor Revision
Review Comment:

This paper describes an approach to convert an XML Schema to ShEx. The paper is well written and the conversion approach is sound. However, some parts of the paper should be clarified.

1) The abstract reads "In order to facilitate and alleviate this transformation, this paper proposes a set of mappings that can be used to convert between XML Schema and ShEx—one of the new RDF validation languages—.": the methodology does not seem suitable for converting from ShEx to XML Schema, only the other way round. The authors should clarify whether their approach foresees a conversion from ShEx to XML Schema.

2) The methodology is described without a practical usage scenario. The experience described by Sasaki et al. at
showed that the usefulness of XML <> RDF conversion is hard to achieve in a general, abstract way. The authors should consider whether they foresee such validation being used in practice and, if so, in which (industry) fields it would be useful.

3) For some fields that use XML to describe semi-structured, textual data, the conversion may not actually be feasible, since XML formats like TEI (Text Encoding Initiative) have underlying schema models with ambiguous content models. These may be hard to tackle in an XML to RDF conversion. The authors should clarify whether they see particular challenges for semi-structured data and ambiguous content models.

4) The authors focus on XML Schema, but for semi-structured data and for business logic other schema languages are of high importance in the XML world: RELAX NG and Schematron. The authors name these schema languages, but should clarify if they intend to cover them in future research and what the challenges would be. Maybe the "semantics" described in section "5. Conclusion and Future Work" is specifically relevant for business rules / Schematron.

5) The approach does not cover the partial conversion of a schema. For a concrete usage scenario, converting only part of a schema may be sufficient, rather than converting the complete schema. The authors should discuss whether their approach could be adapted for partial conversion.

6) The authors state that extensions and restrictions are not supported. They should clarify whether this affects the feasibility of automatic conversion.

Review #2
By Emir Muñoz submitted on 22/Aug/2017
Major Revision
Review Comment:

This paper presents a set of mappings to convert existing validations in XML schema to validations in an RDF schema. More specifically, the mappings cover a subset of XML Schema constructs which are converted into Shape Expression (ShEx) constructs. The mappings are presented one by one and illustrated by a small example showing the original XML Schema construct and the target one in ShEx. Finally, the authors briefly describe a prototype that implements the presented mappings and illustrates its application using an XML document/schema pair and their counterpart RDF document/schema pair.


The paper touches on a valid point: organizations already have validation workflows in place for XML documents, and there is no tool to migrate these workflows to an RDF-based environment. However, the paper fails to properly motivate questions such as: why such a tool is valuable, how a (lay) user can benefit from having such a tool, and why the focus is restricted to XML Schema without considering other schema languages such as Document Type Definition (DTD), Schematron or Relax NG (which happens to be the direct counterpart of, and inspiration for, ShEx).

First, the document assumes too much familiarity with both XML Schema and ShEx, missing the didactic component required when presenting and promoting the adoption of a new tool or system. The work presents a relevant application that could aid the migration of XML databases to RDF graphs without losing business rules and data semantics. However, because the paper lacks a research question and a set of hypotheses, and instead describes a set of mappings and a tool that implements them, it does not fully make a research contribution. I believe the type of submission should be reconsidered and better classified as an 'Application report' or a 'Report on tools and systems', in which case the paper would be evaluated along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper.

In general, the paper needs a lot of work to improve readability. In my opinion, there is a clear need for a minimal running example, including a source XML document and schema and a target RDF document, from the beginning. I would recommend introducing such a running example in the introduction, or as Section 2 right after the introduction, rather than saving it for the end of the paper. That way, readers can understand from the beginning what to expect from the rest of the paper.

In the following, I give my comments per section. (Please use the same format in your response.)

### Abstract

1. ShEx is used as an abbreviation without previous definition.
2. “a small subset of the mappings proposed in it and a example of a XML Schema converted to ShEx with this prototype.” -> “a subset of the mappings proposed, and an example application to obtain a ShEx schema from an XML Schema one.”
3. What do the authors mean by “other formats mappings”?
4. Clean up the keywords and put just enough and necessary keywords.

### Section 1

1. “Validation is one of the key areas ...” -> “[Data] validation is one of the key areas ...”
2. The first paragraph is not clear enough. What is the role of validation in data management? This should be clearer in order to arrive at the stated conclusion “validation is a key field of data management”.
3. It is not clear why the chosen XML schema language is XML Schema and not another such as DTD, Schematron or Relax NG, considering that the last of these served as an inspiration for ShEx!
4. It is unclear why XML Schema is “more convenient” than DTD, considering that DTD is simpler but not more expressive. Please clarify the point.
5. In the sentence “new possibilities come to scene” it is not clear whether the possibilities are for XML or RDF.
6. “As XML has its own … it was some that RDF lacked” -> rephrase the whole sentence to something simpler like: “Unlike XML, RDF lacks a standard schema language.”
7. What does XML Schema do for XML that other languages don’t?
8. Although XML Schema is a central topic, a reference to it is missing. A reference for Shape Expressions is also missing!
9. “[11][7][1][4]” -> please merge the references “[1, 4, 7, 11]”
10. “How to be sure … are defining the same nature.” -> do you mean semantics instead of nature?
11. The paper is not clear about what the research question is. I suggest the authors check the work of Marcelo Arenas and Leonid Libkin, among others, in the area of schema mapping and data exchange, which could guide the definition of clearer research questions.
12. Use “Section” in upper case when describing the structure of the paper.
13. “section 2 describes how to convert each element from XML Schema to ShEx notation.” -> Section 2 presents the background!

### Section 2

1. It is not clear from the text why “conversion” is an active research field.
2. The second sentence of the first paragraph is way too long. The same for the second paragraph, which is a single sentence! Please break up the text into smaller sentences.
3. When describing related work, please clarify why a given work is related to yours, how it differs from yours, and what you do differently.
4. “However, data validation is a key question as it has” -> Statements like this one should be backed up with a reference to other papers, surveys, literature reviews, etc.
5. “Another approach is to take” -> Another approach for what exactly?
6. “Based in the Semantic Web” -> "Based on the Semantic Web". Which aspect(s) of the Semantic Web? Could you be more specific here?
7. “However, neither OWL nor RDF Schema capture all the constraints that are supported by XML Schema and that RDF needs as it is stated by [16].” -> "RDF Schema captures". What are examples of the constraints in question here? Is [16] the only and most adequate reference here? What about (i) Waseem Akhtar, Alvaro Cortés-Calabuig, Jan Paredaens: Constraints in RDF. SDKB 2010: 23-39; (ii) Georg Lausen, Michael Meier, Michael Schmidt: SPARQLing constraints for RDF. EDBT 2008: 499-509; among other previous works.
8. “To our knowledge, due to its recent appearance, no XML Schema to ShEx conversion has been proposed.” -> Do you mean the recent appearance of ShEx? -> What about paraphrasing the text to: “To the best of our knowledge, no conversion between XML Schema and ShEx has been proposed to date. This might be due to the recent introduction of ShEx.”
9. “In this paper, a transformation from XML Schema to ShEx is proposed and how each element could be handled on its individual translation.” -> It is missing the verb of the last sentence. -> “... a transformation … is proposed, indicating how each element ...”

### Section 3

1. Please introduce the use of prefix `xs`.
2. “Terminal expressions of a shape one” -> What is a shape one?
3. For each mapping showing the source and target constructs, I would recommend putting them together in the same listing. This will help readers make the connections when more than one listing is shown. It could also be beneficial to differentiate between the mapping itself and the example of a mapping. For instance, a caption could be added to each listing stating whether it is a mapping or an example. This will also help to remove duplicated examples by referencing previous ones. Additionally, please consider a different font type for the listings, something different from the main text font type, e.g. \texttt{}.
4. Cardinalities, i.e., *, +, are used from Section 3.1, but only introduced in Section 3.6.9. A reorganization of the mappings could help.
5. In Section 3.3, it is not clear what a complex type is, or how and when such types are used. In general, a short motivation for each construct and mapping is missing at the beginning of the sections.
6. “more complicated due to the RDF graph schema” -> Do you mean graph structure? It seems that there is more to say about that statement.
7. First listing in Section 3.3.1, “address” -> “Address”
8. Second listing in Section 3.3.2, the original example only gives a ‘maxOccurs=unbounded’, but in the target, a cardinality ‘+’ is used. This assumes a ‘minOccurs=1’, which is not stated. A cardinality ‘*’ would be more appropriate.
9. “all is instead a set” -> This is an example where the use of a different font type comes in handy to distinguish ‘all’ as a construct.
10. “For translation of restrictions see 3.6 section” -> “See Section 3.6 for the translation of restrictions” Avoid sending the reader back and forth as much as possible. Instead, provide a first intuition about the topic and refer to another place (e.g. supplemental material) for further information.
11. “These lists are supported in RDF using RDF Collections.” -> Could you provide an example of this? The mapping for lists lacks a clearer explanation since notions of recursion are required for understanding but not mentioned.
12. “Complex contents for complex types and simple contents for simple types” -> There seems to be more to say here. It reads as intuitive but lacks an example.
13. Use different font type for: ‘group’, ‘all’, ‘choice’, ‘sequence’, etc
14. “For translation into ShEx the restriction elements must be taken and transformed directly into a new shape that defines the child shape. ^4.” -> This requires an example to clarify the idea. Missing comma after “ShEx”, and remove the extra dot.
15. There are mentions of ‘base type’, ‘base simple type’, and ‘base complex type’. For the first, the text refers to Section 3.7, but it does not provide a definition for any of them.
16. At the end of page 5, you mention using the normal syntax provided by ShEx to create two shapes from the respective restriction or extension. Is this a workaround for some existing issue? Does this change the semantics of the final shapes? Please provide more information.
17. In Section 3.6.7, what are ‘semantic actions’? Could you provide an example for that?
18. Examples for Section 3.7 are missing.
19. Table 1 lacks a reference in the text.
20. The mapping in Section 3.7.2 does not seem to be unique; other mappings may exist. In the example, the SKU type is missing. Why?
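
To make the cardinality comments above (items 4 and 8, and the {m, n} notation) concrete, here is a minimal Python sketch of how `minOccurs`/`maxOccurs` could be turned into a ShEx cardinality suffix. It is not taken from the authors' prototype: the function name `shex_cardinality` is hypothetical, and the sketch assumes the XML Schema rule that both attributes default to 1 when omitted.

```python
def shex_cardinality(min_occurs=None, max_occurs=None):
    """Return a ShEx cardinality suffix for XSD minOccurs/maxOccurs.

    Arguments are the raw attribute strings, or None when the attribute
    is absent from the schema (hypothetical helper, for illustration only).
    """
    # Assumption: XML Schema defaults both attributes to 1 when omitted.
    m = 1 if min_occurs is None else int(min_occurs)
    unbounded = max_occurs == "unbounded"
    n = None if unbounded else (1 if max_occurs is None else int(max_occurs))

    if n is not None and m > n:
        raise ValueError("minOccurs must not exceed maxOccurs")

    if unbounded:
        if m == 0:
            return "*"           # zero or more
        if m == 1:
            return "+"           # one or more
        return "{%d,}" % m       # at least m
    if (m, n) == (1, 1):
        return ""                # exactly one: no suffix needed in ShEx
    if (m, n) == (0, 1):
        return "?"               # optional
    if m == n:
        return "{%d}" % m        # exactly m
    return "{%d,%d}" % (m, n)    # between m and n
```

Under that default, an element declaring only `maxOccurs="unbounded"` yields `+`, and `*` results only when `minOccurs="0"` is stated explicitly; a converter that does not want to rely on the default would need to say so.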

### Section 4

1. In the text, there is a mention of some supplementary material. Where can this material be accessed?
2. What is the ShEx Compact Format? It seems that a definition or proper reference is missing.
3. “The example presented below” -> Use captions for the listings and refer to them using \ref{}
4. As I mentioned before, the example of this section should be moved to the beginning of the paper, and maybe using both columns to improve readability of the documents.
5. Minor issues with the conversion in the example: (1) zip should be of type xs:integer; (2) quantity element is restricted using maxExclusive but converted to {1, 99} instead of using MAXEXCLUSIVE 100 as per Section 3.6.5; (3) for shipData only a cardinality of minOccurs=0 is given but translated to ? (i.e. 0 or 1), which is wrong from the data perspective but right from a business perspective; (4) the same for comment in PurchaseOrderType.
6. After reading the document, I noticed that a core research question is missing that could be worth pursuing: What are the conditions to ensure a valid conversion? This is not an easy question, but it is a relevant one that is never mentioned in the text.
7. “that are being developed by other researchers in the community” -> Please provide references to support this statement.
8. Figure 1 requires a bit more explanation of what exactly is being displayed on the screen.
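
As an illustration of item 5(2) above, the following Python sketch shows how value-range restrictions could be emitted as ShEx facets rather than as cardinalities. It is a sketch under stated assumptions, not the authors' implementation: `FACET_KEYWORDS` and `node_constraint` are hypothetical names, only facets with a direct ShExC keyword are covered, and the base type `xs:positiveInteger` for the quantity element is assumed here for the example.

```python
# XSD restriction facet -> ShExC facet keyword (direct counterparts only).
FACET_KEYWORDS = {
    "minInclusive": "MININCLUSIVE",
    "minExclusive": "MINEXCLUSIVE",
    "maxInclusive": "MAXINCLUSIVE",
    "maxExclusive": "MAXEXCLUSIVE",
    "length": "LENGTH",
    "minLength": "MINLENGTH",
    "maxLength": "MAXLENGTH",
    "totalDigits": "TOTALDIGITS",
    "fractionDigits": "FRACTIONDIGITS",
}


def node_constraint(base_type, facets):
    """Build a ShExC node constraint from a base type and XSD facets.

    `facets` is a list of (facet_name, value) pairs as read from an
    xs:restriction element (hypothetical helper, for illustration only).
    """
    parts = [base_type]
    for name, value in facets:
        if name not in FACET_KEYWORDS:
            # Facets such as pattern or enumeration need separate handling.
            raise ValueError("unsupported facet: " + name)
        parts.append(FACET_KEYWORDS[name] + " " + str(value))
    return " ".join(parts)


# A quantity restricted with maxExclusive=100 stays a value constraint.
print(node_constraint("xs:positiveInteger", [("maxExclusive", "100")]))
```

The point of keeping the facet is that it constrains the value of each quantity, whereas a cardinality such as {1, 99} constrains how many quantity triples a node may have.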

### Section 5

1. Something highlighted in the conclusion that was never motivated is why a migration from “old” data formats to “new” data formats is required. What are the benefits of doing so?

### References

1. Please, double check the references in the text: (1) ensure the right upper casing, e.g., XSLT, SPARQL, XML, etc.
2. There are missing years of publication, e.g., 1
3. There are missing venues of publication, e.g., 8, 9, 12, 14
4. There are missing pages for the publications
5. There is a missing URL for reference 6


### Minor wording issues

1. “documentation of a XML vocabulary” -> “documentation of an XML vocabulary”
2. “As XML has its own schema language --- or languages --- it ...” -> “As XML has its own schema languages, it ...”.
3. “And section 5 draws some conclusions and future lines of work and improvement.” -> “Finally, Section 5 draws some conclusions and future lines of work.”
4. “In XML community,” -> “In the XML community,”
5. “Starting from the example” -> “Starting from an example”
6. “All examples are using the default” -> “All examples use the default”
7. “While sequences were an ordered” -> “While sequences are an ordered”
8. “Therefore, transformation is” -> “Therefore, the transformation is”
9. “When a simple type is restricted transformation is done using the known base type (see 3.7)” -> The sentence could be rephrased, e.g., “For a simple type its restricted transformation can be done using the known base types (see Section 3.7)” However, this still reads weird, read comment 15 for Section 3.
10. “This a case of” -> “This is a case of”
11. “and elements to new base” -> “and elements to a new base”
12. “ShEx does support this feature as XML Schema” -> This sentence is odd, try to rephrase it and double check its intended meaning.
13. “Max length and min length are” -> “Maximum length and minimum length are”
14. “In ShEx definition of min and max length is made” -> “In ShEx, the definition of min and max length is made”
15. “Exclusive restrict the use” -> “Exclusive restricts the use”
16. “This is the same theory as in” -> “This is the same notion as in” or “function”
17. “In ShEx these features are supported directly” -> Missing comma. Also, prefer an active voice over a passive one, e.g., “ShEx supports these features directly”
18. “In ShEx white spaces options are” -> Missing comma, extra ‘s’. “In ShEx, white space options are”
19. “or ‘m, n’ for m to n repetitions where m is minOccurs and n maxOccurs” -> Missing ‘{’ and ‘}’. “or {m, n} for m to n repetitions, where m is minOccurs and n is maxOccurs”
20. “XSD types are used directly on ShEx” -> “XSD types can be used directly in ShEx”
21. “Although proposed mappings between” -> “Although, the proposed mappings between”. Please check the whole sentence and rephrase.
22. “XML data using XML schema” -> “XML data using XML Schema”?
23. “uses bnodes to represent” -> “uses blank nodes to represent”
24. “One future line that should be” -> “One future line of work that should be”

Review #3
By Simon Steyskal submitted on 01/Sep/2017
Major Revision
Review Comment:

The present paper proposes a possible set of mappings between components in XML Schema to their respective counterparts in ShEx. Additionally, the authors provide a PoC implementation for transforming XML Schema Definitions to ShEx schemas based on their proposed mappings.

I emphasize that I was torn between a major revision and a reject here, so the authors will need to make a considerable effort to address my concerns in a revision if they are to change my mind.

First of all, I definitely think that the present article has potential and that the topics discussed are practical and relevant, especially considering recent advancements in developing and standardizing languages for expressing constraints on RDF such as ShEx or SHACL (the latter became an official W3C Recommendation as of July 2017).

The present paper, however, is not ready to be published. Its biggest issues are:

presentation & quality - to make the paper more self-contained, at least a short introduction to ShEx explaining its core features/syntax should be provided. Not all readers are familiar with ShEx. Also, there are various parts throughout the article that are hard to follow because they lack proper explanations and/or contain typos and formatting issues that could easily have been detected if the paper had been proofread more thoroughly.

lack of contribution - just listing a bunch of possible mappings between xsd and shex + providing a PoC implementation that implements those mappings is in my opinion not enough of a contribution for a journal publication. In addition to what's already there I would have expected at least an introduction to ShEx, a critical discussion on the loss of semantics, and a PoC that actually implements >all< proposed transformations. On a side note, your implementation translates min/max In/Exclusive restrictions to cardinality constraints (cf. quantity element).

You can find a scan of my handwritten review at [1] (it was just too much to write everything down). Feel free to contact me on Skype (Jose has my contact info) if you have any questions and/or want to discuss my review/remarks in more detail.

br, simon

[1] https://github.com/simonstey/Reviews/blob/master/journal_swj/review_swj1...