Review Comment:
The main objective of the present work is to investigate whether and how formal ontology can be exploited to generate multiple choice questions for educational assessment tests. Automatic generation of assessment tests from natural language or formal representations is an important goal in educational and other research domains, as manual test construction is time-consuming and tedious (most educators would agree).
After sufficiently justifying the relevance of their work, the authors state four main research objectives. The work is laid out as a series of "contributions" extending a prior publication by the authors. These contributions deal with a wide range of theoretical and methodological topics related to the generation of multiple choice questions (MCQs) for educational assessment: (1) the logical structure of MCQs, the generation of (2) generic and (3) specific MCQs from OWL ontologies, (4-6) three different forms of heuristics for question-set generation, (7) the determination of question difficulty, (8) how difficulty could be used to control question-set generation, (9) distractor generation, and (10) an evaluation based on a complex psychometric model.
In its current state, the work cannot be fully appraised. Although the framework of the proposed methods may be sufficiently justified (with one exception), the results of the automatic question generation process cannot be evaluated: the empirical evaluation of the resulting question sets does not meet the standards necessary for interpretation.
The authors organized their work around the different topics in a consistent and logical way. Nevertheless, the work is difficult to read, for the following main reasons:
1. The authors do not focus on *one* main research question to which the other topics are subordinated.
2. In some subchapters the authors repeat (formal) background knowledge on the subtopics that is not used in the rest of the paper and could be retrieved elsewhere (e.g. the formal model of multiple choice questions).
3. In some chapters the authors describe their new methods very formally (e.g. with extensive pseudo-code).
4. No consistent real-world example is introduced and followed throughout the paper.
5. Some of the examples given are nonsensical (hasPopulation(arizona, 3232323)... ) or very trivial (Table 6).
6. Legends to tables and figures are generally not instructive.
7. The authors integrate methodology from different research areas (e.g. ontology engineering, ontology metrics, programming, psychometrics, education); this cannot be avoided given the topic of the work, but it demands even greater care for clarity.
MAJOR REVISION (GENERAL)
The authors should address the issues stated above:
First, they should focus their work: decide what is essential, what is important, and what can be left out. Formalism should be reduced to what is necessary. Theory, justifications, and methods should be explained clearly. An instructive real-world example should be followed throughout the paper. The authors should try to increase understandability for readers from different knowledge domains by avoiding theoretical complexity (which only obfuscates the main topics) and should try to communicate *their message*. If needed, the authors could provide an appendix with the theoretical background and justifications.
ABSTRACT
Major Compulsory Revisions:
The authors spend nearly half of the abstract reasoning about the motivation of the work. They follow with a description of the objectives, which should be more precise. The abstract does not include any information on methods and results. It should include precise information on objectives, methods, results, and conclusions.
INTRODUCTION
MAJOR REVISION
"Ontologies, the knowledge representation structures which are useful in modeling knowledge ..."
I would suggest avoiding this formulation. Many ontologists would not fully agree with this statement. If you would like to write something general about ontologies, there are more widely agreed-upon definitions available.
MAJOR REVISION
"Bloom's taxonomy, a classification of cognitive skills ...".
The same as above. Most educators would not agree with this. Bloom's taxonomy is "a construct" that is not easy to define, but it is not a classification, because its categories are not disjoint (which has been shown in many experiments). The best option would be to define it as a taxonomy of educational objectives for the cognitive domain.
MAJOR REVISION
As already written above, clearly state the research question of this work and subordinate the other topics to it.
6. QUESTION-SET GENERATION HEURISTICS
Please provide an argument why *property sequences* that are present in all instances make questions *trivial* and should therefore be excluded. Why should that be? Consider biomedical ontologies (e.g. the Foundational Model of Anatomy): the parthood relations are *essential* to anatomy, they define the topology of anatomy, and they are the prevailing relations (properties) in anatomy. Knowledge of anatomy is knowledge about part-of and has-part.
The example provided in the text, (isProducedBy, isDirectedBy), does not lead to trivial questions. On the contrary, the example you provide that should result in "better" question stems leads to nonsensical question stems (Table 6: none of these stems is actually useful, independently of "Popularity"! From the titles of the works used in these questions, the correct answer can be guessed in all instances).
10. EVALUATION
The authors provide an empirical evaluation of the automatic question-stem generation techniques based on two methods. In the first approach, ontology-generated questions were compared to human-generated questions by matching them via a similarity function.
The second approach tested for "stem hardness" using item response theory (IRT). IRT is essentially a framework for describing the characteristics of items in a psychometric test (test theory) that overcomes some limitations of classical test theory (CTT). The drawback of IRT is that even experienced educators, psychologists, and statisticians have difficulties with its application and interpretation due to the underlying complex statistical models. Readers without a strong psychometric background can interpret and appraise data from educational assessment experiments much more easily when the data are presented first descriptively, then based on classical test theory, and only then based on IRT.
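For illustration, the simplest such model, the dichotomous Rasch model (referred to again below), specifies the probability that person p answers item i correctly in terms of the person's ability \theta_p and the item's difficulty b_i, both on a common logit scale:

    P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}

Both parameter sets have to be estimated jointly from the full response matrix, which is precisely what makes the results hard to interpret without a psychometric background.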
MINOR REVISION:
The authors should consider using the term "difficulty" instead of "hardness". In the educational literature, "difficulty" is the more common term when referring to the corresponding quality of items and scales.
MAJOR COMPULSORY REVISION:
The empirical evaluation is critical for readers to assess the effectiveness and validity of the newly proposed method. Therefore, it should comply with available scientific standards, in this case the standards for educational and psychological testing (see e.g. Streiner et al., Health Measurement Scales, 2015). Scale and test construction can also be subsumed under the label of diagnostic tests. Different standards are available for the reporting (and the construction) of studies on test accuracy, e.g. the STARD and GRRAS initiatives. Documents describing STARD are available from the EQUATOR Network website: [http://www.equator-network.org/]; for GRRAS, see: Kottner, J., Audigé, L., Brorson, S., Donner, A., Gajewski, B. J., Hrøbjartsson, A., et al. (2011). Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Journal of Clinical Epidemiology, 64, 96–106.
Not all of the criteria proposed by STARD are applicable to this work; however, without more essential information, the reliability of the data and the validity of the experiment cannot be assessed by the reader. Missing are, e.g., the eligibility criteria for participants in the study: how were they selected, and how were they recruited? Etc.
MAJOR COMPULSORY REVISION
A self-assessment is (usually) not an appropriate instrument for measuring a trait (like cognitive skills) in the context of a (comparative) study on test characteristics. From the educational literature it is well known that the human ability for self-assessment (of skills) is very limited: better students assess their skill as worse than their actual skill, and weaker students assess their skill as better, so that self-assessments tend to be distributed homogeneously, independently of actual skill. In a setting in which a new methodology for assessment/diagnostic tests is to be tested, objective, established measurement techniques (e.g. a multiple choice test) are required for comparison with the new method.
MAJOR COMPULSORY REVISION
The sample size for a *reasonable* item response theory analysis (e.g. based on the Rasch model) should be at least 30 individuals (and, in this case with three categories, at least 50); see Streiner et al., Health Measurement Scales, 2015, Chapter 12, which cites [http://www.rasch.org/rmt/rmt74m.htm]. Only 15 persons were tested in the actual evaluation setting, and these were further categorized by ability into three groups.
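A rough calculation (assuming the dichotomous Rasch model sketched above and well-targeted items, i.e. success probabilities near 0.5) shows why 15 persons are too few. The standard error of an item difficulty estimate is the inverse square root of the item's Fisher information summed over the N test takers:

    SE(\hat{b}_i) \approx \frac{1}{\sqrt{\sum_{p=1}^{N} P_p(1 - P_p)}} \approx \frac{1}{\sqrt{0.25\,N}} = \frac{2}{\sqrt{N}}

With N = 30 this gives SE ≈ 0.37 logits (a 95% confidence interval of roughly ±0.7 logits), whereas with N = 15 it gives SE ≈ 0.52 logits (roughly ±1 logit), which is too imprecise for a stable item calibration.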
The experiment should be repeated with a sufficient sample size and an independent multiple choice test on Data Structures and Algorithms (DSA). It should be planned, conducted, and reported according to reporting guidelines (e.g. STARD).
DISCUSSION
Minor Compulsory Revisions:
In the discussion the authors summarize their work and discuss limitations and future research, but they do not discuss their results in view of existing results and literature. The discussion should be improved to reflect the gains, similarities, and differences of the new method compared to existing methods of automatic multiple choice question generation.
LANGUAGE:
Major Compulsory Revisions:
The text should be revised by a native speaker of English. The authors have left out many articles and frequently use unidiomatic expressions.
Comments
For the reviewers: we identified a mistake in a sentence.
Please consider the following change in the manuscript:
In Section 6.1.1, the sentence "In the PSTS calculation, the key-variable is taken as the reference position r for finding the potential-set" should be modified to "In the PSTS calculation, the position of the reference-instance (introduced in Section-4) is taken as r for finding the potential-set."