Automated Generation of Assessment Tests from Domain Ontologies

Tracking #: 1158-2370

Vinu Ellampallil Venugopal
P Sreenivasa Kumar

Responsible editor: 
Michel Dumontier

Submission type: 
Full Paper
OWL ontologies are structures used for representing the knowledge of a domain in the form of logical axioms. Research on the pedagogical usefulness of these knowledge structures has gained much attention recently. This is mainly due to the growing number of on-line ontology repositories and the ease of publishing knowledge in the form of ontologies. Another reason for this research trend is the changing style of education: more learners prefer to take on-line courses than to attend a course in a typical classroom setup. In this setting, assessments — both prerequisite and post-course evaluations — become a challenging task. In this paper, we explore an automated technique for generating question items such as multiple choice questions (MCQs) from a given domain ontology. Furthermore, we investigate in detail aspects such as: (1) how to find the difficulty level of a generated MCQ; (2) what heuristics to follow to select a small set of MCQs that are relevant to the domain; and (3) how to set a test with a higher, medium or lower hardness level. We propose novel techniques to address these issues. We tested the applicability of the proposed techniques by generating MCQs from several on-line ontologies, and verified our results in a classroom setup involving real students and domain experts.
Full PDF Version: 

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/Sep/2015
Minor Revision
Review Comment:

The authors present an approach for generating multiple choice questions from domain ontologies. Novel contributions of the paper include the generation of generic factual questions using patterns that involve more than two predicates, a novel generic approach for generating ontology-specific questions, an algorithm for determining the difficulty level of overall questions and an algorithm for determining the difficulty of the question stem.

A lot of work is presented in this paper and it is sometimes difficult to determine which aspects of the work are original and which are based on existing work by the same (or other) authors. Section 6 in particular both summarises existing approaches and introduces novel work without a clear separation between the two. An initial background/existing approaches section followed by the introduction of the original work would be preferable.

Most of the work is presented in abstract notation, which can be quite difficult to parse and therefore interrupts the flow of reading. The use of a consistent set of relevant examples to illustrate concepts throughout the paper would greatly improve readability. There are occasional examples, but they are not used consistently. In addition, as the same notations are used throughout the paper, a definitions section or table that can easily be referred back to would be very helpful.

The discussion of the evaluation of the new approaches is relatively short and should be expanded upon. In particular, a discussion of the results in context of other work is missing. Including even a small section on this would greatly enhance the quality of the paper.

Minor comments
- contributions listed in section 1 and expanded on in section 2 should reflect the order in which they are addressed in the paper
- figures and tables should be presented in the correct numerical order; currently table 11 appears before tables 9 and 10

Review #2
By Martin Boeker submitted on 22/Oct/2015
Major Revision
Review Comment:

The main objective of the present work is to investigate whether and how formal ontology can be exploited to generate multiple choice questions for educational assessment tests. Automatic generation of assessment tests from natural language or formal representations is an important goal in educational and other research domains, as manual test construction is time-consuming and tedious (most educators would agree).

After sufficiently justifying the relevance of their work, the authors state four main research objectives. The layout of the work is organized as "contributions" to a prior publication of the authors. These contributions deal with a wide range of theoretical and methodological topics related to the generation of multiple choice questions for assessments in education: (1) the logical structure of multiple choice questions (MCQs), the generation of (2) generic and (3) specific multiple choice questions from OWL ontologies, (4-6) three different forms of heuristics for question set generation, (7) the determination of the difficulty of questions, and how this could be used to (8) control question set generation, (9) distractor generation, and (10) an evaluation based on a complex psychometrical model.

In its current state, the work cannot be fully appraised. Although the framework of the proposed methods may be sufficiently justified (with one exception), the results of the automatic question generation process cannot be evaluated: the empirical evaluation of the resulting question sets does not fulfill the standards necessary for interpretation.

The authors organized their work, following the different topics, in a consistent and logical way. Nevertheless, the work is difficult to read for the following main reasons:

1. The authors do not focus on *one* main research question subordinating other topics.
2. In some subsections the authors repeat (formal) background on the subtopics which is not used throughout the paper and could be retrieved elsewhere (e.g. the formal model of multiple choice questions).
3. In some of the chapters the authors describe their new methods very formally (e.g. with a lot of computer pseudo-code).
4. A consistent real-world example which is used throughout the paper is not introduced.
5. Some of the examples given are nonsense (hasPopulation(arizona, 3232323)... ) or are very trivial (Table 6).
6. Legends to tables and figures are generally not instructive.
7. The authors integrate methodology from different research areas (e.g. ontology, ontology metrics, software engineering, psychometrics, education); this cannot be avoided given the topic of the work, but it requires even more care for clarity.


The authors should address the issues stated above:

First, they should focus their work: what is essential, what is important, and what can be left out? Formalism should be reduced to what is necessary. Theory, justifications and methods should be explained clearly. An instructive real-world example should be followed throughout the paper. The authors should try to increase understandability for readers from different knowledge domains by avoiding theoretical complexity (which only obfuscates the main topics) and try to communicate *their message*. If needed, the authors could provide an appendix with background theoretical justification.


Major Compulsory Revisions:

For nearly half of its length, the abstract reasons about the motivation of the work. The authors follow with a description of the objectives, which should be more precise. The abstract does not include information on methods and results. It should include precise information on objectives, methods, results and conclusions.



"Ontologies, the knowledge representation structures which are useful in modeling knowledge ..."

I would suggest avoiding this. Many ontologists would not fully agree with this statement. If you would like to write something general about ontologies, there are definitions available which are more widely agreeable.


"Bloom's taxonomy, a classification of cognitive skills ...".

The same as above. Most educators would not agree with this. Bloom's taxonomy is "a construct" that is not easy to define, but it is not a classification, because the categories are not disjoint (which has been shown in many experiments). The best approach would be to define it as a taxonomy of educational objectives for the cognitive domain.


As already written above, clearly state the research question of this work. Subordinate other topics.


Please provide an argument why *property sequences* which are present in all instances make questions *trivial* and should therefore be discarded. Why should that be? Think about biomedical ontology (e.g. the Foundational Model of Anatomy): the parthood relations are *essential* to anatomy; they define the topology of anatomy and are the prevailing relations (properties) in anatomy. Knowledge of anatomy is knowledge about part-of and has-part.

The example provided in the text, (isProducedBy, isDirectedBy), does not lead to non-trivial questions. On the contrary, the example you provide which should result in "better" question stems leads to nonsense question stems (Table 6: none of these stems is actually useful, independently of "Popularity"! From the titles of the works used in these questions, the correct answer can be guessed in all instances).


The authors provide an empirical evaluation of the automatic question stem generation techniques based on two methods. In their first approach, ontology-generated questions were compared to human-generated questions by matching over a similarity function.
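For illustration only: the paper's actual similarity function is not reproduced in this review, but such a matching step can be sketched with a minimal token-overlap (Jaccard) stand-in. All names and the threshold below are hypothetical, not taken from the paper.

```python
def jaccard_similarity(q1: str, q2: str) -> float:
    """Token-level Jaccard overlap between two question strings."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def match_generated_to_human(generated, human_questions, threshold=0.5):
    """Pair each ontology-generated question with its most similar
    human-authored question; a pair counts as a match only when the
    similarity reaches the threshold."""
    matches = {}
    for g in generated:
        best = max(human_questions, key=lambda h: jaccard_similarity(g, h))
        if jaccard_similarity(g, best) >= threshold:
            matches[g] = best
    return matches
```

Matched pairs can then feed directly into a precision/recall-style comparison between the generated and the human-authored question sets.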

The second approach tested for "stem hardness" using item response theory (IRT). Item response theory is basically a framework for describing the characteristics of the items of a psychometric test (test theory) that overcomes some limitations of classical test theory (CTT). The drawback of IRT is that even experienced educators and psychologists (and statisticians) have difficulties with the application and interpretation of IRT due to the underlying complex statistical models. Readers without a strong psychometric background can interpret and appraise data from educational assessment experiments much more easily when the data are presented first descriptively, then based on classical test theory, and then on IRT.
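For readers without a psychometric background, the simplest IRT model — the one-parameter logistic, or Rasch, model — can be sketched in a few lines. This is the generic textbook formula, not code from the paper under review:

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1-PL) model: probability that a person with latent
    ability theta answers an item of the given difficulty correctly,
    P = 1 / (1 + exp(-(theta - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))
```

When ability equals difficulty, the probability is exactly 0.5. The complexity the review alludes to lies not in this formula but in jointly estimating abilities and difficulties from response data, which is what makes IRT demanding in both statistics and sample size.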


The authors should consider using the term "difficulty" instead of "hardness". In the educational literature, "difficulty" is more common in referring to the corresponding quality of items and scales.


The empirical evaluation is critical for readers to assess the effectiveness and validity of the (new) proposed method. Therefore, it should comply with available standards in science, and in this case with standards for educational and psychological testing (see e.g. "Health measurement scales" 2015, Streiner et al.). Scale and test construction can also be subsumed under the label of diagnostic tests. Different standards are available for the reporting (and the construction) of studies on test accuracy: e.g. the STARD and GRRAS initiatives. Documents describing STARD are available from the EQUATOR network website: []; for GRRAS here: Kottner, J., Audigé, L., Brorson, S., Donner, A., Gajewski, B. J., Hrøbjartsson, A., et al. (2011). Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Journal of Clinical Epidemiology, 64, 96–106.

Not all of the proposed criteria from STARD are applicable to this work; however, without more essential information, the reliability of the data and the validity of the experiment cannot be assessed by the reader. This includes, e.g., the eligibility criteria for participants in the study: how were they selected and how were they recruited? Etc.


A self-assessment is (usually) not an appropriate instrument for measuring a trait (such as cognitive skill) in the context of a (comparative) study on test characteristics. From the educational literature it is well known that the human ability for self-assessment (of skills) is very limited (better students assess their skill as worse than their actual skill, and weaker students assess their skill as better, so that self-assessments tend to be distributed homogeneously, independent of actual skill). In a setting in which a new methodology for assessment/diagnostic tests is to be tested, objective, established measurement techniques (e.g. a multiple choice test) are required for comparison with the new method.


The sample size for a *reasonable* item response theoretic analysis (e.g. based on the Rasch model) should be at least 30 individuals (in this case, with three categories, at least 50); see "Health measurement scales" 2015, Streiner et al., Chapter 12, which cites []. Only 15 persons were tested in the actual evaluation setting, and they were further categorized by ability into three groups.

The experiment should be repeated with a sufficient sample size and an independent Multiple Choice Test on Data Structures and Algorithms (DSA). It should be planned, conducted and reported according to reporting guidelines (e.g. STARD).


Minor Compulsory Revisions:

In the discussion, the authors have summarized their work and discussed limitations and future research, but they do not discuss their results in view of existing results and literature. The discussion should be improved to reflect the gains, similarities and differences of the new method when compared to existing methods of automatic multiple choice question generation.


Major Compulsory Revisions:

The text should be revised by a native speaker of English. The authors have left out many articles and frequently use unidiomatic expressions.

Review #3
By Amrapali Zaveri submitted on 06/Dec/2015
Major Revision
Review Comment:

The article “Automated Generation of Assessment Tests from Domain Ontologies” proposes an approach to automatically generate test questions from a given domain ontology. The article is an extension of the authors' previous work, which they extend to include methodologies to determine and control the hardness of the questions and techniques to generate generic and ontology-specific questions.

The methodology is sound and is explained in detail. This approach could be applicable to generating quality educational material. However, my main concern is the evaluation. The experiments lack many significant details, which makes it difficult to judge the significance of the work, such as:
- how many experts were chosen for the evaluation?
- which domain did the experts belong to? In your datasets, there are at least three different domains.
- how were the experts selected?
- how much time did the experts take to generate the questions?
- did the experts mutually agree on all the questions?
- in Table 8, you compare your approach with random selection, but you do not explain the random selection clearly, and it is unclear why you do not compare with the expert-generated questions.
- for determining the hardness, how did you initially determine the participants' trait level?
- the sample size of the participants is extremely low to obtain significant results; in fact, an actual evaluation would involve having a set of people answer both the AG and BM sets and then having them determine the hardness score, which can be compared against the automatically generated one. This is mentioned only in the abstract and is not at all clear in the Evaluation and Results sections.
- two out of the three ontologies on which you perform the evaluation were developed by your group; instead, it would be very interesting to perform this kind of evaluation on an ontology like DBpedia, focusing on a particular domain, or on a biology-related ontology
- in Table 3, you also list the much larger “Restaurant” and “Job” ontologies, but they are never evaluated/used.
- there is no discussion on the scalability and performance of the approach
- in section 10.1.1 you say “the parameter c helps in avoiding questions which are semantically similar”; does that refer to the questions between the 25, 50 and 75 sets?
- in section 10.2.2, there is no report of the actual results, i.e. the number of MCQs that pertain to each difficulty level
- when I looked at Sets A, B and C, the initial questions are the same, which makes the significance of the split into 25, 50 and 75 questions unclear. Table 7 clearly shows that the generated question count surpasses each requirement; besides, these sets are not even used in calculating the precision and recall. Also, the result files of the project only contain factual questions and not MCQs, which are mentioned in Section 10.2.
- distractors are not used in the evaluation questions
- along with the test sets, providing the result data of the mapping with the AG-sets and the hardness categories would be useful; also provide examples in the text itself

Another important detail missing in this article is the related work. A few relevant works are discussed in the introduction. You should provide a Related Work section and for example, compare your work against “ASSESS — Automatic Self-Assessment Using Linked Data” and other related works.

The paper is well written and easy to follow. However, I have a few suggestions/comments:
- first of all, when it is your own previous work, I find it strange that you refer to it as “their” work.
- an overview figure would be useful in understanding the methodology
- “A generic (*ontology independent*) technique to generate *Ontology-Specific* factual-MCQs.” sounds strange.
- “this paper can be listed” - “this paper are listed”
- “From Restaurant ontology” - “From the Restaurant ontology”
- the link does not work
- “the 3 question-sets” - “the three question-sets”
- “ontology-Specific” - “ontology-specific”
- “50 percentage” - “50 percent”
- “from DSA ontology” - “from the DSA ontology”
- “Grammaticality between stem, key and distractors is another issue that is not addressed in this paper.” - Please rephrase
- “are singular number” - “are singular numbers”
- “from its intended” - “from their intended”
- “that how to” - “how to”
- “work were published” - “work was published”


Please consider the following change in the manuscript:

In Section 6.1.1, the sentence: "In the PSTS calculation, the key-variable is taken as the reference position r for finding the potential-set", should be modified as "In the PSTS calculation, the position of the reference-instance (introduced in Section-4) is taken as r for finding the potential-set."