Automated Generation of Assessment Tests from Domain Ontologies

Tracking #: 1312-2524

Vinu Ellampallil Venugopal
P Sreenivasa Kumar

Responsible editor: 
Michel Dumontier

Submission type: 
Full Paper
We investigate the effectiveness of OWL-DL ontologies in generating multiple choice questions (MCQs) that can be employed for conducting large scale assessments. This paper elaborates the details of a prototype system called the Automatic Test Generation (ATG) system and its extended version, the Extended-ATG (E-ATG) system. The ATG system was useful in generating multiple choice question-sets of a required cardinality from a given formal ontology. This system is further enhanced to include features such as finding the difficulty values of generated MCQs and controlling the overall difficulty-level of question-sets, to form the Extended-ATG system. This paper discusses the novel methods adopted to address these new features. While the ATG system uses at most two predicates for generating the stems of MCQs, the E-ATG system has no such limitation and employs several interesting predicate-based patterns for stem generation. These predicate patterns are obtained from a detailed empirical study of large real-world question-sets. In addition, the system incorporates a specific non-pattern-based approach, which makes use of aggregation-like operations, to generate questions that involve superlatives (e.g., highest mountain, largest river, etc.). We have tested the applicability and efficacy of the proposed methods by generating MCQs from several ontologies available online, and verified our results in a classroom setup, incorporating real students and domain experts.

Minor Revision

Solicited Reviews:
Review #1
By Amrapali Zaveri submitted on 13/Feb/2016
Minor Revision
Review Comment:

The authors have satisfactorily addressed all the issues that I had raised. In the current version, I only have a few suggestions/comments:
- Add the response to R3-9 as a footnote in the text itself.
- In the Introduction, you mention that you evaluate your approach with two domains, movies and US geography; however, DSA is neither.
- Fix Appendix C formatting.
- Have a native English speaker proofread the document for errors. There are still some minor grammatical errors that should be fixed.

Review #2
By Martin Boeker submitted on 14/Mar/2016
Minor Revision
Review Comment:

In their work, the authors describe a system to generate multiple choice questions (MCQ) from OWL DL ontologies based on their prior work in this domain. The authors describe several modules of an automated question generation and question set selection system. In an evaluation, the authors compare questions generated automatically with questions developed by experts and measure psychometric characteristics of an automatically generated and manually curated test set.

In response to the reviewer comments, the authors introduced several changes in their manuscript and described them in a letter to the reviewers. Generally, the present paper is better organized than the first version so that prior work can be distinguished from the current contributions. In many parts, the authors clarified the methods of their current approach and the corresponding results.

The work is ordered according to different technical and methodological approaches for automatic question generation, test generation and the evaluation of this approach. In each part of the paper, the authors maintained the classical structure of methods and results. One common short introduction with background and a short discussion section enclose these different parts. Due to the complex structure and content of the paper, it is still difficult to read and appraise. To get an overview of the methods and results, the reader has to retrieve them from each part of the paper. In the view of the reviewer, the authors could further improve the organization of their paper for better readability and further reduce the complexity of the overall paper.

In general, this work describes promising methods based on knowledge representation formalisms to automatically generate multiple choice items and tests. However, in the didactic domain, especially in the evaluation of the automatically generated questions, the paper could benefit from more expertise in educational sciences, especially in assessment.

# Title

The title correctly represents the contents of the paper.

# Abstract

All aspects of the paper are mentioned in the abstract. However, for a reader who has not read the paper, it might be difficult to discern and understand the different parts of the paper. It might be easier for the reader if the authors used a structured abstract form (objectives - methods - results - conclusion) and enumerated the different parts of their methods/results.

# Introduction

The review of the literature provides the theoretical background of the study and introduces the prior work of the authors. A general outline of the study is provided.

Due to the large scope of their approach, the authors have some difficulty providing a methodological framework for it. In the view of the reviewer, the authors provide a sound theoretical framework in the ontological and technical domain. However, the theoretical background in the didactic domain reveals some inaccuracies.

## Major Revisions

* In addition to the list of "four contributions" to prior work, the authors should clearly state all objectives of their work so that readers can find them at the expected position.
E.g. "The objectives of this work are (1) to describe ... based on prior work, (2) to develop a system implementing ..., and (3) to evaluate the implemented system with ...".

The most important (or at least an important) part of the work is the empirical study, which should therefore be stated as a main objective of this work. In the current version, the evaluation is not mentioned as an objective of this work at the expected position at all.

# Methods

Parts of the methods could have been rewritten for better readability with fewer formalisms, as indicated by reviewers in the last review round, especially without the pseudocode on page 13.

The evaluation of the automatically generated MC items was described in more detail so that results can be critically appraised by the reader.

In the reviewer's opinion, only a larger scale experiment based on *random selection of students and items* can provide conclusive evidence that automatic generation of test items provides items that *perform* as well as manually generated test items under real life conditions. The evaluation of students' competencies needs to be part of the experimental design and cannot be used to *select* students prior to the experiment. Researchers analyzing the psychometric properties of items and the competencies of the students need to be blinded; otherwise the probability of bias is very high.

## Major Revisions

* Page 3, section 2.1, last paragraph: "..., in our experiments as 4, as it is the standard practice in MCQ tests." This statement is not correct. It might be one standard; however, in many other countries 5 answers is the standard for MC tests.

* Page 6, last paragraph: "Since there are no rules as such for how a FQ should look like, ... " That is wrong. Substantial work has been done on guides/rules for how MCQs and MCTs should be designed (e.g. Haladyna TM, Downing SM, Rodriguez MC. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Applied Measurement in Education. July 2002;15(3):309–33; Ware J, Vik T. Quality assurance of item writing: During the introduction of multiple choice questions in medicine for high stakes examinations. Medical Teacher. 1 January 2009;31(3):238–43; Haladyna, Thomas M. Developing and validating multiple-choice test items. Routledge, 2012.). There are also prescriptive rules for how questions in high stakes examinations must look. E.g. in German medical high stakes exams only Type A questions are allowed, which need to fulfill further criteria (e.g. no double negation, clear wording, no cues). The authors should provide a review of this literature and discuss why the rules described there are not relevant as criteria for their item patterns.

# Results

The results are accompanied by specific data. Tables and figures are used efficiently. The labels of figures and tables are illustrative.

As stated in the prior review, parts of the results could have been written for better readability with less formalism.

# Discussion

For a long and complex work, the discussion is proportionally short. However, the authors introduced a section in which they discuss their work in comparison with other approaches. Many more references, especially from the educational domain, could have been considered by the authors.

## Major Revisions

* The authors should discuss the limitations of their different approaches for MCQ and MCT generation and especially their evaluation in this section.

* The authors should clearly state in the limitations above that the design of their study is not appropriate to provide conclusive evidence that the automatically generated items are as effective in assessment as manually generated test items. Although they provide some evidence for equal performance of automatically generated test items, they should plan for a larger scale experiment with random selection of students and items and with double blinding.

# References

## Major Revisions

1. The references are not formatted according to the style guide of the journal, which places the second name of the author first.
2. The author name of the last reference is not readable.

## Minor Revisions

* A reference the authors might want to include, from a medical education journal with very high quality standards, which compares automatic test item generation with manual test item generation: Gierl MJ, Lai H. Evaluating the quality of medical multiple-choice items created with automated processes. Med Educ. 1 July 2013;47(7):726–33.

# Other comments

## Minor Revisions

* Although revised, the paper should be checked again for language issues. Many grammatical, spelling, word order, and comma position errors should be corrected.