A Comparative Study of Methods for a Priori Prediction of MCQ Difficulty

Tracking #: 2225-3438

Authors: 
Ghader Kurdi
Jared Leo
Nicolas Matentzoglu
Bijan Parsia
Uli Sattler
Sophie Forge
Gina Donato
Will Dowling

Responsible editor: 
Dagmar Gromann

Submission type: 
Full Paper

Abstract: 
Successful exams require a balance of easy, medium, and difficult questions. Question difficulty is generally either estimated by an expert or determined after an exam is taken. The latter provides no utility for the generation of new questions, and the former is expensive in terms of both time and cost. Additionally, it is not known whether expert prediction is indeed a good proxy for estimating question difficulty. In this paper, we analyse and compare two ontology-based measures for difficulty prediction, and we also compare each measure, as well as expert prediction (by 15 experts), against the exam performance of 12 residents over a corpus of 231 medical case-based questions. We find one ontology-based measure (relation strength indicativeness) to be of comparable performance (accuracy = 47%) to expert prediction (average accuracy = 49%).
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Vinu Ellampallil Venugopal submitted on 13/Sep/2019
Suggestion:
Accept
Review Comment:

The authors have addressed all my major comments and the revised version has been much improved. I now recommend the paper for publication in SWJ.

Review #2
Anonymous submitted on 05/Oct/2019
Suggestion:
Minor Revision
Review Comment:

This paper addresses automatic difficulty prediction for multiple-choice questions (MCQs). This topic is important for research on the automatic creation of assessment tools, especially using structured domain knowledge such as ontologies and knowledge graphs. The paper is well written and organized.

My comments are as follows:

1) My main comment is that the paper relies heavily on a previously published paper by the same authors. Even though the authors made a significant effort to distinguish the two papers, it is still confusing and difficult for a reader to identify the contribution of this paper from a Semantic Web standpoint.

2) page 1, lines 16-17: The statement "Question difficulty ... determined after an exam is taken ... provides no utility for the generation of new questions" excludes some iterative or machine learning approaches. It should be reconsidered more carefully.

3) page 1, line 20: The abstract makes quite a general statement, while only MCQs are considered in the rest of the paper. The abstract should be more specific.

4) page 1, lines 42-46: The difficulty of MCQs is defined later on in this paper in a more formal way. If this is a statement cited from some other work, then a reference should be given. Just a footnote is not enough in my opinion.

5) page 2, lines 47-51: This question is not relevant for the Semantic Web community; it is more important in the context of cognitive science. I would suggest reconsidering the formulation of the main research questions of the paper.

6) page 2, lines 32-33: It would be a contribution to the community if the question set could be made available online.

7) page 3, lines 44-45: "One point worth mentioning is that underlying difficulty models are not part of most existing question generation approaches". Why are MCQ generation approaches that target a selected Bloom's taxonomy level not considered as having underlying difficulty models? More generally, should Bloom's cognitive levels be considered a type of difficulty level?

8) page 4, lines 21-24: Similarity is defined in quite an ambiguous way. It is not clear whether it concerns concepts from an ontology, options in an MCQ, or something more general. The definition should be more specific.

9) page 4, lines 33-35: Semantic similarity has been addressed in the literature before, and the proposed similarity measure should be put into the context of existing work and compared to other similarity measures. A further question is whether different possible types of similarity perform differently in difficulty prediction.

10) page 4, lines 8-9: "These limitations motivate us to develop the new difficulty measure described below." It is not correct to state "to develop the new difficulty measure described below" while the measure "was introduced in [7]". In other words, the difficulty measure is not defined in this paper as claimed.

11) page 5, lines 24-26: "Since the question is asking for the most likely diagnosis, the option entity that has the strongest relation to the stem entities is the key." "Option entity" should be defined first. It may be assumed that it refers to an ontology entity used for the generation of the option; however, that is too bold an assumption for a reader to make.

12) page 5, lines 29-30: Similar to comment 11: "annotated axiom" should be defined.

13) page 5, line 21: The developed difficulty measures have been applied and tested on a single (though large and important) ontology. How would they perform on other ontologies or knowledge graphs in the general case?

14) page 6, lines 41-43: "Each expert reviewed approximately 30 questions belonging to their specialty." How did you decide which question belongs to which specialty? This is a question classification problem in its own right. It is probably out of the scope of the paper, but it could be indicated in the text.

15) page 8, lines 1-18: The concern is whether the number of participants is large enough to obtain unbiased results. The methodology applied for the data analysis is correct and scientifically sound, but the question is whether we can consider the presented results trustworthy when obtained with this (small?) number of participants. Could a minimal number of participants be determined that would guarantee a certain level of trust in the obtained results (one possible way to approach this is sketched below)?
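
As a concrete illustration of the sample-size concern in point 15, the following is a minimal sketch, not taken from the paper, of one standard way to estimate how many examinees would be needed to pin down a question's difficulty (the proportion of examinees answering it correctly) within a chosen margin of error. The worst-case proportion (0.5), the margins, and the 95% confidence level are illustrative assumptions.

    # Illustrative sketch only: back-of-the-envelope sample size for estimating a
    # question's difficulty (proportion of examinees answering correctly) within a
    # chosen margin of error. The worst-case proportion (0.5), the margins (0.10
    # and 0.20) and the 95% confidence level (z = 1.96) are assumed values, not
    # taken from the paper under review.
    import math

    def required_examinees(p: float = 0.5, margin: float = 0.10, z: float = 1.96) -> int:
        """Normal-approximation sample size for estimating a proportion."""
        return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

    print(required_examinees())             # 97 examinees for a +/-10% margin
    print(required_examinees(margin=0.20))  # 25 examinees for a +/-20% margin

Under these illustrative assumptions, 12 residents would only bound each question's difficulty within roughly +/-28 percentage points, which makes the concern concrete.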

Review #3
Anonymous submitted on 25/Oct/2019
Suggestion:
Minor Revision
Review Comment:

The paper compares two ontology-based difficulty measures for multiple choice questions (MCQs). Both of these measures were introduced in previously published papers. The current paper compares the measures and provides an analysis of the data collected from a mock examination taken by a small set of 10-12 respondents, together with data on how domain experts rated the difficulty of the MCQs. It also makes use of data about the performance of experts on the same set of MCQs. The main contribution of the paper is the data analysis and comparison.
The comparison is interesting and detailed.

The following remarks need to be addressed before the piece can be published.
1. To make the paper self-contained, the authors need to provide a little more detail on the similarity-based measure. Subsumers are used in the definition; a brief description of how these are computed could be included (see the illustrative sketch after this list).
2. The relation strength indicativeness (RSI) measure assumes that the ontology has a certain particular structure. So, how general-purpose is this measure?
3. Also, RSI makes heavy use of the "strength" of the relation. It is not clear whether this strength is manually provided by the ontology author or whether it can be automatically computed from the ontology. Clarification is needed.
4. Clarification is required for Table 5. The 6th column, "percentage correct", seems unnecessary: it appears to be the rounded value of column 4. I expect this column to have the same values for (a) and (b), as it is probably the ground truth. A proper explanation needs to be added. Also, in the caption of Table 5, is (b) the residents' performance? How can the RSI measure be used to get responses? This table is very confusing. An appropriate description needs to be added to properly explain Table 5.
5. In Section 5.2.2 (page 11), it is claimed: “Combining the similarity measure with with stem indicativeness, as explained in Section 3.2, increases the performance on all metrics except for recall on difficult questions as can be seen in Table 6.” However, I did not find any such explanation in Section 3.2. Which row in Table 6 corresponds to this?
6. In Table 6, the methods Random, Weighted, and Majority are mentioned. A brief explanation of what these methods are should be included in the text.
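
As an illustration of the kind of description requested in remark 1, the following is a minimal sketch, not taken from the paper, of how subsumers could be obtained by transitively closing an is-a hierarchy and how a set-overlap similarity could then be computed over them. The toy hierarchy, the class names, and the Jaccard-style overlap are assumptions for illustration only and are not necessarily the measure the authors use.

    # Illustrative sketch only (NOT the paper's measure): computing subsumers by
    # transitively closing a toy "is-a" hierarchy, then scoring similarity as the
    # Jaccard overlap of the two subsumer sets. Class names are invented examples.
    from typing import Dict, Set

    IS_A: Dict[str, Set[str]] = {            # child -> direct parents (toy data)
        "bacterial_pneumonia": {"pneumonia"},
        "viral_pneumonia": {"pneumonia"},
        "pneumonia": {"lung_disease"},
        "lung_disease": {"disease"},
        "disease": set(),
    }

    def subsumers(cls: str) -> Set[str]:
        """All ancestors of cls (including cls itself) under the is-a relation."""
        result = {cls}
        frontier = [cls]
        while frontier:
            for parent in IS_A.get(frontier.pop(), set()):
                if parent not in result:
                    result.add(parent)
                    frontier.append(parent)
        return result

    def subsumer_similarity(a: str, b: str) -> float:
        """Jaccard overlap of the subsumer sets of two classes."""
        sa, sb = subsumers(a), subsumers(b)
        return len(sa & sb) / len(sa | sb)

    print(subsumer_similarity("bacterial_pneumonia", "viral_pneumonia"))  # 0.6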

Minor:
1. What is the numerator for the expression used in defining disDiff(S,k,d) on Page 5? I think a symbol is missing here.