Review Comment:
This paper addresses automatic difficulty prediction for multiple-choice questions (MCQs). The topic is important for research on the automatic creation of
assessment tools, especially approaches that use structured domain knowledge such as ontologies and knowledge graphs. The paper is well written and organized.
My comments are as follows:
1) My main concern is that the paper relies heavily on a previously published paper by the same authors. Even though the authors made a significant effort to distinguish the two papers, it is still difficult for a reader to identify the contribution of this paper from a Semantic Web standpoint.
2) page 1, lines 16-17: The statement "Question difficulty ... determined after an exam is taken ... provides no utility for the generation of new questions" excludes some iterative and machine-learning approaches. It should be reconsidered more carefully.
3) page 1, line 20: The abstract makes quite a general statement, while only MCQs are considered in the rest of the paper. The abstract should be more specific.
4) page 1, lines 42-46: The difficulty of MCQs is defined more formally later in the paper. If this statement is cited from another work, then a reference should be given; a footnote alone is not enough in my opinion.
5) page 2, lines 47-51: This question is not relevant to the Semantic Web community; it is more important in the context of cognitive science. I would suggest reconsidering the formulation of the main research questions of the paper.
6) page 2, lines 32-33: Could the question set be made available online? It would be a contribution to the community.
7) page 3, lines 44-45: "One point worth mentioning is that underlying difficulty models are not part of most existing question generation approaches". Why are MCQ generation approaches that target a selected Bloom's taxonomy level not considered to have underlying difficulty models? More generally, should Bloom's cognitive levels be considered a type of difficulty level?
8) page 4, lines 21-24: Similarity is defined in a rather ambiguous way. It is not clear whether it concerns concepts from an ontology, options in an MCQ, or something more general. The definition should be more specific.
9) page 4, lines 33-35: Semantic similarity has been addressed in the literature before, and the proposed similarity measure should be put into the context of existing work and compared to other similarity measures. A further question is whether different types of similarity perform differently in difficulty prediction; a sketch contrasting several classical measures follows these comments.
10) page 4, lines 8-9: "These limitations motivate us to develop the new difficulty measure described below." It is not correct to state "to develop the new difficulty measure described below" when the measure "was introduced in [7]". In other words, the difficulty measure is not defined in this paper, as claimed.
11) page 5, lines 24-26: "Since the question is asking for the most likely diagnosis, the option entity that has the strongest relation to the stem entities is the key." "option entity" should be defined first. It may be assumed that it refers to the ontology entity used to generate the option, but that is too bold an assumption for a reader to make.
12) page 5, lines 29-30: Similar to comment 11: "annotated axiom" should be defined.
13) page 5, lines 21-21: The developed difficulty measures have been applied and tested on a single (though large and important) ontology. How would they perform on other ontologies or knowledge graphs in the general case?
14) page 6, lines 41-43: "Each expert reviewed approximately 30 questions belonging to their specialty." How did you decide which question belongs to which specialty? This is a question-classification problem in its own right; it is probably out of the scope of the paper, but it could at least be indicated in the text.
15) page 8, lines 1-18: The concern is whether the number of participants is large enough to obtain unbiased results. The methodology applied for the data analysis is correct and scientifically sound, but can the presented results be considered trustworthy given this (small?) number of participants? Could the minimal number of participants that would guarantee a certain level of trust in the results be determined? A power-analysis sketch illustrating this point follows below.
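
Regarding comment 9, the sketch below contrasts three classical taxonomy-based similarity measures that the proposed measure could be compared against. This is illustrative only: WordNet (via NLTK) stands in for the paper's ontology, and the concept pair is an arbitrary assumption, not taken from the paper.

```python
# Illustrative comparison of classical taxonomy-based similarity measures.
# WordNet stands in for the paper's ontology; the concept pair is arbitrary.
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

a = wn.synset('dog.n.01')
b = wn.synset('cat.n.01')

# Path similarity: inverse of the shortest path length between the concepts.
print('path:', a.path_similarity(b))
# Wu-Palmer: based on the depth of the least common subsumer
# relative to the depths of the two concepts.
print('wup: ', a.wup_similarity(b))
# Leacock-Chodorow: -log(shortest path / (2 * taxonomy depth)).
print('lch: ', a.lch_similarity(b))
```

The three measures can rank the same concept pairs quite differently, which is exactly why it would be informative to test whether the choice of similarity measure changes difficulty-prediction performance.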
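
Regarding comment 15, a standard a-priori power calculation could answer the sample-size question. The sketch below uses the Fisher z-approximation for detecting a correlation between predicted and observed difficulty; the target correlation, significance level, and power are illustrative assumptions, not values from the paper.

```python
# Minimal sample-size sketch for detecting a correlation r between
# predicted and observed difficulty (Fisher z-approximation).
# The target r, alpha, and power values are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def min_participants(r, alpha=0.05, power=0.8):
    """Smallest n such that a two-sided test of H0: rho = 0
    detects a true correlation of r with the requested power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(np.ceil(((z_alpha + z_beta) / np.arctanh(r)) ** 2 + 3))

print(min_participants(0.5))  # about 30 participants for a moderate effect
print(min_participants(0.3))  # about 85 for a weaker effect
```

Reporting such a calculation, even post hoc, would let readers judge how much trust to place in the results obtained with the current number of participants.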