Difficulty-level Modeling of Ontology-based Factual Questions

Tracking #: 1712-2924

Vinu Ellampallil Venugopal
P Sreenivasa Kumar

Responsible editor: 
Michel Dumontier

Submission type: 
Full Paper
Semantics-based knowledge representations such as ontologies are found to be very useful in automatically generating meaningful factual questions. Determining the difficulty-level of these system-generated questions helps in using them effectively in various educational and professional applications. The existing approaches for finding the difficulty-level of factual questions are very simple and are limited to a few basic principles. We propose a new methodology for this problem by considering an educational theory called Item Response Theory (IRT). In IRT, the knowledge proficiency of end users (learners) is considered when assigning difficulty-levels, on the assumption that a given question is perceived differently by learners of various proficiencies. We have carried out a detailed study of the features/factors of a question statement that could possibly determine its difficulty-level for three learner categories (experts, intermediates, and beginners). We formulate ontology-based metrics for these features. We then train three logistic regression models to predict the difficulty-level corresponding to the three learner categories. The output of these models is interpreted using the IRT to find the question’s overall difficulty-level. The performance of the models based on cross-validation is found to be satisfactory, and the predicted difficulty-levels of questions (chosen from four domains) were found to be close to their actual difficulty-levels determined by domain experts. Comparison with the state-of-the-art method shows an improvement of 8.5% in correctly predicting the difficulty-levels of benchmark questions.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/Oct/2017
Minor Revision
Review Comment:

This paper proposes a number of ontology metrics for determining the difficulty of ontology-based factual questions and ranks their influence on predicting the question difficulty using a supervised learning algorithm trained on the data obtained from students and domain experts.

The paper presents new and potentially interesting results in the subject area, but it is lacking in some details which could help the reader understand the context, the methodology and the significance of the results.

To further improve the paper, the author should consider the following points:

1. The theoretical grounding for the four metrics should be introduced and discussed - why these four? Are there any other properties that could be used for predicting the difficulty (completeness)?
2. Is coherence a relation between individuals and concepts (pg. 8 3rd par) or between individuals (pg. 8 last par)?
3. Clarify the definition of the overall depthRatio (pg 9 2nd par) - maybe with a formula?
4. The training data consisted of a set of 520 questions. How many students and domain experts (more than 5?) participated in the classification? Was the classification done manually or by using the IRT algorithm (pg. 4)?
5. Table 1 shows that Coherence and Popularity are less influential for intermediate learners according to the IG and RF algorithms and similarly Coherence is less influential for beginners in two out of three algorithms - these results are not discussed?
6. How does the performance of the regression models compare with the similar results from the literature?
7. Where does the 20.5% improvement mentioned in the (corrected) Abstract come from? Is it the difference between the correlation coefficients of the E-ATG and the new method? If so, it should be stated explicitly in Section 7.
8. Some of the content from Section 6.2 (Unclassifiable questions) could be moved after Table 1 in Section 3, where the method is introduced, as the reader is left wondering what happens to all the other cases (“unclassifiable questions”).
9. Typos should be corrected, e.g. pg. 3, first line: tuple55s
10. The limitations should include the lack of other approaches for comparison?

Review #2
By Dominic Seyler submitted on 20/Nov/2017
Review Comment:

Title: Difficulty-level Modeling of Ontology-based Factual Questions
Authors: Vinu E.V and P Sreenivasa Kumar

This paper proposes a novel methodology to estimate difficulty-levels of automatically
generated questions from ontologies or knowledge bases. The methodology is based on
educational theory (Item Response Theory). Leveraging this theory, the authors design and
evaluate features for three learner categories (expert, intermediate, beginner) and three
difficulty levels (high, medium, low). The authors train a classifier for each learner
category and decide a question's difficulty level as an ensemble of these classifiers.

Pros:
* difficulty features are based on educational theory
* in-depth study of effectiveness of features
* classifier considers proficiency level of learner
* paper very well written

Cons:
* no performance comparison to models that are not proposed by authors
* evaluation does not state performance of other system, just relative improvement
* evaluation set is extremely small (24 questions); it is therefore questionable whether
the results are representative

Overall, the biggest contribution this paper makes is the incorporation of educational
theory in the context of feature construction for the modeling of automatic difficulty
assessment for questions generated from an ontology. A weakness of the work is the
evaluation of the effectiveness of the classifier, since the dataset is extremely small
and might not be representative. However, the pros significantly outweigh the cons.


The citation

Dominic Seyler, Mohamed Yahya, and Klaus Berberich. Knowledge questions from
knowledge graphs. CoRR, abs/1610.09935, 2016.

should be

Dominic Seyler, Mohamed Yahya, and Klaus Berberich. Knowledge Questions from
Knowledge Graphs. ICTIR (2017).

Review #3
Anonymous submitted on 14/Dec/2017
Major Revision
Review Comment:

(Our overall impression is based on UK scaling, i.e., about a Merit for an MSc (appropriately adjusted).)

(This review is the result of a collaboration between a senior academic and a PhD student working in the area.)

(Note that the review is written in light LaTeX and should compile standalone except for the bibliography:

[1] T. Alsubait, B. Parsia, and U. Sattler. Generating multiple choice questions from ontologies: Lessons learnt. In OWLED, pages 73–84, 2014.
[2] G. T. Brown and H. H. Abdulnabi. Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. In Frontiers in Education, volume 2, page 24. Frontiers, 2017.
[3] N. Karamanis, L. A. Ha, and R. Mitkov. Generating multiple-choice test items from medical text: A pilot study. In Proceedings of the Fourth International Natural Language Generation Conference, pages 111–113. Association for Computational Linguistics, 2006.
[4] J. D. Kibble and T. Johnson. Are faculty predictions or item taxonomies useful for estimating the outcome of multiple-choice examinations? Advances in physiology education, 35(4):396–401, 2011.)


The paper presents an ontology-based approach for predicting the difficulty of short answer factual questions taking into account the knowledge level of learners. Four features were proposed and the corresponding ontology based measures were defined. A prediction model that relies on the proposed features was developed and an evaluation of the prediction model is finally reported.

The major contribution of the paper is the definition of a new set of features that can be used for predicting the difficulty of short answer factual questions. Taking into account learners’ knowledge level in predicting difficulty is another distinguishing feature of the work presented. This is especially important in adaptive learning systems where materials need to be adapted to learner levels. The prediction methodology seems feasible. With regards to the presentation, the paper is well organised and easy to follow.

On the downside, the evaluation methodology and the result analysis are not reported sufficiently. There are also aspects of the data that were not considered in developing and evaluating the prediction model, such as the distribution of difficulty levels and other aspects that I explain below. I could not interpret the results based on the reported information. Therefore, I have concerns about the implementation of the prediction model and the claims made about difficulty prediction. In particular, I believe that the claims ``The performance of the models based on cross-validation is found to be satisfactory'' and ``Comparison with the state-of-the-art method shows an improvement of 8.5\% in correctly predicting the difficulty-levels of benchmark questions'' need additional support.

\paragraph{Recommendation} Major correction. I believe that, at least, deeper analysis of the data is required and the evaluation sections need to be rewritten. Collecting additional data might also be needed.

\section{Major remarks}
\item I assume that the aim is to approximate difficulty as indicated by student performance (Rasch difficulty). However, the authors seem to imply that expert prediction is an accurate proxy for student performance, which has been questioned in several studies (for example, see \cite{kibble2011faculty}). This is apparent from the training data, where observations about student performance and expert prediction are mixed together in order to increase sample size. In addition, the automatic prediction was compared with domain expert prediction, as indicated by: “the predicted difficulty-levels of questions (chosen from four domains) were found to be close to their actual difficulty-levels determined by domain experts”. The target difficulty (student performance, expert prediction, or both) needs to be stated clearly. The training data and the evaluation need to be aligned with the stated goal(s). If the goal is to predict student performance, expert prediction and student performance should not be mixed together without a justification. Minimally, the agreement between them needs to be checked on a subset of the data. Mixing both difficulty metrics together seems plausible in cases where there is a large agreement. However, this needs more thought about, and discussion of, the implications.

\item Section 6, training data paragraph: A sample of 520 questions was selected for training. However, relevant practical information has not been reported. This includes:
\item Why 520 questions? and is this enough as training data?
\item How are questions selected (e.g. random sample, stratified sample)?
\item What is the distribution of difficulty in the training sample?
\item What is the distribution of proficiency levels in the training sample?
\item Does the training set contain enough questions that capture all difficulty and learner levels?
\item How many observations about student performance and how many observations about expert prediction are there in the training set?
\item Section 6, training data paragraph: The authors mentioned that difficulty has been obtained in a classroom setting by using IRT but did not mention how the students were recruited or how many were involved. The literature suggests that a large number of students (about 500) is needed to use IRT (see \cite{brown2017evaluating}). Due to the difficulty of obtaining participants, I expect that the difficulty information has been calculated based on a much smaller number of students. The paper needs a discussion of why IRT based on a small cohort is expected to be accurate, and of whether a simpler difficulty metric such as percentage correct was considered.
\item Section 6, training data paragraph: It is necessary to give more details about how expert prediction data were collected.
\item How were the experts selected?
\item Since the authors used questions from four different domains (ontologies) and stated that each question was evaluated by 5 experts, do they have 5 experts for each domain?
\item What were they asked to predict (d, nd)? Was this required for each type of learner (e.g. q1 is difficult for beginners, not difficult for intermediate learners, not difficult for experts)?
\item Were they required to answer the questions as well?
\item What was their agreement on prediction?
\item How long did they spend on each question? (this is particularly helpful for future studies)
\item Section 6.1, paragraph 1: The authors reported an accuracy of 76.73\%, 78.6\% and 84.23\% for their three regression models. However, accuracy on its own does not show the full picture. What about the performance of the models on each class (performance on d, and performance on nd)? This is especially important if the distribution of the classes is skewed. What about the models' performance in predicting expert prediction and in predicting student performance? Other metrics that can be considered are precision, recall, and f-measure.

\item Section 6.2, paragraph 1: The authors investigated the percentage of non-classifiable cases by analyzing questions generated from five ontologies. The relation between this set of questions (from five ontologies) and the set of questions used in training (from four ontologies) is not clear. Are the questions investigated in this section different from the questions used for training and evaluation? If you have more data, why have these data not been used for building or evaluating the models?
\item Section 7, paragraph 2: the authors stated that ``Twenty four representative questions, selected from 128213 generated questions, were utilized for the comparison.'' What is meant by ``representative'' needs to be defined. For self-containment, the selection process needs to be outlined.

\item Section 7.1, paragraph 1: The authors claimed an 8.5\% improvement in prediction using the new set of features. However, due to the small sample size (24 questions), I have concerns about the generalisability of the results. This needs to be discussed and mentioned in the limitations.
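
To illustrate the per-class reporting asked for in the remark on Section 6.1, here is a minimal sketch. It is not taken from the paper: the labels are made-up toy data, and the function names `per_class_metrics` and `accuracy` are hypothetical. It assumes the two class labels d and nd mentioned above, and shows how a skewed class distribution can give a high accuracy while one class is never predicted correctly.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of questions whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def per_class_metrics(actual, predicted, classes=("d", "nd")):
    """Return {class: (precision, recall, f1)} for two paired label lists."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Count every (actual, predicted) combination once.
    pairs = Counter(zip(actual, predicted))
    results = {}
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(n for (a, p), n in pairs.items() if p == c and a != c)
        fn = sum(n for (a, p), n in pairs.items() if a == c and p != c)
        # Report 0.0 when a ratio is undefined (no predictions or no instances).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[c] = (precision, recall, f1)
    return results


# Toy, made-up data: 8 "nd" questions and 2 "d" questions (skewed classes).
actual = ["nd"] * 8 + ["d"] * 2
# A degenerate model that always answers "nd".
predicted = ["nd"] * 10

print("accuracy:", accuracy(actual, predicted))
for label, (p, r, f) in per_class_metrics(actual, predicted).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the accuracy is 0.8, yet precision, recall and f-measure for class d are all zero, which is the kind of effect that accuracy alone would hide.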

\section{Some minor remarks}
\item Abstract: The authors stated that previous approaches suffer from being simple. Simplicity is not the real problem and the focus should be on the performance of previous approaches.

\item Introduction, paragraph 3: the authors stated that ``questions that are generated from raw text are suitable only for language learning tasks''. Several text-based approaches generate questions that are not intended for language testing. For example, see \cite{karamanis2006generating} and work done by Michael Heilman.

\item Section 2.1, paragraph 4: The authors stated that they studied ``all the possible generic question patterns that are useful in generating common factual questions''. I believe that the number of question patterns is infinite and therefore the previous statement needs to be quantified.

\item Section 4, paragraph 1: it is stated that the similarity theory has been applied to
analogy type MCQs. The theory has been applied to other types of questions. For more details, see the paper \cite{alsubait2014generating}.

\item Figure 1: What is the input for each classifier? How are features extracted?

\item Section 6, training data paragraph: according to the authors, questions used in the training set were generated from four ontologies. While a reader may look up information about these ontologies in the project website, it would be better to give some description about these ontologies and the generated questions locally (e.g. their size, whether they are hand-crafted or not, how many questions were selected per ontology) to make the paper more self-contained. If these ontologies were hand-crafted, this needs to be mentioned as a limitation.

\item Section 6, feature selection paragraph: What makes the selected feature selection methods ``popular''? Was this based on the literature?

\item Section 6, feature selection paragraph: I assume that these feature selection methods take the correlation between the features into account; do they? Are there some correlated features?

\item Section 6.1, paragraph 2: The authors reported that their method correctly classified about 77\% of questions. Out of the remaining 23\%, how many were misclassified and how many were non classifiable?

\item Section 6.2, paragraph 2: Are there any questions where the actual difficulty for different learner levels was unexpected? For example:
\item questions that were easy for beginners but difficult for experts,
\item questions that were easy for intermediate learners but difficult for beginners and experts.
Are there any observations about the quality of these questions?

\item Section 7, paragraph 2: Using the term ``actual difficulty'' is ambiguous unless it is defined earlier (e.g. Rasch difficulty and actual difficulty are used interchangeably).

\item Section 7, paragraph 2: The authors reported a correlation of 67\% between the predicted difficulty and actual difficulty. Providing the number of questions predicted correctly as in the following paragraph will make the comparison easier.

\item Conclusion, last paragraph: One of the limitations mentioned is that the method has been used on medium-size ontologies. Investigating its performance with large-size ontologies is stated as a future research area. Investigating the methods with small ontologies is also needed. I suspect that deriving the metrics from a small ontology could give a worse prediction. For example, in small ontologies, the inferred class hierarchy is expected to be shallower and therefore the accuracy of the `specificity' metric will be affected. A discussion of this and other similar ontology characteristics that affect metric performance would be valuable to add.

\item There are different places where numbers need to be presented in order to support the claims made:
\item Abstract: ``... is found to be \textbf{satisfactory}'', what is the performance (in numbers)?

\item Abstract: ``8.5\% in correctly predicating the difficulty-levels of \textbf{benchmark} questions'', what is the size of the benchmark?

\item Introduction, paragraph 4: ``In the E-ATG system, \textbf{a state-of-the-art} QG system ..'', what makes it the state of the art? How does it perform compared to others (in numbers)?

\item Introduction, paragraph 4: ``we have proposed an \textbf{interesting} method for ...'', what makes it interesting? How does it differ from existing approaches? How does it perform (in numbers)?
\item Introduction, paragraph 4: ``Even though this method can correctly predict the difficulty-levels to \textbf{a large extent}'', how does it perform (in numbers)? Are there any observations about cases where the method fails?




Dear Reviewers,
I have identified a typo in the manuscript. I kindly request you to take the following minor change into consideration while reviewing the paper.

In the Abstract, instead of "8.5% improved" it should have been "20.5%" (from 67% to 87.5%). The same mistake occurs in the conclusion section as well.

Thanking you.