Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.
The paper is a revision of a previous paper, and it is not entirely
clear to me what has changed compared to the previous version. In my
last review, my single point concerned the evaluation metrics and the
unusual use of ROUGE. As the authors note in their response, ROUGE has
been used with the WebNLG dataset, but there it was used for
graph-to-text evaluation. My concern remains that in a typical
knowledge graph the individual triples are not ordered; the order of
the triples in the reference data is somewhat arbitrary. When ROUGE-2
and ROUGE-L are used for evaluation, the reported results will depend
on this arbitrary ordering. Specifically, the columns R-2 and R-L in
Table 4, Figure 2, Figure 3 and Figure 4 will change depending on the
ordering of the triples for a specific prediction, if my understanding
of the method is correct.
As an example, consider the following "result" with two generated
triples in different orderings. In the first case the ordering is the
same between the target and the prediction, and all ROUGE scores are 1.
In the second case the ordering has changed, and the ROUGE-2 score
drops to 0.8 while the ROUGE-L score drops to 0.5. This is a
considerable difference.
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> rouge.compute(predictions=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"], references=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"])
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
>>> rouge.compute(predictions=["[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"], references=["[(Rome # CapitalOf # Italy) | (dog # isA # animal)]"])
{'rouge1': 1.0, 'rouge2': 0.8000000000000002, 'rougeL': 0.5, 'rougeLsum': 0.5}
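To make concrete what an order-insensitive evaluation could look like, here is a minimal sketch of a set-based triple-level F1. The parser for the bracketed serialization is my own assumption about the format shown above, not the authors' code; the point is only that this score is identical for both orderings:

```python
def parse_triples(s: str) -> set:
    """Parse the bracketed serialization '[(a # b # c) | (d # e # f)]'
    into an (unordered) set of triples. Format is assumed from the
    examples above, not taken from the authors' implementation."""
    inner = s.strip().strip("[]")
    return {tuple(part.strip(" ()").split(" # ")) for part in inner.split("|")}

def triple_f1(prediction: str, reference: str) -> float:
    """F1 over sets of triples; invariant to the order of the triples."""
    pred, ref = parse_triples(prediction), parse_triples(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

a = "[(dog # isA # animal) | (Rome # CapitalOf # Italy)]"
b = "[(Rome # CapitalOf # Italy) | (dog # isA # animal)]"
print(triple_f1(a, a), triple_f1(a, b))  # both 1.0
```

Under such a metric the arbitrary ordering of reference triples would no longer affect the reported numbers.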
The authors do not seem to be concerned about this issue, yet it could
be a major one.
Rereading Section 2.3, I now also see that the equations and the
description of the equations are either questionable or unclear. The
ROUGE measures are not measures on sets, but on counts. An example with
"dog dog" predicted where the target is "dog" yields a ROUGE-1 score
that is not 1.
>>> rouge.compute(predictions=["dog dog"], references=["dog"])
{'rouge1': 0.6666666666666666, 'rouge2': 0.0, 'rougeL': 0.6666666666666666, 'rougeLsum': 0.6666666666666666}
If ROUGE-1 were based on sets of unigrams, we would have expected it
to be 1. Either the notation and explanation are unclear or they are
wrong, or I am fundamentally misunderstanding how the evaluation is
done.
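To spell out the count-based definition behind the output above, here is a minimal sketch of ROUGE-1 F1 computed from clipped unigram counts rather than sets (my own illustration, not the authors' formula):

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1 from clipped unigram counts (not sets)."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each reference token is matched at most as many
    # times as it occurs in the reference.
    overlap = sum(min(c, ref_counts[t]) for t, c in pred_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("dog dog", "dog"), 4))  # 0.6667
```

With counts, the second "dog" in the prediction hurts precision (1/2) even though recall is 1, giving F1 = 2/3, which matches the evaluate output above; with sets, the score would have been 1.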
Minor issue:
Page 3, line 14: unresolved reference