Review Comment:
This is a re-resubmission. Therefore I will only focus on the issues I raised in my previous reviews:
* The still unclear protocol for the first experiment, the evaluation of the CRF features. In particular, the embeddings have been trained on the complete dataset, which
makes the comparison to baselines potentially unfair. Therefore, the authors
were asked to check the experimental protocol in general.
The authors (as far as I read the feedback) argue that this is fine. More precisely, they touch upon (1) the training/test split resp. cross-validation. But this was not my point. My point is that the embeddings used were learned on the complete dataset. This induces (no matter how you split the dataset) a feedback loop from any test set back into the training set for any machine learning system that uses the trained embeddings.
In turn, the machine learning system has seen (at least implicitly) the test set. This makes the comparison to classifiers that do not use the embeddings but only the data split unfair, since they would not have seen (even implicitly) the test set.
In turn, their argument that “CRFemb doesn’t contain any direct additional knowledge about the labels of the tokens of this product. It only contains additional knowledge about the similarity between the tokens” is the more relevant one. And this is where I am confused. Even distances among tokens are feedback that provides extra knowledge. To be more precise, the word2vec model captures the manifold structure of the complete dataset.
Imagine using a k-nearest-neighbour classifier (with a fixed metric) instead of a CRF. Here it is apparent that, with the very same setup as in the paper, the kNN would effectively have been “trained” on the complete dataset. Regarding reference [2], to be honest, I think they have the same potential problem of a feedback loop from test to training: it is stated that only the l2 regularisation coefficients of the CRF have been cross-validated.
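To make the thought experiment concrete, here is a minimal sketch (toy sentences, hypothetical token labels, gensim >= 4 and scikit-learn; not the paper's actual pipeline) of how a kNN over embeddings fitted on the complete dataset implicitly uses the test tokens:

```python
# Minimal sketch of the kNN thought experiment (toy data, not the authors' pipeline).
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier

train_sents = [["cheap", "usb", "cable"], ["fast", "hdmi", "adapter"]]
test_sents  = [["cheap", "hdmi", "cable"]]

# Leaky step: the embedding model is fit on the COMPLETE dataset,
# so the metric space the kNN operates in already reflects the test tokens.
w2v = Word2Vec(train_sents + test_sents, vector_size=16, min_count=1, seed=0)

# Token labels here are purely hypothetical, only to make the example runnable.
X_train = [w2v.wv[tok] for sent in train_sents for tok in sent]
y_train = ["MODIFIER", "TYPE", "TYPE", "MODIFIER", "TYPE", "TYPE"]

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# At test time, the distances used by the kNN were shaped by these very test
# tokens, i.e. the classifier has "seen" the test set at least implicitly.
X_test = [w2v.wv[tok] for tok in test_sents[0]]
print(knn.predict(X_test))
```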
Anyhow, maybe the "solution" is the following: we add a line saying "Please note that the embeddings were trained on the complete dataset. The embeddings thus capture the categories and are likely to provide useful feedback, akin to a semi-supervised learning setting." This way, we make the reader aware of the situation.
Anyhow, I leave this decision to the action editor. In my opinion, the cleanest setup would be any data split, with the whole training (including embedding generation) then done on the training data only (whether cross-validated or just train/test splits).
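For contrast, a minimal sketch of what I would consider the clean protocol, again on toy data with a generic scikit-learn split (assumptions of mine, not the authors' code): the embedding model only ever sees the training portion.

```python
# Sketch of the "clean" protocol: split first, then run the WHOLE pipeline
# (embedding training included) on the training portion only.
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split

sentences = [["cheap", "usb", "cable"], ["fast", "hdmi", "adapter"],
             ["cheap", "hdmi", "cable"], ["slow", "usb", "hub"]]

train_sents, test_sents = train_test_split(sentences, test_size=0.25, random_state=0)

# Embeddings are fit on the training sentences only; no test-set feedback loop.
w2v = Word2Vec(train_sents, vector_size=16, min_count=1, seed=0)

def embed(token, dim=16):
    # Test tokens unseen during embedding training need a fallback vector.
    return w2v.wv[token] if token in w2v.wv else [0.0] * dim

# Downstream features (for a CRF, kNN, or any other classifier) are then built
# from this training-only model.
X_test = [[embed(tok) for tok in sent] for sent in test_sents]
```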
About the “significance” issue: I beg your pardon, but I was not talking about, e.g., Table 4. On page 11, it is said that the differences are not significant, but it is not said which test was used. Maybe also a McNemar test, but this is unclear.
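If McNemar's test is indeed what was used, it can be reported and reproduced in a couple of lines; the counts below are purely hypothetical and only illustrate the expected input:

```python
# Sketch of a McNemar test on paired predictions (hypothetical counts).
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table over the same test items:
# rows = system A correct / incorrect, columns = system B correct / incorrect.
table = [[520, 18],
         [11, 451]]

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(result.statistic, result.pvalue)
```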