A Machine Learning Approach for Product Matching and Categorization

Tracking #: 1664-2876

Petar Ristoski
Petar Petrovski
Peter Mika
Heiko Paulheim

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across different e-shops. To improve the consumer experience, e.g., by allowing for easily comparing offers by different vendors, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages neural language models and deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the need for lots of data for supervision, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performances. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.

Minor Revision

Solicited Reviews:
Review #1
By Kristian Kersting submitted on 01/Aug/2017
Minor Revision
Review Comment:

This is a re-resubmission. Therefore, I will focus only on the issues I raised in my previous reviews:

* The protocol for the first experiment, the evaluation of the CRF features, is still unclear. In particular, the embeddings have been trained on the complete dataset, which makes the comparison to baselines potentially unfair. Therefore, the authors were asked to check the experimental protocol in general.

The authors (as far as I read the feedback) argue that this is fine. More precisely, they touch (1) upon the training/test split and cross-validation, respectively. But this was not my point. My point is that the embeddings used were learned on the complete dataset. This induces (no matter how you split a dataset) a feedback loop from any test set to the training set for any machine learning system that uses the trained embeddings. In turn, the machine learning system has seen (at least implicitly) the test set. This is an unfair comparison against classifiers that do not use the embeddings but only the split of the dataset, since they would not have seen (at least implicitly) the test set.

In turn, their argument that “CRFemb doesn’t contain any direct additional knowledge about the labels of the tokens of this product. It only contains additional knowledge about the similarity between the tokens” is more relevant. And this is where I am confused. Even distances among tokens are feedback that provides extra knowledge. To be more precise, the word2vec model captures the manifold structure of the complete dataset. Imagine using a k-nearest-neighbour classifier (with a fixed metric) instead of a CRF. Here it is apparent that, with the very same setup as in the paper, the kNN has actually been “trained” on the complete dataset. Regarding reference [2], to be honest, I think it has the same potential problem of a feedback loop from test to training. It is stated that only the l2 regularisation coefficients of the CRF have been cross-validated.
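
The nearest-neighbour argument can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the product titles and labels are invented, and simple co-occurrence counts stand in for the word2vec embeddings used in the paper. It only shows the mechanism, namely that a token occurring solely in the test titles obtains a usable representation when the embeddings are built on the complete dataset.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_embeddings(titles):
    """Toy stand-in for word2vec (assumption): a token is represented by
    the counts of the other tokens it shares a title with."""
    vectors = defaultdict(Counter)
    for title in titles:
        tokens = title.lower().split()
        for i, tok in enumerate(tokens):
            for j, ctx in enumerate(tokens):
                if i != j:
                    vectors[tok][ctx] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def nearest_labelled_token(token, embeddings, train_labels):
    """1-NN over labelled training tokens; None if the token has no embedding."""
    if token not in embeddings:
        return None
    candidates = [t for t in train_labels if t in embeddings and t != token]
    if not candidates:
        return None
    best = max(candidates, key=lambda t: cosine(embeddings[token], embeddings[t]))
    return train_labels[best]


# Hypothetical data: labels exist only for tokens of the training titles.
train_titles = ["apple iphone 64gb black", "apple ipad 32gb white"]
test_titles = ["samsung galaxy 64gb black", "samsung tablet 32gb white"]
train_labels = {
    "apple": "brand", "iphone": "model", "ipad": "model",
    "64gb": "storage", "32gb": "storage", "black": "color", "white": "color",
}

emb_all = cooccurrence_embeddings(train_titles + test_titles)  # sees test titles
emb_train = cooccurrence_embeddings(train_titles)              # training only

leaky = nearest_labelled_token("samsung", emb_all, train_labels)
clean = nearest_labelled_token("samsung", emb_train, train_labels)
print(leaky, clean)
```

With the embeddings built on all titles, the unseen token "samsung" lands next to "apple" and inherits its label without any label from the test set being used; with training-only embeddings it has no representation at all. This is the implicit feedback from test to training discussed above.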

Anyhow, maybe the "solution" is the following. We add a line saying "Please note that the embeddings were trained on the complete dataset. The embeddings are the categories and are likely to provide useful feedback, akin to a semi-supervised learning setting." This way, we make the reader aware of the situation.

Anyhow, I leave this decision to the action editor. In my opinion, the cleanest setup would be to split the data first and then do the whole training (including embedding generation) on the training data only (whether with cross-validation or a single training/test split).
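
A minimal sketch of such a protocol, under stated assumptions: `fit_embeddings`, `fit_model` and `score` are hypothetical callables standing in for the embedding training, the CRF training and the evaluation, none of which are specified here. The only point is the ordering: the split happens first, and the embeddings are fitted inside each fold on the training portion alone.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the item indices once and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, k, fit_embeddings, fit_model, score):
    """Cross-validation in which the embeddings never see the held-out fold."""
    results = []
    for held_out in k_fold_indices(len(items), k):
        held = set(held_out)
        train = [items[i] for i in range(len(items)) if i not in held]
        test = [items[i] for i in held_out]
        embeddings = fit_embeddings(train)    # fitted on training data only
        model = fit_model(train, embeddings)  # classifier on the same data
        results.append(score(model, embeddings, test))
    return results


# Toy usage with placeholder callables (assumptions, not the paper's models):
# the embedding step just records which items it was given.
seen_by_embeddings = []


def toy_fit_embeddings(train):
    seen_by_embeddings.append(set(train))
    return {}


fold_sizes = cross_validate(
    list("abcdefghij"), 5,
    toy_fit_embeddings,
    lambda train, emb: None,
    lambda model, emb, test: len(test),
)
print(fold_sizes)
```

The same function covers a single training/test split by evaluating one fold only; the baselines and the embedding-based classifier would then be compared on identical folds.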

About the “significance” issue: I beg your pardon, but I was not talking about, e.g., Table 4. On page 11, it is said that they are not significant, but it is not said which test was used. Maybe also a McN test, but this is unclear.
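
For reference, a minimal sketch of what reporting such a test could look like, assuming that "McN" abbreviates McNemar's test (the review does not confirm this). The function computes the exact two-sided variant from the two discordant counts of a paired comparison; the counts in the usage line are invented for illustration.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: test items classifier A labels correctly and classifier B does not.
    c: test items classifier B labels correctly and classifier A does not.
    Under the null hypothesis the discordant items follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, not taken from the paper.
p_value = mcnemar_exact(10, 2)
print(round(p_value, 4))
```

Stating the test name, the discordant counts and the resulting p-value next to the claim on page 11 would resolve the ambiguity.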

Review #2
Anonymous submitted on 01/Aug/2017
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors have addressed all comments in a satisfactory way.