A Machine Learning Approach for Product Matching and Categorization

Tracking #: 1470-2682

Petar Ristoski
Petar Petrovski
Peter Mika
Heiko Paulheim

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Abstract:
Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across e-shops. To improve the consumer experience, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach, we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the amount of data needed for supervision, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performance. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 08/Dec/2016
Major Revision
Review Comment:

The paper is backed by valuable work that leads to novel results and methods, but it also has several flaws (especially where the quality of writing is concerned) that need to be fixed.

The paper, as it is, is not easy to understand and its structure should be improved so that the reader can have a better overall vision and understanding.

Image Feature Extraction: nothing has been mentioned about the computational complexity of this task. Is it feasible for a live system or should it be run offline? Please include technical details to let the reader understand better what its employment would imply.

Please explain how you trained four different classifiers. Does each of them take the same similarity feature vectors? If yes, why four? If no, please explain the input and the output of each classifier and why it has been chosen.

In general, examples with real data would help the reader understand and follow the thread of the paper. Bottom line: simple examples should be included where possible. For example, in section 3.1, one example would help to clarify what the problem statement really means. Likewise, Fig. 1 is hard to understand without a few examples.

The gold standard built from the WDC dataset might appear to be generated ad hoc. Please explain more.
Table 4 is missing and this does not help with the understanding of the related section.

In section 5 I totally got lost. I could not understand what each section and subsection evaluated. Given the large amount of experiments and evaluation, section 5 should start with explaining the organisation of its subsections indicating what each subsection discusses.
Same for sections 6 and 7.

In the evaluation of product matching, why has only CRF been evaluated in section 5.2? Please explain.

“All the experiments that did not finish within ten days, or that have run out of memory are marked with “\”.”. If the reader does not know anything about the hardware provided and how the system has been developed, this sentence only raises doubts and adds confusion.
Please include technical details about used hardware and software.

“The gold standard is generated by manually identifying matching products in the whole dataset. Two entities are labeled as matching products if both entities contain enough information to be uniquely identified, and both entities point to the same product.”
The authors should provide more details and evidence of how this setting is not biased (who annotated the gold standard? External persons or the authors themselves? Which rules or best practices have been followed?)

“Therefore, we manually labeled all the attribute-value pairs in 50 products of each category on WDC products.”
Same applies here. What actions have been taken to avoid any bias?

To evaluate the effectiveness of the product matching approach we use the standard performance measures, i.e., Precision (P), Recall (R) and F-score (F1).
Please explain how Precision and Recall have been computed in your specific case.

It is still not clear to me how the image features can help improve the results if many products use the same brand image. Can you please explain this better?

A link to a demo of the system, or its online version, or in any case something (technical and implementation details, programming language used, framework, platform, etc.) that can somehow show how the system has been built should be included as well. Also, the used datasets, gold standard and other resources should be linked and accessible. Right now it looks like a closed box, accessible only to the very few people working on exactly the same problem and, more importantly, not easy to understand.

Typos: respectivly (section Introduction)
Page 4: the items related to the four feature extraction models include a closing parenthesis which has never been opened
Page 7: normalzation
Page 7: “To generate the feature vectors for each instance, after the features from the text are extracted, the value of each feature is tokenized, lowerca”
Page 9: insigts (Section 5.3)

There are probably more. Please run the paper through a grammar-checking system to spot and fix other typos.

Review #2
Anonymous submitted on 27/Dec/2016
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The submission presents results for two related categorization tasks: Whether two products advertised by different Web vendors (i.e. in potentially different ways) are the same or not, and under which category of a simple product hierarchy a given product can be subsumed. The main approach is to combine elements from supervised and unsupervised learning, “classical” classifier learning with neural-network/deep-learning approaches, and textual cues with image cues. The main result is that the addition of these latter techniques increases the quality (as measured by standard measures) of the results.


The paper is an extension of a recent conference paper by the same authors. The additions over and above the conference paper are limited, but the increment is within the bounds of many journals’ requirements on what should be added to a conference paper to make it an acceptable journal paper.

In terms of content, the originality is limited. The product matching and categorization tasks themselves have been studied extensively for some years. There are real-world examples of the application on which this paper focusses, although they are admittedly far from perfect.

In terms of methods, the originality is limited. The paper applies known methods, albeit in a creative way that is well-suited to the particular problem (product matching and categorization for search goods) that the empirical evaluation tackles.


The general task this paper deals with is wide and potentially interesting also for a wider audience.

However, the authors strongly delimit the task that they actually address. They only look at search goods - and search goods most of which are described by brand and type name, such that matching becomes near-trivial shallow linguistic processing in which the letter-number combinations that identify a particular mobile phone etc. sometimes have a hyphen and sometimes not, and in which the brand is sometimes named before and sometimes after the type. (This characterization is based, among other things, on the example shown on the WDC website, and I am aware that it is a simplification that will not apply to all pairs of products. Still, the matching task for these products is substantially simpler than that for other products.) The finding in the paper (p.10 bottom) that the product name field is the best feature for the matching supports this suspicion. And the question arises to what extent the finding that results become better than with other methods depends on this restriction.

The finding about image features is potentially interesting, although it mostly confirms what one knows about the marketing of these goods in saturated markets: for example, that all smartphones look rather similar, in particular those by the same brand, and that product photography aesthetics are quite homogeneous, such that the main differentiation between brands are things such as the “look-and-feel” created by fonts, shapes, colour schemes, etc. Image features should be investigated in a broader view of the matching and categorization tasks.

It is unclear how the methods would behave for other products, in particular experience goods and credence goods. (I would expect word embeddings to work very well for these, since marketing speak needs to envelop customers in a discourse that plays on the to-be-expected experience and/or the to-be-given credence. - I have no idea how image features would play out for such goods.)

It is also unclear how scientists and practitioners not interested in the specific application considered here (product aggregators) can profit from the described methods and results. In other words: For what other questions are these methods and findings valuable, and why? The “use case” of Section 6 goes in this direction, although it still sits squarely in the advertising domain.

Most importantly, it is unclear when and how the proposed method produces errors, and what can be learned from this.

Thus, in general, to reach higher significance, the method should be tested for more diverse settings, the authors should not strive only for “better evaluation measures” but also for “interesting failures and in-depth error analyses”, more diverse application areas should at least be discussed, and a section on limitations added.


The writing in general is good, although some typos and grammatical errors remain. Some samples; please carefully check for further errors:

In the recent years → in recent years
empiric → empirical
Remove trailing closing brackets in the list in 3.1.
Last sentence on p.5 is not a sentence.
Within “p1.val(f)”, the formatting changes. This is extremely hard to read.


You talk about “significant differences” in values (e.g. on p.9), but I don’t see evidence of statistical tests, or their results?

The PCA description and purpose are unclear. How is the PCA set up, what components did you find, were these just used to produce Fig.4 or also as input for further analyses? How and why did you “select several attribute-value pairs”?

The use case of Section 6 is interesting, but somewhat inconclusive. What would these results imply for an application? Would a vendor be happy to have an advertising intermediary add content to the product description that they provide (maybe there’s a reason for not providing all the information? And what if it is faulty? What if it makes the user go to the product aggregator or even a competitor?)? What is the “viewing pipeline” that you assume a web user / potential customer goes through?

Review #3
Anonymous submitted on 15/Jan/2017
Major Revision
Review Comment:

The paper tackles the product integration problem for e-shops. Specifically, it enhances a recent ESWC 2016 approach using neural learning approaches: a simple neural language model and a deep network (a CNN), respectively, to reduce the textual and image data for product matching and categorisation.

The use of neural learning methods for product matching and categorisation is interesting. Overall the paper is well written and structured. It is easy to follow, and the experimental results show that embedding features help to improve F1 scores. The take-away message that (deep) neural features can improve product matching and categorisation is nice for the community, although to be expected from a pure machine learning perspective.

However, there are also some downsides that should be clarified before publication.

The authors state in several places that improvements are significant without stating which significance test was used. Moreover, the experimental protocol for the first experiments does not mention cross-validation or any random reruns at all. For the later experiments, the authors mention cross-validation within the RapidMiner environment. This is confusing, as the feature extraction and embeddings should be computed within the cross-validation in order to avoid feedback loops from the test bins. Whether this was done or not is unfortunately unclear. Consequently, the authors should provide more details on the experimental protocol. This is really critical.

Second, the related work section should be extended. For instance, for the image-based embeddings:

Sean Bell, Kavita Bala:
Learning visual similarity for product design with convolutional neural networks.
ACM Trans. Graph. 34(4): 98:1-98:10 (2015)

Xi Wang, Zhenfeng Sun, Wenqiang Zhang, Yu Zhou, Yu-Gang Jiang:
Matching User Photos to Online Products with Robust Deep Features. ICMR 2016: 7-14

M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg:
Where to Buy It: Matching Street Clothing Photos in Online Shops. ICCV 2015: 3343-3351

and the many other approaches should be discussed, or a link to a recent overview should be provided. While these approaches may not directly use deep learning the way it is proposed in the present paper, the authors should present a broader discussion of deep learning for product search in the related work section.

Likewise, for the neural language model, the authors should discuss alternatives such as the following (very recent and “only” a workshop paper, but it also tackles the “normalisation” issue faced by the present paper):

Paul Neculoiu, Maarten Versteegh, Mihai Rotaru:
Learning Text Similarity with Siamese Recurrent Networks.
Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157,
at ACL 2016

In general, the related work section should review deep learning for related tasks in more depth.

This should also be used to shed some more light on the design choices made by the authors. The (deep) neural networks used are not the latest ones, which is of course completely fine for the purpose of the paper, but it should be discussed.

Finally, the authors should take care not to use the word “deep” too much. For instance, please note that word2vec is not really deep:

Omer Levy, Yoav Goldberg:
Neural Word Embedding as Implicit Matrix Factorization.
NIPS 2014: 2177-2185

Consequently, the authors should scale back the use of “deep” here and there to avoid giving the wrong impression that everything is deep.

If these issues are addressed, however, I think this paper should be accepted, as the use of (deep) neural features within the semantic web community is interesting.