SML-Bench – A Benchmarking Framework for Structured Machine Learning

Tracking #: 1603-2815

Patrick Westphal
Lorenz Bühmann
Simon Bin
Hajira Jabeen
Jens Lehmann

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Tool/System Report

Abstract:
The availability of structured data has increased significantly over the past decade and several approaches to learn from structured data have been proposed. The proposed logic-based, inductive learning methods are often conceptually similar, which would allow a comparison among them even if they stem from different research communities. However, so far no efforts have been made to define an environment for running learning tasks on a variety of tools, covering multiple knowledge representation languages. With SML-Bench, we propose a benchmarking framework to run inductive learning tools from the ILP and semantic web communities on a selection of learning problems. In this paper, we present the foundations of SML-Bench, discuss the systematic selection of benchmarking datasets and learning problems, and showcase an actual benchmark run on the currently supported tools.
Major Revision

Solicited Reviews:
Review #1
By George Papastefanatos submitted on 06/Jun/2017
Major Revision
Review Comment:

This paper presents SML-Bench, a benchmarking framework for evaluating inductive learning tools from the ILP and semantic web communities. The authors have conducted a thorough study of the related literature to identify candidate datasets and classification learning scenarios, and have applied the benchmark to evaluate the accuracy of 8 learning tools. The authors first present the challenges of creating such a benchmark, then provide an overview of the architecture with technical implementation details, and finally present the experimental results of the evaluation. Overall, as many ML efforts exist in the area of the semantic web and description logics, the problem addressed in this paper is very challenging and the results are promising. Below are some more focused comments.

(1) Quality, importance, and impact of the described tool or system
This paper presents a tool for benchmarking structured ML tools. On the positive side, the authors have conducted a detailed study to find candidate datasets and scenarios for ML benchmarking tasks, they have provided a detailed evaluation study on 8 tools, and they have implemented a framework that can potentially be extended for benchmarking other tools as well. On the negative side, the benchmark lacks clearly stated goals and elements regarding the quality factors that are measured and used for comparison; besides accuracy, there is no assessment of the performance (e.g., time) for completing a task, nor of the scalability of each tool. Also, the learning problems refer only to classification tasks; it should be explicitly mentioned from the intro that this benchmark does not address other types of ML tasks.
(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.
The paper is well written, although many concepts and assumptions must be explained in more detail, possibly through examples. In detail:

Intro, p1, par2. Explain better what symbolic machine learning is, possibly by enriching the example (for the chemical compounds) with a) example rules (e.g., OWL syntax) and b) example algos (narrative) used for the classification. Mentioning “…using some algorithms…” and “…HORN rules…” does not help the reader understand the problem addressed. Also, mention and justify why only classification tasks are considered.
Intro, p1, par1. Please add refs to the benchmarks mentioned in the intro.
Intro, p2, par1. The 2nd reason for the effort required to model background knowledge is not adequately justified. Why? OWL is widely used for this reason.
RW. Although the paper states some limitations of the SoA benchmarks (e.g., “…data is in tabular format…”, …”the provided datasets are not sufficiently structured…”, etc), there is no clear comparison between SML-Bench and the benchmarks referenced in this section and the contributions wrt the existing efforts. I would expect to identify a set of dimensions (e.g., size of datasets, lack of or existence of schema, ML problems benchmarked, etc) on which you compare with other benchmarks and summarize them in a table.
Section 3. I would expect in this section (or after it) to include a section with the basic elements of the benchmark. Usually, benchmarks contain the data, the tasks (e.g., queries, ML tasks, etc.) used for evaluation, the parameters for configuration, and the quality factors – goals to be measured/compared. For the latter, what are the measures that you consider?
Section 4, “Paper is available and Availability of the datasets”. Both criteria are trivial to consider/mention in your methodology. Please also state why you do not produce artificial datasets/learning scenarios (for testing various parameters of the benchmark, such as scalability).
Section 4, Derivable Inductive Learning. Please explain whether the final review/selection of all pubs and datasets was performed manually. How did you assess that a dataset represents an inductive learning scenario? Perhaps you could include an explanation sentence on why only 11 out of 805 datasets of Table 2 have been selected.
Section 4, Table 3. Add a column with the origin or a ref to the paper for each dataset used in SML-Bench.
Section 5. Please explain the extensibility of your framework for assessing other tools; what are the configuration steps?
Section 5, p8, par1. Fix typo (positives).
Section 6. Please include an intro with the goals/overview of the evaluation, wrt the datasets, the tasks, the measures and the tools tested (with any configuration applied). Is the goal of this section to evaluate the tools or the SML benchmark itself?
Section 6. Please explain why you do not consider performance or scalability assessment in your benchmark.

Review #2
Anonymous submitted on 17/Oct/2017
Review Comment:

The paper proposes a benchmarking framework for the evaluation of inductive learning tools. The main contribution is a benchmarking framework that consists of nine datasets, selected by reviewing published research papers in relevant venues and journals, and a set of relevant learning tasks. The creation of this framework allowed the comparison of various methods that may use different knowledge representation languages, such as OWL and Prolog. To achieve that, the authors converted the nine datasets to the various KR formats and established a folder naming convention for extending the framework with other learning tasks and learning tools.

The quality of the presentation is good and the paper can be followed easily. The Semantic Web community can benefit from the existence of such a benchmark framework. More specifically, the conversion of datasets to OWL and the inclusion of other inductive learning tools will enable a more comprehensive comparison with tools that work on OWL.

I recommend acceptance.

Some minor comments:

- Section 2 (Related Work) is overloaded with benchmark proposals that are either out of the scope of SML-Bench or that benchmark database systems rather than machine learning tools. I recommend organising them into subsections and highlighting the most relevant benchmarks.

Review #3
Anonymous submitted on 22/Oct/2017
Review Comment:

In this paper, the authors present a benchmarking framework for structured machine learning, called SML-Bench. The paper is the result of a substantial amount of effort, given the extensive literature scanning for identifying suitable datasets for SML-Bench, along with the data conversion efforts. However, even though the methodology is sound, the results have several drawbacks, explicitly acknowledged by the authors.

Otherwise, the authors provide a clear goal at the beginning of the paper. The paper is very well written, is quite clear and easy to follow.

Section 2 offers a good and broad overview of existing approaches and solutions, and positions their approach as sufficiently unique, to the best of their knowledge.

Section 3 provides an overview of several machine learning challenges in the context of structured data, a.k.a. choke points.

Section 4 describes the methodology for discovering, selecting and preparing datasets to be used by SML-Bench. The methodology is sound, but unfortunately the authors currently have only 9 datasets at their disposal for the benchmark. This state is understandable, as dataset preparation is a high-effort activity. The effort to identify the datasets in the first place is a good result - I'd urge the authors to publish the full list of candidate datasets, for reuse purposes.

Section 5 describes the detailed implementation of the framework. The architecture, as well as the scenarios, tools and settings are well described, and the engineering decisions within them are good.

Section 6 presents the evaluation of SML-Bench. Here, the authors identify several drawbacks and issues of / with the selected tools. Namely, most of them do not run very well with their default settings, i.e. out of the box. This is understandable, as they usually need to be tuned to the specific learning task at hand. But in order to compare them with 'tweaked' parameters, the parameters of all tools need to be optimized -- for which the authors currently await feedback from the developers and the community, as pointed out in Sections 7 and 8.

With this, the SML-Bench framework proves to be an important part of the domain, as it can impact advancement and innovation in it. The domain of structured machine learning lacks means of benchmarking the variety of tools and KR languages available, so this framework may spur further innovation in an otherwise mature field.