FitLayout: An RDF-Based Framework and Toolkit for Web Page Content Analysis

Tracking #: 3217-4431

This paper is currently under review
Radek Burget
Hamza Salem

Responsible editor: Guest Editors Tools Systems 2022

Submission type: Tool/System Report
Despite the ongoing development of technologies for publishing structured information within web pages, many web sources still publish useful data solely as plain HTML documents. Data published in this way is difficult to extract and to integrate with other data sets. One promising approach to extracting it is to analyze the visual presentation of the data within the web page, but such analysis is complex to implement. In this paper, we present FitLayout, a framework and open-source toolkit that implements a web page processing workflow consisting of an arbitrary number of steps, during which all page information is represented using a unified RDF model. This model contains detailed information about the visual appearance of the page and of each of its content elements, as well as the results of analytical steps such as page segmentation. This representation makes it easy to archive all the details of rendered pages, to generate annotated data sets from web pages for different purposes, and to integrate them with other linked data. FitLayout also provides a platform for implementing vision-based page analysis algorithms and includes ready-to-use implementations of algorithms for page segmentation, identification of important content elements, and other tasks.