RetroWeb, the open-source application for data extraction on the Web

Speaker(s) : Fabrice Estiévenart
Language : French Level : Confirmed Nature : Conference
Date : Thursday 8 July 2010 Schedule : 15h00 Duration : 20 minutes
Place: ENSEIRB - Amphi A

Retroweb is an open-source application for data extraction on the Web. Its visual interface makes it possible to rapidly build robust and efficient web wrappers. Retroweb-generated wrappers extract data from web pages and convert it into structured, semantic information. Such wrappers can feed information to a document management tool or to any other enterprise database. In addition, Retroweb can easily be integrated into a search engine or a competitive intelligence application. It can also help migrate a web site to a content management system.

Two complementary modules make up the architecture of Retroweb: Retroweb-Browser is the GUI for the visual definition of extraction rules, and Retroweb-Wrapper relies on those rules to extract web data into an XML-based format. In the context of web intelligence or monitoring, the extraction can be repeated periodically. Technically, Retroweb is a Java 6 application based on the Eclipse RCP framework. The web rendering engine is Gecko (also used in Firefox) and the extraction rules are based on XPath, a W3C standard. Retroweb is built on an MVC architecture, which keeps the source code compact and makes it easier to maintain and evolve.
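To make the idea of XPath-based extraction rules concrete, here is a minimal sketch using only the standard `javax.xml.xpath` API of the JDK. It is not Retroweb's code and does not reflect its actual rule format, which this abstract does not describe. The class name `XPathWrapperSketch`, the sample page, the field names and the output element names are all hypothetical. The sample is a small well-formed document standing in for a page that a real wrapper would first obtain from a rendering engine.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathWrapperSketch {

    // Hypothetical, well-formed stand-in for a rendered forum page.
    static final String SAMPLE_PAGE =
        "<html><body>"
        + "<div class='post'><span class='author'>alice</span>"
        + "<p class='body'>First message</p></div>"
        + "<div class='post'><span class='author'>bob</span>"
        + "<p class='body'>A reply</p></div>"
        + "</body></html>";

    // Applies one record rule and several field rules, then emits simple XML.
    static String extractToXml(String page) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Assumed rule set: field name -> XPath relative to one record node.
            Map<String, String> fieldRules = new LinkedHashMap<String, String>();
            fieldRules.put("author", "span[@class='author']");
            fieldRules.put("body", "p[@class='body']");

            // Record rule: one node per forum post.
            NodeList records = (NodeList) xpath.evaluate(
                "//div[@class='post']", doc, XPathConstants.NODESET);

            StringBuilder out = new StringBuilder("<posts>");
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                out.append("<post>");
                for (Map.Entry<String, String> rule : fieldRules.entrySet()) {
                    // Evaluates the field rule with the record as context node.
                    String value = xpath.evaluate(rule.getValue(), record).trim();
                    out.append('<').append(rule.getKey()).append('>')
                       .append(escape(value))
                       .append("</").append(rule.getKey()).append('>');
                }
                out.append("</post>");
            }
            return out.append("</posts>").toString();
        } catch (Exception e) {
            throw new IllegalStateException("Extraction failed", e);
        }
    }

    // Escapes the characters that would break element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(extractToXml(SAMPLE_PAGE));
    }
}
```

Keeping the rules as data (a map of field names to XPath expressions) rather than as code is one plausible way for a visual editor to produce rules that a separate extraction module then executes, possibly on a schedule.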

Retroweb is currently hosted on PALLAVI, a software forge deployed and maintained by CETIC as part of its CELLAVI project (Centre d’Expertise en Logiciel Libre à Vocation Industrielle). One of the main objectives of CELLAVI is to guide companies in their evaluation, selection and adoption of open-source applications and libraries.

First, my presentation will explain the main challenges behind web data extraction and describe our tool-supported methodology. Second, a case study will illustrate how to extract structured information from a discussion forum.

About the author

A graduate of the University of Namur (Belgium), Fabrice Estiévenart is currently a senior research engineer at CETIC (Centre d’Excellence en Technologies de l’Information et de la Communication) within the ICS (Intelligent Content and Semantic) team. He has developed Retroweb, an open-source application for web data extraction, and has acquired extensive expertise in unstructured data management and search engines. Most of his work relies on open-source technologies such as Lucene, Nutch or Solr.