RetroWeb, the open-source application for data extraction on the Web

Speaker(s): Fabrice Estiévenart
Language: French Level: Advanced Event type: Talk
Date: Thursday, 8 July 2010 Time: 15:00 Duration: 20 minutes
Venue: ENSEIRB - Amphi A

Retroweb is an open-source application for data extraction on the Web. Its visual interface makes it possible to quickly build robust and efficient web wrappers. The wrappers generated by Retroweb extract data from web pages and convert it into structured, semantic information. Such wrappers can feed information to a document management tool or to any other enterprise database. In addition, Retroweb can easily be integrated into a search engine or a competitive intelligence application. It can also help migrate a web site to a content management system.

Two complementary modules make up the architecture of Retroweb: Retroweb-Browser is the GUI for the visual definition of extraction rules, and Retroweb-Wrapper relies on those rules to extract web data into an XML-based format. In the context of web intelligence or monitoring, that extraction can be repeated periodically. Technically, Retroweb is a Java 6 application based on the Eclipse RCP framework. The web rendering engine is Gecko (also used in Firefox), and the extraction rules are based on XPath, a W3C standard. Retroweb is built on an MVC architecture, which keeps the source code compact and makes maintenance and evolution easier.
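To make the idea of XPath-based extraction rules concrete, here is a minimal sketch using only the standard `javax.xml.xpath` API. It is not Retroweb's code or rule format: the class name, the sample page, the rule layout (one XPath per record plus one relative XPath per field) and the output element names are all assumptions made for illustration. It also assumes a well-formed page, whereas a real wrapper would work on the DOM produced by a rendering engine such as Gecko.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: a hypothetical wrapper, not part of Retroweb.
final class XPathWrapperSketch {

    // Hypothetical well-formed fragment standing in for a rendered forum page.
    static final String SAMPLE_PAGE =
        "<html><body>"
        + "<div class='post'><span class='author'>alice</span>"
        + "<p class='body'>First message</p></div>"
        + "<div class='post'><span class='author'>bob</span>"
        + "<p class='body'>Second message</p></div>"
        + "</body></html>";

    // Assumed rule layout: one XPath selects each record...
    static final String RECORD_RULE = "//div[@class='post']";
    // ...and one XPath per field is evaluated relative to that record.
    static final String[][] FIELD_RULES = {
        {"author", "span[@class='author']"},
        {"body", "p[@class='body']"}
    };

    // Applies the rules to a page and returns one field map per record.
    static List<Map<String, String>> extract(String page) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(page)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList records =
            (NodeList) xpath.evaluate(RECORD_RULE, doc, XPathConstants.NODESET);
        List<Map<String, String>> rows = new ArrayList<Map<String, String>>();
        for (int i = 0; i < records.getLength(); i++) {
            Node record = records.item(i);
            Map<String, String> row = new LinkedHashMap<String, String>();
            for (String[] rule : FIELD_RULES) {
                // The two-argument evaluate returns the node's string value.
                row.put(rule[0], xpath.evaluate(rule[1], record).trim());
            }
            rows.add(row);
        }
        return rows;
    }

    // Serializes the extracted records to a simple XML string.
    static String toXml(List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder("<posts>");
        for (Map<String, String> row : rows) {
            sb.append("<post>");
            for (Map.Entry<String, String> e : row.entrySet()) {
                sb.append('<').append(e.getKey()).append('>')
                  .append(escape(e.getValue()))
                  .append("</").append(e.getKey()).append('>');
            }
            sb.append("</post>");
        }
        return sb.append("</posts>").toString();
    }

    // Escapes the characters that are significant in XML element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(extract(SAMPLE_PAGE)));
    }
}
```

Keeping the rules as data, separate from the code that applies them, is what would let a visual tool generate them and a separate wrapper module replay them on a schedule.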

Retroweb is currently hosted on PALLAVI, a software forge deployed and maintained by CETIC as part of its CELLAVI project (Centre d’Expertise en Logiciel Libre à Vocation Industrielle). One of the main objectives of CELLAVI is to guide companies in their evaluation, selection and adoption of open-source applications and libraries.

First, my presentation will explain the main challenges behind web data extraction and describe our tool-supported methodology. Second, a case study will illustrate how to extract structured information from a discussion forum.

About the author

A graduate of the University of Namur (Belgium), Fabrice Estiévenart is currently a senior research engineer at CETIC (Centre d’Excellence en Technologies de l’Information et de la Communication) within the ICS (Intelligent Content and Semantic) team. He developed Retroweb, an open-source application for web data extraction, and has acquired extensive expertise in unstructured data management and search engines. Most of his work relies on open-source technologies such as Lucene, Nutch and Solr.