Information extraction from web documents based on local unranked tree automaton inference

  • Authors:
  • Raymond Kosala;Maurice Bruynooghe;Jan Van Den Bussche;Hendrik Blockeel

  • Affiliations:
  • K.U.Leuven, Dept. of Computer Science, Celestijnenlaan, Leuven;K.U.Leuven, Dept. of Computer Science, Celestijnenlaan, Leuven;University of Limburg, Dept. WNI, Diepenbeek;K.U.Leuven, Dept. of Computer Science, Celestijnenlaan, Leuven

  • Venue:
  • IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
  • Year:
  • 2003

Abstract

Information extraction (IE) aims at extracting specific information from a collection of documents. A lot of previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for IE from semi-structured documents based on learning.
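
The distinction between ranked and unranked trees in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function names `to_unranked` and `to_binary` are hypothetical, the sample markup is invented, and the first-child/next-sibling encoding shown is only one common way of obtaining a ranked (here binary) tree, not necessarily the one used in the earlier work. In an unranked tree a node keeps however many children the markup gives it; in the ranked encoding every node has a fixed number of slots.

```python
import xml.etree.ElementTree as ET


def to_unranked(elem):
    """Return (label, children); a node may have any number of children."""
    return (elem.tag, [to_unranked(child) for child in elem])


def to_binary(siblings):
    """First-child/next-sibling encoding of a list of unranked sibling nodes.

    Every node becomes (label, first_child, next_sibling), so its rank is
    fixed at two; None marks an empty slot.
    """
    if not siblings:
        return None
    (label, children), rest = siblings[0], siblings[1:]
    return (label, to_binary(children), to_binary(rest))


# Invented, well-formed sample markup: one row with three cells, one with one.
doc = ET.fromstring("<table><tr><td/><td/><td/></tr><tr><td/></tr></table>")

unranked = to_unranked(doc)   # rows keep 3 and 1 children respectively
binary = to_binary([unranked])  # the same document with exactly two slots per node
```

In the unranked form the three cells of the first row are direct children of that row. In the binary form they become a chain of next-sibling links, so the direct parent-child relation between a row and its later cells is no longer visible locally.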