Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model

  • Authors:
  • Susan Mengel; Yaoquin Jing

  • Affiliations:
  • Computer Science, Texas Tech University, Lubbock; Computer Science, Texas Tech University, Lubbock

  • Venue:
  • WISE '09 Proceedings of the 10th International Conference on Web Information Systems Engineering
  • Year:
  • 2009

Abstract

Automated techniques can help extract information from the Web. This paper proposes a new semi-automatic approach, based on the maximum entropy segmental Markov model, for extracting structured data from Web pages. Two ideas motivate it: modeling the sequences that embed the structured data, rather than their surrounding context, to reduce the number of training Web pages required; and preventing the training data from yielding models that are too specific or too general. Experimental results show that the approach outperforms Stalker when only one training Web page is provided.