Enhancing automatic term recognition algorithms with HTML tags processing

  • Authors:
  • Milan Lučanský;Marián Šimko;Mária Bieliková

  • Affiliations:
  • Slovak University of Technology, Ilkovičova, Bratislava, Slovakia;Slovak University of Technology, Ilkovičova, Bratislava, Slovakia;Slovak University of Technology, Ilkovičova, Bratislava, Slovakia

  • Venue:
  • Proceedings of the 12th International Conference on Computer Systems and Technologies
  • Year:
  • 2011

Abstract

We focus on mining relevant information from web pages. Unlike plain text documents, web pages contain an additional source of potentially relevant information: easily processable mark-up. We propose an approach to keyword extraction that enhances Automatic Term Recognition (ATR) algorithms, which are designed for plain text documents, with an analysis of the HTML tags present in the document. We distinguish tags that have semantic potential. We present the results of an experiment conducted on a set of Wikipedia pages, which show that the enhancement yields better results than using ATR algorithms alone.
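
The general idea of combining a plain-text term score with mark-up analysis can be illustrated with a minimal sketch. It is not the authors' method: raw term frequency stands in for a real ATR score, the weights in `TAG_WEIGHTS` are hypothetical values chosen for illustration, and multiplying the base score by the highest weight among the enclosing tags is an assumed way of combining the two signals. The names `TagAwareCounter` and `score_terms` are likewise invented for this example.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import re

# Hypothetical weights for tags assumed to carry semantic potential.
# These values are illustrative only and do not come from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.2}


class TagAwareCounter(HTMLParser):
    """Counts words and remembers the strongest tag weight seen for each word."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                      # stack of currently open tags
        self.frequency = Counter()               # stand-in for an ATR score
        self.boost = defaultdict(lambda: 1.0)    # best tag weight per word

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Weight of this text run: the highest weight among enclosing tags.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.frequency[word] += 1
            self.boost[word] = max(self.boost[word], weight)


def score_terms(html):
    """Return {word: frequency * strongest tag weight} for an HTML string."""
    parser = TagAwareCounter()
    parser.feed(html)
    parser.close()
    return {w: f * parser.boost[w] for w, f in parser.frequency.items()}


if __name__ == "__main__":
    page = (
        "<html><body><h1>Term recognition</h1>"
        "<p>Recognition of a term in <b>text</b> and text.</p></body></html>"
    )
    for word, score in sorted(score_terms(page).items(), key=lambda kv: -kv[1]):
        print(f"{word}: {score}")
```

In this sketch, a word that appears once inside a heading outranks a word with the same frequency that appears only in body text. A real system would replace the frequency count with an actual ATR algorithm and would operate on multi-word term candidates rather than single words.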