An Analytical Approach to Concept Extraction in HTML Environments

Authors:
Victor Fresno;Angela Ribeiro
Affiliations:
Escuela Superior de Ciencias Experimentales y Tecnologia, Rey Juan Carlos University, 28933 Mostoles, Madrid, Spain. v.fresno@escet.urjc.es;Industrial Automation Institute (IAI), Spanish Council for Scientific Research (CSIC), 28500 Arganda del Rey, Madrid, Spain. angela@iai.csic.es
Venue:
Journal of Intelligent Information Systems
Year:
2004

Abstract

The core of the Internet and World Wide Web revolution lies in their capacity to share huge quantities of data efficiently, but the rapid and chaotic growth of the Net has greatly complicated the task of sharing or mining useful information. Any inference process that draws on Internet information requires an adequate characterization of Web pages. The textual part of a page is one of the most important aspects to consider when characterizing it. This textual characterization should be made by extracting an appropriate set of relevant concepts that properly represent the text in the Web page. This paper presents a method for obtaining such a set of relevant concepts from a Web page, based essentially on estimating the relevance of each word in the page's text. Word relevance is defined by a combination of criteria that take into account characteristics of the HTML language as well as more classical measures such as the frequency and position of a word in a document. In addition, heuristic rules for the most suitable fusion of criteria are obtained through a statistical study. Several experiments are conducted to compare the performance of the proposed concept extraction method with that of other approaches, including a commercial tool. The results show that the proposed technique extracts concepts more successfully than the other methods tested.
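
To make the idea of combining criteria concrete, the following is a minimal Python sketch of a per-word relevance score built from the kinds of signals the abstract names: frequency, position, and HTML markup. It is not the authors' method. The abstract does not give the actual criteria definitions or the heuristic fusion rules, so everything specific here is an assumption made for illustration: the choice of four criteria, the set of emphasis tags, the "earlier is better" position measure, the weighted linear sum, the weight values, and the names WordCollector, word_relevance, and top_concepts. No stop-word removal or stemming is applied.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Assumed set of tags treated as "emphasis"; the paper's own set is not given here.
EMPHASIS_TAGS = {"b", "strong", "em", "i", "u", "h1", "h2", "h3"}
SKIPPED_TAGS = {"script", "style"}


class WordCollector(HTMLParser):
    """Collects (word, in_title, emphasized) triples in document order."""

    def __init__(self):
        super().__init__()
        self.open_count = Counter()  # how many of each tag are currently open
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.open_count[tag] += 1

    def handle_endtag(self, tag):
        if self.open_count[tag] > 0:
            self.open_count[tag] -= 1

    def handle_data(self, data):
        if any(self.open_count[t] > 0 for t in SKIPPED_TAGS):
            return  # ignore script and style content
        in_title = self.open_count["title"] > 0
        emphasized = any(self.open_count[t] > 0 for t in EMPHASIS_TAGS)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.words.append((word, in_title, emphasized))


def word_relevance(html, weights=(0.4, 0.3, 0.2, 0.1)):
    """Return {word: score}. The weights are illustrative, not from the paper."""
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    words = collector.words
    if not words:
        return {}

    total = len(words)
    freq = Counter(w for w, _, _ in words)
    max_freq = max(freq.values())
    title_words = {w for w, in_title, _ in words if in_title}
    emph_count = Counter(w for w, _, emphasized in words if emphasized)
    first_pos = {}
    for index, (w, _, _) in enumerate(words):
        first_pos.setdefault(w, index)

    w_freq, w_title, w_emph, w_pos = weights
    scores = {}
    for w, count in freq.items():
        frequency = count / max_freq               # normalized term frequency
        title = 1.0 if w in title_words else 0.0   # appears in <title>
        emphasis = emph_count[w] / count           # share of emphasized occurrences
        position = 1.0 - first_pos[w] / total      # earlier first occurrence scores higher
        scores[w] = (w_freq * frequency + w_title * title
                     + w_emph * emphasis + w_pos * position)
    return scores


def top_concepts(html, k=5):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = word_relevance(html)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    page = ("<html><head><title>Fuzzy clustering</title></head><body>"
            "<p>Fuzzy <b>clustering</b> groups web pages. "
            "Pages are compared by content.</p></body></html>")
    print(top_concepts(page, 3))
```

In this sketch a word that appears in the title, is emphasized, and occurs often and early ranks highest. Replacing the fixed linear sum with rules derived from data would be the step that corresponds to the statistical study mentioned in the abstract.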