Automatic term categorization by extracting knowledge from the Web

Authors:
Leonardo Rigutini;Ernesto Di Iorio;Marco Ernandes;Marco Maggini
Affiliations:
Dipartimento di Ingegneria dell'Informazione, Università di Siena, Via Roma 56, I-53100-Siena-Italy. {rigutini,diiorio,ernandes,maggini}@dii.unisi.it
Venue:
Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence August 29 -- September 1, 2006, Riva del Garda, Italy
Year:
2006

Abstract

This paper addresses the problem of categorizing terms or lexical entities into a predefined set of semantic domains by exploiting the knowledge available online on the Web. The proposed system can be used for the automatic expansion of thesauri, limiting the human effort to the preparation of a small training set of tagged entities. Terms are classified by modeling the contexts in which terms of the same class usually appear. The Web serves as a large repository of such contexts, which are extracted by querying one or more search engines. In particular, the paper shows that the required knowledge can be obtained directly from the snippets returned by the search engines, without the overhead of downloading documents. Since the Web is continuously updated worldwide, this approach makes it possible to tackle open-domain term categorization while handling both the geographical and temporal variability of term semantics. The performance of different text classifiers is compared, showing that accuracy is very good regardless of the specific model, which validates the idea of using term contexts extracted from search engine snippets. Moreover, the experimental results indicate that only very few training examples are needed to reach the best performance (over 90% for the F1 measure).
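
The idea of classifying a term through the contexts it appears in can be illustrated with a minimal sketch. It is not the authors' implementation: the snippets below are invented toy strings standing in for search engine results (no search engine is queried), the multinomial Naive Bayes model is just one possible text classifier chosen for illustration, and the names `context_profile`, `ContextNaiveBayes`, and `categorize` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_profile(term, snippets):
    """Bag of words over all snippets of a term, excluding the term itself."""
    term_tokens = set(tokenize(term))
    counts = Counter()
    for snippet in snippets:
        counts.update(t for t in tokenize(snippet) if t not in term_tokens)
    return counts


class ContextNaiveBayes:
    """Multinomial Naive Bayes over term context profiles (illustrative choice)."""

    def fit(self, profiles, labels):
        self.class_terms = Counter(labels)      # number of training terms per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for profile, label in zip(profiles, labels):
            self.word_counts[label].update(profile)
        self.vocab = {w for wc in self.word_counts.values() for w in wc}
        return self

    def predict(self, profile):
        n_terms = sum(self.class_terms.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, wc in self.word_counts.items():
            total = sum(wc.values())
            score = math.log(self.class_terms[label] / n_terms)
            for word, k in profile.items():
                if word in self.vocab:  # words never seen in training are ignored
                    # Laplace (add-one) smoothing
                    score += k * math.log((wc[word] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: term -> (class, invented snippets). Assumption, not real data.
TRAIN = {
    "guitar": ("music", ["The guitar is a string instrument played in rock bands",
                         "Learn guitar chords and songs"]),
    "violin": ("music", ["The violin is a string instrument used in the orchestra",
                         "Violin concert tonight with the orchestra"]),
    "aspirin": ("medicine", ["Aspirin is a drug used to treat pain and fever",
                             "Take aspirin only as the doctor prescribes"]),
    "antibiotic": ("medicine", ["An antibiotic is a drug that treats bacterial infection",
                                "The doctor prescribes an antibiotic for infection"]),
}

model = ContextNaiveBayes().fit(
    [context_profile(term, snippets) for term, (_, snippets) in TRAIN.items()],
    [label for label, _ in TRAIN.values()],
)


def categorize(term, snippets):
    """Assign a semantic class to an unseen term from its snippets."""
    return model.predict(context_profile(term, snippets))


print(categorize("cello", ["The cello is a string instrument in the orchestra",
                           "Cello and violin concert"]))
print(categorize("ibuprofen", ["Ibuprofen is a drug used to treat pain and inflammation",
                               "Ask your doctor before taking ibuprofen"]))
```

In this sketch each term is represented only by the words surrounding it in its snippets, so an unseen term is assigned to the class whose training terms occur in similar contexts. In a real setting the snippet lists would come from search engine queries for each term.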