Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Authors:
Evgeniy Gabrilovich; Shaul Markovitch
Venue:
The Journal of Machine Learning Research
Year:
2007

Abstract

Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation.
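
The feature generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the five-concept ontology, the keyword sets, the overlap-based scoring, and the names `ONTOLOGY`, `tokenize`, `ancestors`, and `generate_features` are all assumptions made for illustration. The sketch scores each concept by how many of its descriptive keywords occur in the document, keeps the best match, and adds that concept together with its ancestors to the bag of words, so the surrounding context decides which sense of an ambiguous word such as "bank" is used.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> (parent concept, descriptive keywords).
# A real directory such as the ODP would hold hundreds of thousands of concepts.
ONTOLOGY = {
    "Top": (None, set()),
    "Business": ("Top", {"company", "market", "shares"}),
    "Business/Banking": ("Business", {"bank", "loan", "interest", "deposit"}),
    "Science": ("Top", {"research", "study"}),
    "Science/Geography": ("Science", {"river", "bank", "water", "shore"}),
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each token."""
    tokens = (w.strip(".,;:!?").lower() for w in text.split())
    return [t for t in tokens if t]


def ancestors(concept):
    """Yield the ancestors of a concept, excluding the uninformative root."""
    parent = ONTOLOGY[concept][0]
    while parent is not None and parent != "Top":
        yield parent
        parent = ONTOLOGY[parent][0]


def generate_features(text, top_k=1):
    """Return bag-of-words counts augmented with ontology concept features."""
    words = tokenize(text)
    features = Counter(words)  # the plain bag of words
    vocabulary = set(words)
    # Score each concept by keyword overlap with the document (an assumed,
    # deliberately simple stand-in for contextual analysis).
    scores = {c: len(vocabulary & kw) for c, (_, kw) in ONTOLOGY.items()}
    ranked = sorted((c for c in scores if scores[c] > 0),
                    key=lambda c: (-scores[c], c))
    for concept in ranked[:top_k]:
        features["CONCEPT:" + concept] += 1
        # Generalize: also add the more general concepts above the match.
        for general in ancestors(concept):
            features["CONCEPT:" + general] += 1
    return features


if __name__ == "__main__":
    for doc in ("The bank raised the interest rate on every loan.",
                "We walked along the river bank near the water."):
        feats = generate_features(doc)
        print(sorted(f for f in feats if f.startswith("CONCEPT:")))
```

In this toy setting, the word "bank" maps to a banking concept in the first document and to a geography concept in the second, which is the kind of implicit disambiguation the abstract refers to. The augmented counts could then be passed to any standard learner in place of the plain bag of words.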