Identifying document topics using the Wikipedia category network

Authors:
Peter Schönhofen
Affiliations:
Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary (e-mail: schonhofen@ilab.sztaki.hu)
Venue:
Web Intelligence and Agent Systems
Year:
2009

Abstract

In the last few years, the size and coverage of Wikipedia, a community-edited, freely available online encyclopedia, have reached the point where it can be used effectively to identify the topics discussed in a document, much as an ontology or taxonomy would be. In this paper we show that even a fairly simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves from their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of (or in addition to) their texts.

This work was supported by the NKFP projects MOLINGV and Language Miner.
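The general idea of characterizing a document through article titles and their categories can be illustrated with a small sketch. This is not the paper's algorithm, whose details the abstract does not give. It is a minimal illustration under stated assumptions: the `TITLE_TO_CATEGORIES` index is a hypothetical toy stand-in for data extracted from a Wikipedia dump, titles are matched greedily by longest word sequence, and each matched title splits one vote evenly among its categories. The names `rank_categories` and `augment_with_categories` are likewise made up for this example.

```python
import re
from collections import Counter

# Hypothetical toy index (assumption): lowercased article title -> categories.
# A real index would be built from a Wikipedia dump.
TITLE_TO_CATEGORIES = {
    "neural network": ["Machine learning", "Artificial intelligence"],
    "support vector machine": ["Machine learning", "Classification algorithms"],
    "budapest": ["Capitals in Europe", "Cities in Hungary"],
}

# Longest title, in words, bounds the window used when matching.
MAX_TITLE_WORDS = max(len(title.split()) for title in TITLE_TO_CATEGORIES)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_categories(text, top_k=3):
    """Return up to top_k categories, ranked by votes from matched titles."""
    tokens = tokenize(text)
    votes = Counter()
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate title first, then shorter ones.
        for n in range(min(MAX_TITLE_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            categories = TITLE_TO_CATEGORIES.get(candidate)
            if categories:
                # Assumption: each title occurrence contributes one vote,
                # split evenly among the categories of that article.
                for category in categories:
                    votes[category] += 1.0 / len(categories)
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return [category for category, _ in votes.most_common(top_k)]


def augment_with_categories(text, top_k=3):
    """Represent a document by its tokens plus category features."""
    category_features = ["CAT:" + c for c in rank_categories(text, top_k)]
    return tokenize(text) + category_features


if __name__ == "__main__":
    doc = "A neural network and a support vector machine were trained."
    print(rank_categories(doc, top_k=1))
    print(augment_with_categories(doc, top_k=1))
```

The `augment_with_categories` helper mirrors the "in addition to their texts" option mentioned in the abstract: category features are appended to the ordinary tokens, so a downstream classifier or clustering method can use both. Returning only the `CAT:` features would correspond to the "instead of" option.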