TaxaMiner: an experimentation framework for automated taxonomy bootstrapping

Authors:
Vipul Kashyap; Cartic Ramakrishnan; Christopher Thomas; A. Sheth
Affiliations:
Clinical Informatics R&D, Partners HealthCare System, 93 Worcester St., Wellesley, MA 02481, USA; LSDIS Lab, Department of Computer Science, University of Georgia, 415 GSRC, Athens, GA 30602, USA; LSDIS Lab, Department of Computer Science, University of Georgia, 415 GSRC, Athens, GA 30602, USA; LSDIS Lab, Department of Computer Science, University of Georgia, 415 GSRC, Athens, GA 30602, USA
Venue:
International Journal of Web and Grid Services
Year:
2005

Abstract

Constructing domain ontologies for the Semantic Web is labor- and resource-intensive, and reducing that effort is crucial for the Semantic Web to scale. We present a framework for automated taxonomy construction that involves: (a) generating a cluster hierarchy from a document corpus using statistical clustering and NLP techniques; (b) extracting a topic hierarchy from this cluster hierarchy; and (c) assigning labels to nodes in the topic hierarchy. We identify metrics for estimating topic hierarchy quality and the parameters of an experimentation framework. In the experiments, MEDLINE serves as the document corpus and the MeSH thesaurus as the gold standard.
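
The three steps (a) to (c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the pruning rule, or the labeling method, so everything here is an assumption chosen for brevity. Step (a) uses a simple divisive split seeded by the least similar document pair, step (b) keeps only clusters with at least `min_docs` documents, and step (c) labels each node with its most frequent terms. All function names and the four-document toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def tf_vector(text):
    """Bag-of-words term-frequency vector (no stemming or stopword removal)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_cluster_hierarchy(ids, vecs):
    """(a) Assumed divisive clustering: seed with the least similar pair, then recurse."""
    node = {"docs": list(ids), "children": []}
    if len(ids) < 2:
        return node
    s1, s2 = min(combinations(ids, 2), key=lambda p: cosine(vecs[p[0]], vecs[p[1]]))
    left = [d for d in ids if cosine(vecs[d], vecs[s1]) >= cosine(vecs[d], vecs[s2])]
    right = [d for d in ids if d not in left]
    if left and right:  # stop splitting when the documents cannot be separated
        node["children"] = [build_cluster_hierarchy(left, vecs),
                            build_cluster_hierarchy(right, vecs)]
    return node

def extract_topic_hierarchy(node, min_docs=2):
    """(b) Assumed pruning rule: keep only clusters with at least min_docs documents."""
    kept = [extract_topic_hierarchy(c, min_docs)
            for c in node["children"] if len(c["docs"]) >= min_docs]
    return {"docs": node["docs"], "children": kept}

def assign_labels(node, vecs, k=2):
    """(c) Assumed labeling rule: the k most frequent terms among the node's documents."""
    totals = Counter()
    for d in node["docs"]:
        totals.update(vecs[d])
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    node["label"] = [term for term, _ in ranked[:k]]
    for child in node["children"]:
        assign_labels(child, vecs, k)
    return node

# Hypothetical toy corpus standing in for a document collection such as MEDLINE.
corpus = {
    "d1": "heart disease heart attack",
    "d2": "heart failure disease",
    "d3": "lung cancer lung tumor",
    "d4": "lung cancer therapy",
}
vecs = {d: tf_vector(text) for d, text in corpus.items()}
clusters = build_cluster_hierarchy(list(corpus), vecs)
tree = assign_labels(extract_topic_hierarchy(clusters), vecs)

for child in tree["children"]:
    print(child["label"], child["docs"])
```

On this toy corpus the sketch separates the heart-related documents from the lung-related ones and discards the singleton clusters beneath them. A real experiment would compare the resulting hierarchy against a gold standard such as MeSH, using the quality metrics the paper identifies.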