Text Mining: A New Frontier for Lossless Compression

Authors:
Ian H. Witten; Zane Bray; Malika Mahoui; Bill Teahan
Venue:
DCC '99 Proceedings of the Conference on Data Compression
Year:
1999

Citing 0
Cited 14

Towards a digital library of popular music
Proceedings of the fourth ACM conference on Digital libraries

Web mining research: a survey
ACM SIGKDD Explorations Newsletter

Browsing around a Digital Library: Today and Tomorrow
COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

Combining PPM Models Using A Text Mining Approach
DCC '01 Proceedings of the Data Compression Conference

Augmenting Naive Bayes Classifiers with Statistical Language Models
Information Retrieval

Language independent authorship attribution using character level language models
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1

Extracting key-substring-group features for text classification
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Identifying synonymous concepts in preparation for technology mining
Journal of Information Science

Overview and semantic issues of text mining
ACM SIGMOD Record

A new ppm variant for chinese text compression
Natural Language Engineering

Identification of gene function using prediction by partial matching (PPM) language models
Proceedings of the 17th ACM conference on Information and knowledge management

On prediction using variable order Markov models
Journal of Artificial Intelligence Research

MALEF: Framework for distributed machine learning and data mining
International Journal of Intelligent Information and Database Systems

Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets
Expert Systems with Applications: An International Journal

Abstract

Data mining, a burgeoning new technology, is about looking for patterns in data. Likewise, text mining is about looking for patterns in text. It may be defined as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless, in modern Western culture, text is the most common vehicle for the formal exchange of information. The motivation for trying to extract information from it is compelling, even if success is only partial.

Analysis of natural language text is commonly thought of as a problem for artificial intelligence. And ever since extravagant claims for mechanical translation in the 1960s prompted an "AI winter" of despair and disillusionment, mainstream computer scientists have, understandably, been skeptical of claims for automatic natural language understanding. The most advanced efforts still rely on tightly-focused domains, small vocabularies, and quantities of specialist domain knowledge, painstakingly programmed in, and still the resulting systems are distressingly brittle. Whether contemporary attempts to codify "common-sense knowledge" (e.g. Lenat, 1995) will make much of a difference remains to be seen. Although corpus-driven, statistical language analysis (e.g. Garside et al., 1987) represents a promising approach for producing robust parsers, it does not help in putting the structures that are extracted to any use.

Text mining is possible because you do not have to understand text in order to extract useful information from it. Here are four examples. First, if only names could be identified, links could be inserted automatically to other places that mention the same name; these links are "dynamically evaluated" by calling upon a search engine to bind them at click time. Second, actions can be associated with different types of data, using either explicit programming or programming-by-demonstration techniques. A day/time specification appearing anywhere within one's email could be associated with diary actions such as updating a personal organizer or creating an automatic reminder, and each mention of a day/time in the text could raise a popup menu of calendar-based actions. Third, text could be mined for data in tabular format, allowing databases to be created from formatted tables such as stock-market information on Web pages. Fourth, an agent could monitor incoming newswire stories for company names and collect documents that mention them, acting as an automated press clipping service.

This paper aims to promote text compression as a key technology for text mining.
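
As a rough illustration of the fourth example in the abstract (the press clipping agent), the following is a minimal sketch only. The abstract does not say how company names are identified, so plain case-insensitive matching against a fixed watchlist stands in for that step here. The function name `clip_stories` and the sample stories and company names are hypothetical.

```python
import re
from collections import defaultdict


def clip_stories(stories, company_names):
    """Group stories by the watched company names they mention.

    Assumption: names are found by literal, case-insensitive matching
    against a fixed watchlist. This is a stand-in, not the paper's method.
    """
    # The lookarounds stop a name from matching inside a longer word,
    # e.g. "Globex" does not match "Globexx".
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in company_names
    }
    clippings = defaultdict(list)
    for story in stories:
        for name, pattern in patterns.items():
            if pattern.search(story):
                clippings[name].append(story)
    return dict(clippings)


if __name__ == "__main__":
    # Hypothetical newswire items and watchlist.
    newswire = [
        "Acme Corp posts record profit.",
        "Globex and Acme Corp announce merger.",
        "Weather stays mild.",
    ]
    for name, docs in clip_stories(newswire, ["Acme Corp", "Globex"]).items():
        print(name, len(docs))
```

A fixed watchlist keeps the sketch short; it cannot find names it has not been told about, which is the harder problem the abstract alludes to.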