Distribution of content words and phrases in text and language modelling

Authors:
Slava M. Katz
Affiliations:
Weston Language Research, 138 Weston Road, Weston, CT 06883, USA. E-mail: 72613.2401@compuserve.com
Venue:
Natural Language Engineering
Year:
1996

Abstract

This paper addresses the distribution of words and phrases in text, a problem of great general interest and of importance for many practical applications. Existing models of word distribution treat observed sequences of words in text documents as the outcome of stochastic processes; the corresponding distributions of the number of word occurrences per document are modelled as mixtures of Poisson distributions whose parameter values are fitted to the data. We pursue a linguistically motivated approach to statistical language modelling and use observable text characteristics as model parameters. Multi-word technical terms, which are intrinsically content entities, are chosen for experimentation. Their occurrence and occurrence dynamics are investigated using a 100-million-word collection of about 13,000 varied technical documents. The models describing word distribution in text are derived from a linguistic interpretation of the process of text formation, with the probabilities of word occurrence being functions of observable and linguistically meaningful text characteristics. Experiments confirm that the proposed models adequately describe the distributions of words and phrases actually observed in text. The paper has two focuses: modelling the distributions of content words and phrases among different documents, and modelling word occurrence dynamics within documents together with estimation of the corresponding probabilities. Accordingly, the application areas for the new modelling paradigm include information retrieval and speech recognition.
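
The abstract gives no formulas, so the following is only an illustrative sketch of what "observable text characteristics as model parameters" can look like. It uses the K mixture form that textbook presentations commonly attribute to this paper; whether it matches the paper's own parameterization is an assumption. The function names, and the symbols cf (total occurrences of a term in the collection), df (number of documents containing the term) and n_docs (collection size), are chosen here for illustration.

```python
def k_mixture_params(cf, df, n_docs):
    """Derive (alpha, beta) directly from observable counts.

    Assumed textbook form of the K mixture, not taken from the abstract:
      lam   = cf / n_docs        average occurrences per document
      beta  = (cf - df) / df     extra occurrences per containing document
      alpha = lam / beta
    """
    if not (0 < df < cf and df <= n_docs):
        raise ValueError("need 0 < df < cf and df <= n_docs")
    lam = cf / n_docs
    beta = (cf - df) / df
    alpha = lam / beta
    return alpha, beta


def k_mixture_pmf(k, alpha, beta):
    """Probability that a document contains the term exactly k times."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Geometric tail shared by every k, including k = 0.
    tail = (alpha / (beta + 1.0)) * (beta / (beta + 1.0)) ** k
    # The point mass (1 - alpha) is added only at k = 0.
    return (1.0 - alpha) + tail if k == 0 else tail


if __name__ == "__main__":
    # Hypothetical counts, made up for this example.
    alpha, beta = k_mixture_params(cf=300, df=100, n_docs=1000)
    for k in range(4):
        print(k, k_mixture_pmf(k, alpha, beta))
```

Under this form nothing is fitted iteratively: the probability of zero occurrences equals 1 - df/n_docs and the mean number of occurrences equals cf/n_docs, so both observed statistics are reproduced by construction.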