A probabilistic, integrative approach for improved natural language disambiguation

  • Authors:
  • Gerald Che-Shun Chao; Michael G. Dyer

  • Venue:
  • Doctoral dissertation, University of California, Los Angeles
  • Year:
  • 2003

Abstract

Recovering the semantics, or meaning, expressed in natural language is one of the central goals of natural language processing. The task is challenging due to the large number of ambiguities present in natural languages, such as part-of-speech assignments, structural dependencies, and word senses. Reliably resolving these ambiguities would allow computers to gain access to the knowledge represented in natural language, the format most commonly used in human communication. In this thesis we approach this complex problem by first decomposing it into four smaller subtasks: part-of-speech (POS) tagging, word sense disambiguation (WSD), chunking, and parsing. With this decomposition each subtask captures a subset of the ambiguities, which simplifies the problem and facilitates the application of exact algorithms to maximize accuracy. We then integrate the decisions from the subtasks to form the most plausible interpretation, in contrast to top-down modeling, which is prone to error propagation. We first apply machine learning algorithms to automatically train probabilistic models on an annotated training corpus, selecting a powerful probability model, maximum entropy, to incorporate diverse contexts systematically. We then capture the dependencies between the words within a sentence using Bayesian networks, over which we compute disambiguation decisions with an exact inference algorithm. To share information between subtasks, we introduce a new structural representation, called Cores-and-Modifiers, that succinctly describes structural features and improves both POS tagging and WSD. We also identify semantic contexts based on the WordNet lexical database to improve both chunking and parsing accuracy. To form the overall interpretation across these subtasks, we introduce an integrative process instead of a top-down model. Because the entire model is probabilistic, it enables the systematic re-integration of the diverse decisions from the subtasks based on their probabilities. To further improve this integration, we observe that the most uncertain decisions are the most error-prone, and thus their alternatives should also be examined. This process, named Most-probable Hypotheses Evaluation, selectively examines a small set of alternate hypotheses to determine the most plausible interpretation across subtasks, rather than treating them as disparate decisions. The resulting model, named the Integrative, Probabilistic Natural-language Parser and Interpreter (IPNPI), integrates the four subtasks to form the most plausible interpretations. The IPNPI model is evaluated on its accuracy on each of the four subtasks using standardized procedures, and we show that it improves on state-of-the-art accuracy in POS tagging, word sense disambiguation, chunking, and parsing. The synthesis of our separate-but-integrative approach is that the IPNPI model accurately and efficiently resolves natural language ambiguities by producing the most plausible interpretations across part-of-speech assignment, word sense distinction, phrasal identification, and structural dependencies.
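
As a concrete illustration of the maximum entropy model mentioned in the abstract, the following is a minimal sketch, not the thesis's actual implementation, of a conditional maximum entropy classifier for POS tagging trained by gradient ascent on the conditional log-likelihood. The feature templates, tag set, and toy corpus are hypothetical, chosen only to show how diverse, overlapping contexts enter the model as weighted features.

```python
import math
from collections import defaultdict

def features(context, tag):
    # Diverse, overlapping context features (hypothetical templates).
    word, prev_tag = context
    return [f"word={word}|tag={tag}",
            f"prev={prev_tag}|tag={tag}",
            f"suffix={word[-2:]}|tag={tag}"]

class MaxEnt:
    def __init__(self, tags):
        self.tags = tags
        self.w = defaultdict(float)  # feature weights

    def prob(self, context, tag):
        # p(tag | context) = exp(w . f(context, tag)) / Z(context)
        scores = {t: math.exp(sum(self.w[f] for f in features(context, t)))
                  for t in self.tags}
        return scores[tag] / sum(scores.values())

    def train(self, data, epochs=100, lr=0.5):
        # Stochastic gradient ascent on the conditional log-likelihood:
        # gradient = observed feature counts - expected feature counts.
        for _ in range(epochs):
            for context, gold in data:
                for t in self.tags:
                    p = self.prob(context, t)
                    for f in features(context, t):
                        self.w[f] += lr * ((t == gold) - p)

# Toy corpus: ((word, previous tag), gold tag) pairs.
data = [(("the", "<s>"), "DT"), (("dog", "DT"), "NN"),
        (("runs", "NN"), "VBZ"), (("fast", "VBZ"), "RB")]
model = MaxEnt(["DT", "NN", "VBZ", "RB"])
model.train(data)
print(round(model.prob(("dog", "DT"), "NN"), 3))  # approaches 1.0
```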
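
The abstract also describes computing disambiguation decisions over Bayesian networks with an exact inference algorithm. As a hedged sketch, the snippet below performs exact MAP inference on a chain-structured network via dynamic programming (the Viterbi recursion); the chain topology and the toy conditional probability tables are simplifying assumptions standing in for the richer sentence-level dependency structures the thesis describes.

```python
def viterbi(emit, trans):
    """Exact MAP inference on a chain-structured network.
    emit[t]: dict mapping state -> p(observation_t | state); the initial
    state prior is folded into emit[0] for brevity.
    trans: dict mapping (prev, cur) -> p(cur | prev)."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (p, [s]) for s, p in emit[0].items()}
    for dist in emit[1:]:
        best = {s: max((p * trans[(r, s)] * dist[s], path + [s])
                       for r, (p, path) in best.items())
                for s in dist}
    return max(best.values())

# Toy two-word sentence with two candidate tags (hypothetical numbers).
emit = [{"DT": 0.6, "NN": 0.1},
        {"DT": 0.1, "NN": 0.7}]
trans = {("DT", "DT"): 0.1, ("DT", "NN"): 0.9,
         ("NN", "DT"): 0.4, ("NN", "NN"): 0.6}
print(viterbi(emit, trans))  # -> (0.378, ['DT', 'NN'])
```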
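
Most-probable Hypotheses Evaluation is described as re-examining alternatives only for the least certain decisions. The sketch below captures that idea under stated assumptions: per-decision probability distributions are given, the k least confident winners are expanded to their top alternatives, and a caller-supplied joint scorer, a stand-in for the integrated probabilistic model, picks the best joint assignment. The function name, parameters, and toy numbers are illustrative, not the thesis's interface.

```python
from itertools import product

def mphe(decisions, joint_score, k=1, top=2):
    """Re-examine alternatives only for the k least certain decisions,
    then pick the joint assignment maximizing joint_score.
    decisions: one {label: probability} dict per subtask decision.
    joint_score: stand-in for the integrated probabilistic model."""
    best = [max(d, key=d.get) for d in decisions]
    # The lower the winner's probability, the more uncertain the decision.
    uncertain = set(sorted(range(len(decisions)),
                           key=lambda i: decisions[i][best[i]])[:k])
    # Expand uncertain slots to their `top` labels; fix the rest.
    slots = [sorted(d, key=d.get, reverse=True)[:top] if i in uncertain
             else [best[i]]
             for i, d in enumerate(decisions)]
    return max(product(*slots), key=joint_score)

# Toy example: the middle decision is uncertain, so its runner-up is
# also evaluated under the joint score (hypothetical numbers).
decisions = [{"NN": 0.9, "VB": 0.1},
             {"DT": 0.55, "NN": 0.45},
             {"VBZ": 0.8, "NNS": 0.2}]
score = lambda labels: sum(d[l] for d, l in zip(decisions, labels))
print(mphe(decisions, score))  # -> ('NN', 'DT', 'VBZ')
```

Only the uncertain slots multiply the search space, which is why the evaluation stays cheap: confident decisions contribute a single hypothesis each.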