Examining the content load of part of speech blocks for information retrieval

Authors:
Christina Lioma; Iadh Ounis
Affiliations:
University of Glasgow, Scotland, U.K.; University of Glasgow, Scotland, U.K.
Venue:
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Year:
2006

Abstract

We investigate the connection between part of speech (POS) distribution and content in language. We define POS blocks to be groups of parts of speech. We hypothesise that the frequency of POS blocks is directly proportional to their content salience. We also hypothesise that the class membership of the parts of speech within such blocks reflects the content load of the blocks, on the basis that open class parts of speech are more content-bearing than closed class parts of speech. We test these hypotheses in the context of Information Retrieval, by representing queries syntactically and removing content-poor blocks from them, in line with the two hypotheses. For our first hypothesis, we induce POS distribution information from a corpus, and approximate the probability of occurrence of POS blocks using two statistical estimators separately. For our second hypothesis, we use simple heuristics to estimate the content load within POS blocks. We use the Text REtrieval Conference (TREC) queries of 1999 and 2000 to retrieve documents from the WT2G and WT10G test collections, with five different retrieval strategies. Experimental results confirm that our hypotheses hold in the context of Information Retrieval.
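
The query-pruning idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the two estimators, the block length, the heuristics, or any threshold, so everything concrete here is an assumption made for illustration and not the authors' method: blocks are taken to be contiguous POS n-grams, the probability estimate is a plain relative frequency, tags are assumed to be Penn Treebank style, the open class set is a rough guess, and the names `pos_blocks`, `block_probabilities`, `open_class_ratio`, `prune_query`, and `min_prob` are hypothetical.

```python
from collections import Counter

# Assumption: Penn Treebank-style tags. Nouns, verbs, adjectives and adverbs
# are treated as open class; every other tag is treated as closed class.
OPEN_CLASS_PREFIXES = ("NN", "VB", "JJ", "RB")


def pos_blocks(tags, n):
    """Return all contiguous POS n-grams (blocks) in a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def block_probabilities(tagged_corpus, n):
    """Relative-frequency estimate of each POS block over a tagged corpus."""
    counts = Counter()
    for sentence in tagged_corpus:
        counts.update(pos_blocks([tag for _, tag in sentence], n))
    total = sum(counts.values())
    return {block: count / total for block, count in counts.items()}


def open_class_ratio(block):
    """Fraction of tags in a block that belong to the assumed open class set."""
    return sum(tag.startswith(OPEN_CLASS_PREFIXES) for tag in block) / len(block)


def prune_query(tagged_query, probs, n, min_prob):
    """Keep the words covered by at least one block with probability >= min_prob."""
    tags = [tag for _, tag in tagged_query]
    keep = set()
    for i, block in enumerate(pos_blocks(tags, n)):
        if probs.get(block, 0.0) >= min_prob:
            keep.update(range(i, i + n))
    return [word for i, (word, _) in enumerate(tagged_query) if i in keep]


# Toy data, invented for this example only.
corpus = [
    [("the", "DT"), ("red", "JJ"), ("car", "NN")],
    [("a", "DT"), ("fast", "JJ"), ("train", "NN")],
    [("of", "IN"), ("the", "DT"), ("car", "NN")],
]
query = [("find", "VB"), ("the", "DT"), ("red", "JJ"), ("car", "NN")]

probs = block_probabilities(corpus, n=2)
pruned = prune_query(query, probs, n=2, min_prob=0.3)
print(pruned)
print(open_class_ratio(("JJ", "NN")), open_class_ratio(("IN", "DT")))
```

Frequency-based pruning and the open class ratio are kept as separate functions because the abstract treats them as two separate hypotheses; how (or whether) they should be combined is left open here.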