We investigate the connection between part of speech (POS) distribution and content in language. We define POS blocks to be groups of parts of speech. We hypothesise that there is a directly proportional relationship between the frequency of POS blocks and their content salience. We also hypothesise that the class membership of the parts of speech within such blocks reflects the content load of the blocks, on the basis that open class parts of speech are more content-bearing than closed class parts of speech. We test these hypotheses in the context of Information Retrieval by representing queries syntactically and removing content-poor blocks from them, in line with the two hypotheses. For our first hypothesis, we induce POS distribution information from a corpus and approximate the probability of occurrence of POS blocks using two statistical estimators, applied separately. For our second hypothesis, we use simple heuristics to estimate the content load within POS blocks. We use the Text REtrieval Conference (TREC) queries of 1999 and 2000 to retrieve documents from the WT2G and WT10G test collections, with five different retrieval strategies. Experimental results confirm that our hypotheses hold in the context of Information Retrieval.
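The pruning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: POS blocks are taken to be fixed-length n-grams of tags, the probability estimate is a plain relative frequency (the abstract does not name its two estimators), the content-load heuristic is the share of open class tags in a block, and tags follow Penn Treebank conventions. The function names and thresholds are hypothetical.

```python
from collections import Counter

# Assumption: Penn Treebank-style tags; nouns, verbs, adjectives and
# adverbs are treated as open class, everything else as closed class.
OPEN_CLASS_PREFIXES = ("NN", "VB", "JJ", "RB")


def pos_blocks(tags, n):
    """Return every contiguous block of n POS tags (assumed block definition)."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def block_probabilities(tagged_sentences, n):
    """Relative-frequency estimate of each POS block seen in a tagged corpus."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(pos_blocks([tag for _, tag in sentence], n))
    total = sum(counts.values())
    return {block: c / total for block, c in counts.items()} if total else {}


def open_class_ratio(block):
    """Share of tags in the block that belong to an open class."""
    return sum(tag.startswith(OPEN_CLASS_PREFIXES) for tag in block) / len(block)


def prune_query(tagged_query, probs, n, min_prob, min_open_ratio):
    """Keep only query words covered by at least one block that is both
    frequent enough and open-class-heavy enough (hypothetical thresholds)."""
    if len(tagged_query) < n:
        return [word for word, _ in tagged_query]  # too short to form a block
    tags = [tag for _, tag in tagged_query]
    keep = [False] * len(tagged_query)
    for i, block in enumerate(pos_blocks(tags, n)):
        if probs.get(block, 0.0) >= min_prob and open_class_ratio(block) >= min_open_ratio:
            for j in range(i, i + n):
                keep[j] = True
    return [word for (word, _), kept in zip(tagged_query, keep) if kept]


# Toy, hand-tagged data for illustration only.
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("a", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("of", "IN"), ("the", "DT"), ("house", "NN")],
]
probs = block_probabilities(corpus, n=2)
query = [("find", "VB"), ("the", "DT"), ("hotel", "NN"), ("opened", "VBD")]
pruned = prune_query(query, probs, n=2, min_prob=0.2, min_open_ratio=1.0)
print(pruned)
```

In this toy setting the two criteria are combined with a logical AND for brevity; the abstract treats them as separate hypotheses, so in practice each filter would be evaluated on its own.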