Corpus-based stemming using cooccurrence of word variants

Authors:
Jinxi Xu; W. Bruce Croft
Affiliations:
Univ. of Massachusetts, Amherst; Univ. of Massachusetts, Amherst
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
1998

Abstract

Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, although the retrieval improvements it yields are small. Current stemming techniques do not, however, reflect language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique that uses corpus-based word-variant cooccurrence statistics to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages over conventional approaches that employ only morphological rules.
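
The general idea of refining a stemmer with cooccurrence statistics can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the crude prefix stemmer `crude_stem`, the function `refine_classes`, document-level cooccurrence counts, the Dice-style score, and the threshold value. The sketch starts from overly aggressive stem classes and keeps two variants together only if they tend to appear in the same documents.

```python
from collections import Counter, defaultdict
from itertools import combinations


def crude_stem(word, n=4):
    """Hypothetical aggressive stemmer: keep only the first n letters."""
    return word[:n]


def refine_classes(docs, threshold=0.5):
    """Split crude stem classes into groups of variants that co-occur.

    Illustrative only: the score is a Dice coefficient over document
    frequencies, which is an assumption and not the paper's statistic.
    """
    doc_freq = Counter()   # number of documents containing each word
    pair_freq = Counter()  # number of documents containing both words
    for doc in docs:
        words = set(doc.lower().split())
        doc_freq.update(words)
        # Only pairs sharing a crude stem are candidates for conflation.
        for a, b in combinations(sorted(words), 2):
            if crude_stem(a) == crude_stem(b):
                pair_freq[(a, b)] += 1

    # Union-find: merge candidate pairs whose score reaches the threshold.
    parent = {w: w for w in doc_freq}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), n_ab in pair_freq.items():
        dice = 2 * n_ab / (doc_freq[a] + doc_freq[b])
        if dice >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for w in doc_freq:
        groups[find(w)].add(w)
    return sorted(sorted(g) for g in groups.values())


docs = [
    "stock prices rose as stocks rallied",
    "stocks fell and the stock market closed",
    "police arrested the suspect",
    "the policy was revised",
]
classes = refine_classes(docs)
```

In this toy corpus, "stock" and "stocks" share documents and stay in one class, while "police" and "policy" share a crude stem but never co-occur, so they end up in separate classes.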