A stop list for general text

Authors:
Christopher Fox
Affiliations:
-
Venue:
ACM SIGIR Forum
Year:
1989

Citing 2
Cited 41

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
Information Retrieval

Information Retrieval

Posting compression in dynamic retrieval environments

SIGIR '91 Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval
Incremental relevance feedback

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Trigrams as index element in full text retrieval: observations and experimental results

CSC '93 Proceedings of the 1993 ACM conference on Computer science
A document retrieval model based on term frequency ranks

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Collecting user access patterns for building user profiles and collaborative filtering

IUI '99 Proceedings of the 4th international conference on Intelligent user interfaces
Supporting classroom information management with SCOUT

ACM-SE 37 Proceedings of the 37th annual Southeast regional conference (CD-ROM)
The use of phrases from query texts in information retrieval (poster session)

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Hierarchical indexing and document matching in BoW

Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
A feature mining based approach for the classification of text documents into disjoint classes

Information Processing and Management: an International Journal
Cross-language information retrieval: experiments based on CLEF 2000 corpora

Information Processing and Management: an International Journal
SQL text parsing for information retrieval

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Improving Efficiency and Relevance Ranking in Information Retrieval

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Technical issues of cross-language information retrieval: a review

Information Processing and Management: an International Journal - Special issue: Cross-language information retrieval
Incorporating context in text analysis by interactive activation with competition artificial neural networks

Information Processing and Management: an International Journal
On the strength of hyperclique patterns for text categorization

Information Sciences: an International Journal
Searching strategies for the Hungarian language

Information Processing and Management: an International Journal
Document retrieval for question answering: a quantitative evaluation of text preprocessing

Proceedings of the ACM first Ph.D. workshop in CIKM
Current research issues and trends in non-English Web searching

Information Retrieval
Entropy-Based Static Index Pruning

ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Indexing and stemming approaches for the Czech language

Information Processing and Management: an International Journal
Indonesian-Japanese CLIR using only limited resource

CLIIR '06 Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?
When stopword lists make the difference

Journal of the American Society for Information Science and Technology
Re-ranking Documents Based on Query-Independent Document Specificity

FQAS '09 Proceedings of the 8th International Conference on Flexible Query Answering Systems
Static pruning of terms in inverted files

ECIR'07 Proceedings of the 29th European conference on IR research
Viewing term proximity from a different perspective

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

ACM Transactions on Asian Language Information Processing (TALIP)
Source code indexing for automated tracing

Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering
Accuracy of inter-researcher similarity measures based on topical and social clues

Scientometrics
A text-based decision support system for financial sequence prediction

Decision Support Systems
Query transitive translation using IR score for indonesian-japanese CLIR

AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
Multilevel legal ontologies

Semantic Processing of Legal Texts
A distributional semantics approach to simultaneous recognition of multiple classes of named entities

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
The influence of collocation segmentation and top 10 items to keyword assignment performance

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Authorship Attribution Based on Specific Vocabulary

ACM Transactions on Information Systems (TOIS)
Detecting weak signals for long-term business opportunities using text mining of Web news

Expert Systems with Applications: An International Journal
On the effect of stopword removal for SMS-Based FAQ retrieval

NLDB'12 Proceedings of the 17th international conference on Applications of Natural Language Processing and Information Systems
Translation techniques in cross-language information retrieval

ACM Computing Surveys (CSUR)
An empirical evaluation of stop word removal in statistical machine translation

EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
A user term visualization analysis based on a social question and answer log

Information Processing and Management: an International Journal
Semantic Approach to Web-Based Discovery of Unknowns to Enhance Intelligence Gathering

International Journal of Information Retrieval Research

Abstract

A stop list, or negative dictionary, is a device used in automatic indexing to filter out words that would make poor index terms. Traditionally, stop lists are supposed to have included only the most frequently occurring words. In practice, however, stop lists have tended to include infrequently occurring words, and have not included many frequently occurring words. Infrequently occurring words seem to have been included because stop list compilers have not, for whatever reason, consulted empirical studies of word frequencies. Frequently occurring words seem to have been left out for the same reason, and also because many of them might still be important as index terms.

This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.
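The construction steps in the abstract (threshold by frequency, cull useful index terms, add extra words) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `build_stop_list` and `filter_stop_words`, the regular-expression tokenizer, and the toy sample text are all assumptions made for illustration, and a plain set lookup stands in for the finite state machine based filter the abstract mentions. The last lines check the abstract's arithmetic: 278 - 32 + 26 + 149 = 421.

```python
from collections import Counter
import re


def build_stop_list(text, min_count, keep=(), extra=()):
    """Frequency-threshold stop list (illustrative sketch, not the paper's code).

    1. Take every token occurring more than `min_count` times.
    2. Cull the words in `keep` (judged too useful as index terms).
    3. Add the words in `extra` (added regardless of observed frequency).
    """
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    frequent = {word for word, count in counts.items() if count > min_count}
    return (frequent - set(keep)) | set(extra)


def filter_stop_words(tokens, stop_list):
    """Drop stop words using a set lookup (a stand-in for an FSM-based filter)."""
    return [t for t in tokens if t.lower() not in stop_list]


# Toy example; the words and threshold are hypothetical, not from the paper.
sample = "the cat and the dog and the bird saw the state of the state"
stop_list = build_stop_list(sample, min_count=1, keep={"state"}, extra={"of"})
index_terms = filter_stop_words(sample.split(), stop_list)

# Size bookkeeping reported in the abstract.
initial, culled, added_by_belief, added_for_free = 278, 32, 26, 149
final_size = initial - culled + added_by_belief + added_for_free
```

On the toy text, "state" passes the frequency threshold but is culled from the stop list and so survives filtering, which mirrors the abstract's point that some frequent words are still valuable index terms.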