The design, implementation, and use of the Ngram statistics package

Authors:
Satanjeev Banerjee; Ted Pedersen
Affiliations:
Carnegie Mellon University, Pittsburgh, PA; University of Minnesota, Duluth, MN
Venue:
CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing
Year:
2003

Abstract

The Ngram Statistics Package (NSP) is a flexible and easy-to-use software tool that supports the identification and analysis of Ngrams, that is, sequences of N tokens in online text. We designed and implemented NSP to be easy to customize for particular problems while remaining general enough to serve a broad range of needs. This paper introduces NSP, raises some general issues in Ngram analysis, and summarizes several applications in which NSP has been successfully employed. NSP is written in Perl and is freely available under the GNU General Public License.
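
To make the notion of an Ngram concrete, the following is a minimal Python sketch of counting Ngrams in a token sequence and scoring a bigram with pointwise mutual information. It is not NSP's code or interface (NSP itself is written in Perl); the function names `count_ngrams` and `pmi`, the whitespace tokenization, and the choice of pointwise mutual information as the association measure are all assumptions made for illustration.

```python
import math
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper, not NSP's API)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pmi(bigram, bigram_counts):
    """Pointwise mutual information (log base 2) of one bigram.

    Marginals are taken over bigram positions: how often the first word
    starts a bigram and how often the second word ends one.
    """
    w1, w2 = bigram
    n11 = bigram_counts[bigram]
    if n11 == 0:
        # The bigram never occurs, so its PMI is negative infinity.
        return float("-inf")
    total = sum(bigram_counts.values())
    n1p = sum(c for (a, _), c in bigram_counts.items() if a == w1)
    np1 = sum(c for (_, b), c in bigram_counts.items() if b == w2)
    return math.log2(n11 * total / (n1p * np1))


# Illustrative input: naive whitespace tokenization of a toy sentence.
tokens = "the cat sat on the mat".split()
bigrams = count_ngrams(tokens, 2)
for gram, count in sorted(bigrams.items()):
    print(" ".join(gram), count, round(pmi(gram, bigrams), 4))
```

On a toy input like this every bigram occurs once, so the scores differ only through the marginal counts; a realistic analysis would need a much larger corpus and, usually, a frequency cutoff before ranking.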