A maximum entropy approach to identifying sentence boundaries

Authors:
Jeffrey C. Reynar; Adwait Ratnaparkhi
Affiliations:
University of Pennsylvania, Philadelphia, Pennsylvania; University of Pennsylvania, Philadelphia, Pennsylvania
Venue:
ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Year:
1997

Abstract

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.
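
As a rough illustration of the setup the abstract describes, the sketch below treats every ., ?, and ! as a candidate and trains a binary maximum entropy classifier (equivalent to logistic regression) to label each one as a boundary or not. It is a minimal sketch under stated assumptions, not the authors' system: the feature set in `features`, the toy corpus `TRAINING_SENTENCES`, the plain gradient-ascent training loop, and every function name are hypothetical choices made for illustration, and tokens are assumed to be separated by single spaces.

```python
import math
import re
from collections import defaultdict

CANDIDATE = re.compile(r"[.?!]")


def candidates(text):
    """Offsets of every '.', '?' and '!' in text (the potential boundaries)."""
    return [m.start() for m in CANDIDATE.finditer(text)]


def features(text, i):
    """Hypothetical indicator features for the candidate at offset i."""
    start = text.rfind(" ", 0, i) + 1      # start of the token holding the candidate
    end = text.find(" ", i)                # end of that token
    if end == -1:
        end = len(text)
    prefix, suffix = text[start:i], text[i + 1:end]
    following = text[end:].split()
    nxt = following[0] if following else ""
    return {
        "bias",
        "punct=" + text[i],
        "prefix=" + prefix.lower(),
        "has_suffix=" + str(bool(suffix)),
        "next_cap=" + str(nxt[:1].isupper()),
        "at_end=" + str(not nxt),
    }


def prob(weights, feats):
    """P(boundary | context) under the binary maximum entropy model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))


def labelled_examples(sentences):
    """Join gold sentences with spaces and label each candidate 1 or 0."""
    text = " ".join(sentences)
    gold, pos = set(), 0
    for s in sentences:
        pos += len(s)
        gold.add(pos - 1)   # offset of the sentence-final character
        pos += 1            # the joining space
    return [(features(text, i), int(i in gold)) for i in candidates(text)]


def train(examples, epochs=500, lr=0.2):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, y in examples:
            err = y - prob(weights, feats)
            for f in feats:
                weights[f] += lr * err
    return weights


def split_sentences(text, weights, threshold=0.5):
    """Cut text after every candidate the model accepts as a boundary."""
    sentences, start = [], 0
    for i in candidates(text):
        if prob(weights, features(text, i)) > threshold:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


# Toy annotated corpus (invented for this sketch): one sentence per entry.
TRAINING_SENTENCES = [
    "Dr. Smith arrived at 3 p.m. on Monday.",
    "He met Mr. Jones.",
    "Was the meeting long?",
    "No!",
    "Mrs. Lee paid $4.50 for it.",
    "Dr. Brown left early.",
]

if __name__ == "__main__":
    model = train(labelled_examples(TRAINING_SENTENCES))
    print(split_sentences("Dr. Smith met Mr. Jones. Was the meeting long?", model))
```

Retraining for a new genre or language would, in this sketch, amount to calling `train` on examples built from a different annotated corpus; nothing in the code is hand-tuned to a list of abbreviations. The toy corpus is far too small to generalize, so the example should be read as a demonstration of the classification setup rather than of the reported performance.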