Unsupervised Multilingual Sentence Boundary Detection

Authors: Tibor Kiss; Jan Strunk
Affiliations: -; -
Venue: Computational Linguistics
Year: 2006

Abstract

In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely, the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
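
The three context-independent criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`dunning_llr`, `abbreviation_scores`), the whitespace tokenization, the one-sided guard, and the exact form of the length and internal-period factors are assumptions made for illustration. The collocation strength between a candidate type and a following period is approximated here with Dunning's log-likelihood ratio, then scaled down for long candidates and up for candidates with internal periods.

```python
import math
from collections import Counter


def _log_l(k: int, n: int, p: float) -> float:
    """Binomial log-likelihood of k successes in n trials (coefficient omitted)."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # clamp to avoid log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)


def dunning_llr(c_type: int, c_period: int, c_both: int, n: int) -> float:
    """Log-likelihood ratio for 'type is followed by a period'.

    Assumption: returns 0.0 when the type attracts periods no more often
    than chance, so only positive associations are scored.
    """
    if n == c_type:
        return 0.0
    p = c_period / n                              # overall period rate
    p1 = c_both / c_type                          # period rate after this type
    p2 = (c_period - c_both) / (n - c_type)       # period rate elsewhere
    if p1 <= p:
        return 0.0
    rest_k, rest_n = c_period - c_both, n - c_type
    return 2 * (_log_l(c_both, c_type, p1) + _log_l(rest_k, rest_n, p2)
                - _log_l(c_both, c_type, p) - _log_l(rest_k, rest_n, p))


def abbreviation_scores(tokens: list[str]) -> dict[str, float]:
    """Score every type that occurs with a final period at least once."""
    with_period, without_period = Counter(), Counter()
    for tok in tokens:
        if len(tok) > 1 and tok.endswith("."):
            with_period[tok[:-1].lower()] += 1
        else:
            without_period[tok.lower()] += 1
    n, n_periods = len(tokens), sum(with_period.values())
    scores = {}
    for cand, c_both in with_period.items():
        c_type = c_both + without_period[cand]
        llr = dunning_llr(c_type, n_periods, c_both, n)   # criterion 1: tight collocation
        length = len(cand.replace(".", ""))               # criterion 2: abbreviations are short
        internal = cand.count(".")                        # criterion 3: internal periods
        scores[cand] = llr * math.exp(-length) * (internal + 1)
    return scores
```

On a toy corpus in which `dr.` always carries a period and `home` only rarely does, `abbreviation_scores` ranks `dr` well above `home`. A decision threshold on the score would still have to be chosen and is left out of this sketch.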