Machine learning for high-quality tokenization: replicating variable tokenization schemes

  • Authors:
  • Murhaf Fares, Institutt for Informatikk, Universitetet i Oslo, Norway
  • Stephan Oepen, Institutt for Informatikk, Universitetet i Oslo, Norway
  • Yi Zhang, LT-Lab, German Research Center for Artificial Intelligence, Germany

  • Venue:
  • CICLing '13: Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing, Part I
  • Year:
  • 2013

Abstract

In this work, we investigate the use of sequence labeling techniques for tokenization, arguably the most foundational task in NLP, which has traditionally been approached through heuristic finite-state rules. Observing variation in tokenization conventions across corpora and processing tasks, we train and test multiple binary CRF sequence labelers and obtain substantial reductions in tokenization error rate over standard off-the-shelf tools. From a domain adaptation perspective, we experimentally determine the effects of training on mixed gold-standard data sets and make a tentative recommendation for practical usage. Furthermore, we present a perspective on this work as a feedback mechanism for resource creation, i.e., error detection in annotated corpora. To investigate the limits of our approach, we study an interpretation of the tokenization problem that contrasts starkly with 'classic' schemes, presenting many more token-level ambiguities to the sequence labeler (reflecting the use of punctuation and multi-word lexical units). In this setup, we also examine partial disambiguation by presenting a token lattice to downstream processing.
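The core idea, recasting tokenization as character-level binary sequence labeling rather than rule application, is compact enough to sketch. The following is a minimal, hypothetical illustration using the third-party sklearn-crfsuite package; the "B"/"I" label scheme, the feature set, and the toy PTB-style example are assumptions for illustration, not the labels or features used in the paper.

    # A minimal sketch of tokenization as binary sequence labeling with a CRF,
    # assuming the third-party sklearn-crfsuite package. Label names ("B"/"I"),
    # features, and the toy example are illustrative, not taken from the paper.
    import sklearn_crfsuite

    def char_features(text, i):
        # Local character-window features for position i.
        ch = text[i]
        return {
            "char": ch,
            "is_alnum": ch.isalnum(),
            "is_space": ch.isspace(),
            "prev": text[i - 1] if i > 0 else "<BOS>",
            "next": text[i + 1] if i + 1 < len(text) else "<EOS>",
        }

    def to_instance(text, token_spans):
        # One labeled sequence: a character gets "B" if it opens a token,
        # "I" otherwise, so token boundaries become a binary labeling task.
        starts = {s for s, _ in token_spans}
        xs = [char_features(text, i) for i in range(len(text))]
        ys = ["B" if i in starts else "I" for i in range(len(text))]
        return xs, ys

    # Toy gold standard under a PTB-like convention: "Don't" -> "Do" + "n't".
    text = "Don't stop."
    spans = [(0, 2), (2, 5), (6, 10), (10, 11)]
    xs, ys = to_instance(text, spans)

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit([xs], [ys])            # train on a list of labeled sequences
    print(crf.predict([xs])[0])    # per-character "B"/"I" predictions

Framing boundary decisions as learned labels rather than fixed rules is what makes the scheme "variable": retraining on gold data segmented under a different convention yields a tokenizer for that convention without any rule changes.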