ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
In this work, we investigate sequence labeling techniques for tokenization, arguably the most foundational task in NLP, which has traditionally been approached through heuristic finite-state rules. Observing variation in tokenization conventions across corpora and processing tasks, we train and test multiple CRF binary sequence labelers and obtain substantial reductions in tokenization error rate over standard off-the-shelf tools. From a domain adaptation perspective, we experimentally determine the effects of training on mixed gold-standard data sets and make a tentative recommendation for practical usage. Furthermore, we present a perspective on this work as a feedback mechanism for resource creation, i.e., error detection in annotated corpora. To probe the limits of our approach, we study an interpretation of the tokenization problem that contrasts starkly with 'classic' schemes, presenting many more token-level ambiguities to the sequence labeler (reflecting the use of punctuation and multi-word lexical units). In this setup, we also explore partial disambiguation by presenting a token lattice to downstream processing.
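The core framing described above (tokenization as binary sequence labeling) can be illustrated with a minimal sketch. The encoding below is an assumption for illustration, not the paper's exact scheme: each character of the raw string receives a label of 1 if a token begins at that position and 0 otherwise; a trained CRF would predict these labels from character-level features, and the labels can be deterministically decoded back into a tokenization. The helper names (`encode_boundaries`, `decode_tokens`) and the example sentence are hypothetical.

```python
# Sketch: tokenization framed as per-character binary boundary labeling.
# Assumption: label[i] = 1 iff a gold token starts at character i of the
# raw text; whitespace between tokens is discarded during decoding.

def encode_boundaries(raw, tokens):
    """Derive gold boundary labels (one per character of `raw`)
    from a gold tokenization. These would serve as CRF training targets."""
    labels = [0] * len(raw)
    pos = 0
    for tok in tokens:
        pos = raw.index(tok, pos)  # locate the next token in the raw string
        labels[pos] = 1            # a token starts at this character
        pos += len(tok)
    return labels

def decode_tokens(raw, labels):
    """Invert the encoding: cut `raw` at predicted boundary positions
    and drop whitespace-only pieces."""
    cuts = [i for i, b in enumerate(labels) if b] + [len(raw)]
    pieces = [raw[a:b].strip() for a, b in zip(cuts, cuts[1:])]
    return [p for p in pieces if p]

# Round-trip check on a PTB-style split of a contraction ("Do" + "n't"),
# the kind of convention-dependent decision a learned tokenizer captures.
raw = "Don't panic, Arthur."
gold = ["Do", "n't", "panic", ",", "Arthur", "."]
labels = encode_boundaries(raw, gold)
assert decode_tokens(raw, labels) == gold
```

Note that nothing in this encoding hard-codes a particular convention: training on corpora with different gold tokenizations simply yields different label sequences for the same raw text, which is what makes the mixed-training and domain-adaptation experiments in the abstract possible.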