Formal Methods of Tokenization for Part-of-Speech Tagging

Authors:
Jorge Graña; Francisco-Mario Barcala; Jesús Vilares Ferro
Venue:
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2002

Citing 4

Document centered approach to text normalization
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

Tagging sentence boundaries
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference

A knowledge-free method for capitalized word disambiguation
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics

Cited 6

Practical NLP-Based Text Indexing
IBERAMIA 2002 Proceedings of the 8th Ibero-American Conference on AI: Advances in Artificial Intelligence

Extraction of complex index terms in non-English IR: A shallow parsing based approach
Information Processing and Management: an International Journal

Current research issues and trends in non-English Web searching
Information Retrieval

XML rules for enclitic segmentation
EUROCAST'07 Proceedings of the 11th international conference on Computer aided systems theory

Towards the automatic learning of idiomatic prepositional phrases
MICAI'05 Proceedings of the 4th Mexican international conference on Advances in Artificial Intelligence

COLE experiments at QA@CLEF 2004 Spanish monolingual track
CLEF'04 Proceedings of the 5th conference on Cross-Language Evaluation Forum: multilingual Information Access for Text, Speech and Images

Abstract

One of the most important preliminary tasks for robust part-of-speech tagging is the correct tokenization or segmentation of the text. This task can involve processes that are much more complex than simply identifying the sentences in the text and their individual components, yet it is often overlooked in current applications. Nevertheless, this preprocessing step is indispensable in practice, and it is particularly difficult to tackle with scientific precision without repeatedly falling into a case-by-case analysis of every phenomenon detected. In this work, we have developed a preprocessing scheme oriented towards the disambiguation and robust tagging of Galician. The proposal is nonetheless a general architecture that can be applied to other languages, such as Spanish, with very slight modifications.