A lazy man's way to part-of-speech tagging

Authors:
Norshuhani Zamin;Alan Oxley;Zainab Abu Bakar;Syed Ahmad Farhan
Affiliations:
Faculty of Science and Information Technology, Universiti Teknologi Mara, Shah Alam, Selangor, Malaysia;Faculty of Science and Information Technology, Universiti Teknologi Mara, Shah Alam, Selangor, Malaysia;Faculty of Computer and Mathematical Sciences, Universiti Teknologi Mara, Shah Alam, Selangor, Malaysia;Faculty of Engineering, Universiti Teknologi PETRONAS, Tronoh, Perak, Malaysia
Venue:
PKAW'12 Proceedings of the 12th Pacific Rim conference on Knowledge Management and Acquisition for Intelligent Systems
Year:
2012

Abstract

A statistical approach to word alignment that automatically projects part-of-speech (POS) tags is presented. The approach is called the "lazy man's way" because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised method combines the N-gram and Dice Coefficient similarity functions to align English texts with Malay texts, thereby projecting POS tags from English to Malay. It is quick and avoids the laborious manual annotation of the Malay dataset. A case study on 25 terrorism news articles written in Malay shows that leveraging existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus of 5413 word tokens and achieved 86.87% precision, 72.56% recall and a 79.07% F1-score. These results indicate that the "lazy man's way", in which a resource-poor language exploits the rich linguistic information available in English, increases bitext projection accuracy significantly.
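
The idea of combining N-grams with the Dice Coefficient to project tags can be illustrated with a minimal sketch. The abstract does not give the N-gram size, the similarity threshold, or the exact alignment procedure, so the character bigrams, the 0.5 threshold, the greedy best-match loop, the function names, and the toy sentence pair below are all assumptions made for illustration and are not the authors' implementation. The last helper only checks that the reported F1-score follows from the reported precision and recall.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Multiset of character n-grams of a lowercased word (n=2 is an assumption)."""
    word = word.lower()
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def dice(a, b, n=2):
    """Dice Coefficient 2|X ∩ Y| / (|X| + |Y|) over character n-gram multisets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(x.values()) + sum(y.values())
    if total == 0:  # both words shorter than n characters
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def project_tags(tagged_english, malay_tokens, threshold=0.5):
    """Give each Malay token the tag of its most similar English token.

    Hypothetical greedy alignment: tokens whose best score falls below
    the (assumed) threshold stay untagged (None).
    """
    projected = []
    for malay in malay_tokens:
        best_tag, best_score = None, 0.0
        for english, tag in tagged_english:
            score = dice(english, malay)
            if score > best_score:
                best_tag, best_score = tag, score
        projected.append((malay, best_tag if best_score >= threshold else None))
    return projected


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy example (invented data, not from the paper's corpus).
english = [("police", "NN"), ("found", "VBD"), ("bomb", "NN")]
malay = ["polis", "menemui", "bom"]
print(project_tags(english, malay))
print(round(f1_score(0.8687, 0.7256), 4))
```

In this toy run only the orthographically similar pairs ("polis"/"police", "bom"/"bomb") receive a projected tag, while "menemui" shares no bigram with any English token and stays untagged. Applying `f1_score` to the reported precision and recall gives approximately 0.7907, consistent with the 79.07% F1-score stated in the abstract.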