As the title suggests, our paper deals with web discussion fora, whose content can be considered a special type of comparable corpus. We discuss the potential of this vast amount of data, now available on the World Wide Web for nearly every language and covering both general, common topics and the most obscure and specific ones. To illustrate our ideas, we propose a case study of seven wedding discussion fora in five languages.