Bootstrapped language identification for multi-site internet domains
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
The seventeenth australasian document computing symposium
ACM SIGIR Forum
Language identification is the task of automatically determining the language in which a previously unseen document was written. We compared several prior methods on samples from the Wikipedia and EuroParl collections. Most of these methods work well. However, we find that these collections (and presumably other document collections) are heterogeneous in document size, and that short documents are systematically different from long ones. Techniques that work well on long documents differ from those that work well on short ones. We believe that algorithms will improve if document length is taken into account.
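One way to make the length effect visible is to report accuracy separately for short and long documents rather than as a single figure. The following is a minimal sketch of that idea, not the evaluation procedure used in the paper. The function name accuracy_by_length, the (text, gold, predicted) record layout, and the 100-character threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def accuracy_by_length(records, threshold=100):
    """Return identification accuracy per length bucket.

    records: iterable of (text, gold_label, predicted_label) tuples.
    threshold: hypothetical cut-off, in characters, separating
        "short" documents from "long" ones (an assumption, not a
        value taken from the paper).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, gold, predicted in records:
        # Assign each document to a bucket by its character length.
        bucket = "short" if len(text) < threshold else "long"
        total[bucket] += 1
        if gold == predicted:
            correct[bucket] += 1
    # Only buckets that received at least one document are reported.
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```

Running the same function over the output of each method under comparison would show whether a method that does well overall also does well on the short bucket, which is the kind of difference the abstract describes.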