Since the Web is by far the largest public repository of natural language texts, recent experiments, methods, and tools in corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, this raises a problem: Web documents contain a non-negligible number of orthographic and grammatical errors. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, we develop methods for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.