Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Authors:
Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov
Affiliations:
-; CIS, University of Munich (Funded by German Research Foundation (DFG)); Bulgarian Academy of Science, Sofia (Funded by VolkswagenStiftung)
Venue:
Computational Linguistics
Year:
2006

Abstract

Because the Web is by far the largest public repository of natural language text, recent experiments, methods, and tools in corpus linguistics often use the Web as a corpus. Applications that require high accuracy must contend with the non-negligible number of orthographic and grammatical errors in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, we develop methods for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases that are automatically retrieved from the Web.