This paper addresses the issue of 'email data cleaning' for text mining. Many text mining applications need to take emails as input. Email data is usually noisy, so it must be cleaned before mining. Several products offer email cleaning features; however, the types of noise they can eliminate are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community, and a thorough and systematic investigation of the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way, email cleaning becomes independent of any specific text mining process. A cascaded approach is proposed, which cleans up an email in four passes: non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing the tasks on the basis of Support Vector Machines (SVM) are also proposed, and the features used in the models are defined. Experimental results indicate that the proposed SVM-based methods can significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction can be significantly improved by using the data cleaning method.
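
The four-pass cascade described above can be illustrated with a minimal sketch. This is not the paper's method: the abstract describes SVM-based models with defined features, whereas the sketch below substitutes simple hand-written rules for each pass, purely to show how the passes chain together. All function names, the `--` signature delimiter, the `>` quote marker, and the specific normalization rules are assumptions made for illustration.

```python
import re


def filter_non_text(lines):
    """Pass 1 (non-text filtering): drop quoted replies and the signature.

    Assumption: quoted lines start with '>' and a line containing only
    '--' starts the signature. The paper uses SVM classifiers instead.
    """
    kept = []
    for line in lines:
        if line.strip() == "--":
            break  # everything after the delimiter is treated as signature
        if line.lstrip().startswith(">"):
            continue  # skip quoted reply text
        kept.append(line)
    return kept


def normalize_paragraphs(lines):
    """Pass 2 (paragraph normalization): join hard-wrapped lines.

    Assumption: a blank line marks a paragraph boundary.
    """
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def normalize_sentences(paragraph):
    """Pass 3 (sentence normalization): split sentences, fix initial case."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p[0].upper() + p[1:] for p in parts if p]


def normalize_words(sentence):
    """Pass 4 (word normalization): collapse repeated '!' or '?' marks."""
    return re.sub(r"([!?])\1+", r"\1", sentence)


def clean_email(raw):
    """Run the four passes in order and return one string per paragraph."""
    lines = filter_non_text(raw.splitlines())
    cleaned = []
    for paragraph in normalize_paragraphs(lines):
        sentences = [normalize_words(s) for s in normalize_sentences(paragraph)]
        cleaned.append(" ".join(sentences))
    return cleaned
```

Because each pass consumes only the output of the previous one, any single rule-based pass in this sketch could in principle be swapped for a trained classifier without touching the others, which is the property the cascaded formulation relies on.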