CUISINE: Classification using stylistic feature sets and/or name-based feature sets

Authors:
Yaakov HaCohen-Kerner;Hananya Beck;Elchai Yehudai;Mordechay Rosenstein;Dror Mughaz
Affiliations:
Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi Street, P.O.B. 16031, 91160 Jerusalem, Israel;Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi Street, P.O.B. 16031, 91160 Jerusalem, Israel;Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi Street, P.O.B. 16031, 91160 Jerusalem, Israel;Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi Street, P.O.B. 16031, 91160 Jerusalem, Israel;Department of Computer Science, Bar-Ilan University, 52900 Ramat-Gan, Israel and Department of Computer Science, Jerusalem College of Technology (Machon Lev)
Venue:
Journal of the American Society for Information Science and Technology
Year:
2010

Abstract

Document classification presents challenges due to the large number of features, their dependencies, and the large number of training documents. In this research, we investigated the use of six stylistic feature sets (42 features in total) and/or six name-based feature sets (234 features in total) for various combinations of the following classification tasks: the ethnic group of the author, the period in which the document was written, and the place where it was written. The investigated corpus contains Jewish Law articles written in Hebrew–Aramaic, which present interesting problems for classification. Our system, CUISINE (Classification UsIng Stylistic feature sets and/or NamE-based feature sets), achieves accuracy between 90.71% and 98.99% on the seven classification experiments (ethnicity, time, place, ethnicity&time, ethnicity&place, time&place, ethnicity&time&place). For the first six tasks, the stylistic feature sets in general, and the quantitative feature set in particular, suffice for excellent classification results. In contrast, the name-based feature sets perform rather poorly on these tasks. However, for the most complex task (ethnicity&time&place), a hill-climbing model using all feature sets significantly improves the classification results. Most of the stylistic features (34 of 42) are language-independent and domain-independent. These features might be useful to the community at large, at least for rather simple tasks. © 2010 Wiley Periodicals, Inc.
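
The abstract does not describe how the hill-climbing model over feature sets is built, so the following is only a minimal sketch of one common form of hill climbing, not the authors' procedure. It assumes that a candidate solution is a subset of feature sets, that neighbouring candidates differ by adding or removing exactly one feature set, and that a caller-supplied function returns a classification accuracy for any subset. The function name hill_climb_feature_sets, the feature-set names, and the accuracy values in toy_accuracy are all hypothetical placeholders chosen for illustration.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def hill_climb_feature_sets(
    set_names: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy hill climbing over subsets of feature sets (illustrative only).

    `evaluate` is assumed to return an accuracy for a non-empty subset,
    e.g. from cross-validating a classifier restricted to those sets.
    """
    names = list(set_names)
    current: FrozenSet[str] = frozenset()
    current_score = float("-inf")

    while True:
        # Neighbours toggle exactly one feature set in or out of the selection.
        neighbours = [current ^ {name} for name in names]
        neighbours = [subset for subset in neighbours if subset]  # skip empty
        if not neighbours:
            return current, current_score

        scored = [(evaluate(subset), subset) for subset in neighbours]
        best_score, best = max(scored, key=lambda pair: pair[0])

        # Stop at a local optimum: no neighbour is strictly better.
        if best_score <= current_score:
            return current, current_score
        current, current_score = best, best_score


def toy_accuracy(selected: FrozenSet[str]) -> float:
    """Made-up additive scores standing in for a real classifier evaluation."""
    gain = {
        "quantitative": 0.90,   # hypothetical strong stylistic set
        "stylistic_b": 0.03,    # hypothetical second stylistic set
        "name_based_a": 0.01,   # hypothetical name-based set, small gain
        "noisy_set": -0.02,     # hypothetical set that hurts accuracy
    }
    return round(sum(gain[name] for name in selected), 4)


best_sets, best_score = hill_climb_feature_sets(
    ["quantitative", "stylistic_b", "name_based_a", "noisy_set"],
    toy_accuracy,
)
print(sorted(best_sets), best_score)
```

In this toy run the search keeps every set that raises the score and leaves out the one that lowers it. With a real evaluation function the search can stop at a local optimum, which is a known limitation of hill climbing in general.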