We present two measures for comparing corpora: one based on information-theoretic statistics such as gain ratio, the other on simple term-class frequency counts. We tested the predictions these measures make about corpus difficulty in two domains (news and molecular biology) against the results of two widely used paradigms for named entity (NE) recognition, decision trees and hidden Markov models (HMMs), and found that gain ratio was the more reliable predictor.
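For readers unfamiliar with gain ratio, the following is a minimal sketch of the standard decision-tree definition: information gain of a feature with respect to the class labels, divided by the entropy of the feature itself (the split information). The abstract does not say how the statistic is applied to whole corpora, so the function names, the toy feature values ("cap", "lower") and the class labels ("NE", "O") are illustrative assumptions and not the authors' setup.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    total = len(items)
    counts = Counter(items)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gain_ratio(features, classes):
    """Gain ratio of one feature with respect to the class labels.

    gain ratio = (H(class) - H(class | feature)) / H(feature)
    Returns 0.0 when the feature takes a single value (split info is 0).
    """
    total = len(classes)

    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for feature_value, label in zip(features, classes):
        groups.setdefault(feature_value, []).append(label)

    # Conditional entropy: group entropies weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(classes) - conditional

    split_info = entropy(features)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: a capitalisation feature and NE / non-NE labels.
informative = gain_ratio(["cap", "cap", "lower", "lower"],
                         ["NE", "NE", "O", "O"])
uninformative = gain_ratio(["cap", "lower", "cap", "lower"],
                           ["NE", "NE", "O", "O"])
```

In this toy case the first feature separates the classes perfectly and the second carries no information about them, so a higher gain ratio marks the feature that makes the labels easier to predict.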