In this paper we propose an imbalanced classification algorithm to extract informative images from web news pages. Our algorithm addresses this difficult problem in two ways. First, we limit our dataset to a specific application area so that the patterns of informative images can be captured by existing classification algorithms. Second, we propose an automatic negative-sample filtering algorithm that eliminates most negative samples, so that the classification training data is rebalanced. Because most classification algorithms perform worse on imbalanced training data, this rebalancing improves overall performance significantly. In addition, our approach is inherently robust to new web sites and to style or layout changes of existing web sites.
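The rebalancing idea can be illustrated with a minimal sketch. The abstract does not say which features or rules the filtering step uses, so everything below is an assumption made for illustration: the `ImageCandidate` type, the size and aspect-ratio thresholds, and the function names are hypothetical and are not the authors' method. The sketch only shows how discarding obvious negatives before training raises the share of positive samples.

```python
from dataclasses import dataclass


@dataclass
class ImageCandidate:
    """One <img> found on a news page (hypothetical representation)."""
    url: str
    width: int
    height: int
    informative: bool  # ground-truth label, available only for training data


def is_obvious_negative(img, min_side=100, max_aspect=4.0):
    """Assumed rule, not taken from the paper: treat tiny or very
    elongated images (icons, spacers, banners) as negatives."""
    short, long_ = sorted((img.width, img.height))
    if short < min_side:
        return True
    return long_ / short > max_aspect


def filter_negatives(candidates):
    """Drop candidates flagged as obvious negatives; keep the rest."""
    return [c for c in candidates if not is_obvious_negative(c)]


def positive_ratio(candidates):
    """Fraction of candidates labeled informative (0.0 for an empty list)."""
    if not candidates:
        return 0.0
    return sum(c.informative for c in candidates) / len(candidates)


if __name__ == "__main__":
    # Made-up candidates from a single page, for illustration only.
    page = [
        ImageCandidate("photo.jpg", 640, 480, True),
        ImageCandidate("icon.png", 16, 16, False),
        ImageCandidate("banner.gif", 728, 90, False),
        ImageCandidate("spacer.gif", 1, 1, False),
        ImageCandidate("ad.jpg", 300, 250, False),
        ImageCandidate("logo.png", 120, 40, False),
    ]
    kept = filter_negatives(page)
    print(f"before: {len(page)} samples, positive ratio {positive_ratio(page):.2f}")
    print(f"after:  {len(kept)} samples, positive ratio {positive_ratio(kept):.2f}")
```

In this toy data the filter leaves one positive and one negative, so a classifier trained on the remaining samples sees a far less skewed class distribution. Negatives that pass the filter, such as the ad image here, are left for the classifier to separate.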