Probabilistic reasoning in intelligent systems: networks of plausible inference
Probabilistic reasoning in intelligent systems: networks of plausible inference
C4.5: programs for machine learning
C4.5: programs for machine learning
A brief survey of web data extraction tools
ACM SIGMOD Record
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
Fully automatic wrapper generation for search engines
WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Journal of Intelligent Information Systems
ViPER: augmenting automatic information extraction with visual perceptions
Proceedings of the 14th ACM international conference on Information and knowledge management
2D Conditional Random Fields for Web information extraction
ICML '05 Proceedings of the 22nd international conference on Machine learning
Machine Learning
Simultaneous record detection and attribute labeling in web data extraction
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
A Survey of Web Information Extraction Systems
IEEE Transactions on Knowledge and Data Engineering
Dynamic hierarchical Markov random fields and their application to web data extraction
Proceedings of the 24th international conference on Machine learning
Incorporating site-level knowledge to extract structured data from web forums
Proceedings of the 18th international conference on World wide web
Extracting article text from the web with maximum subsequence segmentation
Proceedings of the 18th international conference on World wide web
News article extraction with template-independent wrapper
Proceedings of the 18th international conference on World wide web
Web article extraction for web printing: a DOM+visual based approach
Proceedings of the 9th ACM symposium on Document engineering
Discriminative training of Markov logic networks
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
Template-independent news extraction based on visual consistency
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Passage extraction and result combination for genomics information retrieval
Journal of Intelligent Information Systems
Hi-index | 0.00 |
The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.