Extracting multiple news attributes based on visual features

Authors:
Wei Liu;Hualiang Yan;Jianguo Xiao
Affiliations:
Institute of Scientific and Technical Information of China, Beijing, China 100038;Institute of Computer Science & Technology, Peking University, Beijing, China 100871;Institute of Computer Science & Technology, Peking University, Beijing, China 100871
Venue:
Journal of Intelligent Information Systems
Year:
2012

Abstract

The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
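The two-step idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Block` type, the font-size threshold, the date pattern, and the layout score are all hypothetical choices made only to show how candidates chosen by independent features (step one) might be disambiguated by a dependent layout relationship (step two), here assumed to be "the publication date sits just below the title and is roughly left-aligned with it".

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Block:
    """A rendered text block (hypothetical type, not from the paper)."""
    text: str
    x: float          # left edge in pixels
    y: float          # top edge in pixels
    font_size: float  # rendered font size in pixels


# Assumed date format for illustration only; real pages vary widely.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def title_candidates(blocks, min_font=18.0, max_chars=120):
    # Step 1 (independent features): large font and short text.
    return [b for b in blocks
            if b.font_size >= min_font and len(b.text) <= max_chars]


def date_candidates(blocks):
    # Step 1 (independent features): text contains a date-like string.
    return [b for b in blocks if DATE_RE.search(b.text)]


def pick_title_and_date(blocks):
    # Step 2 (dependent features): score every (title, date) pair by an
    # assumed layout rule and keep the pair with the lowest score.
    best, best_score = None, float("inf")
    for t, d in product(title_candidates(blocks), date_candidates(blocks)):
        if t is d or d.y <= t.y:
            continue  # the date must be a different block below the title
        score = (d.y - t.y) + abs(d.x - t.x)
        if score < best_score:
            best, best_score = (t, d), score
    return best


page = [
    Block("Daily News", 0, 0, 30),                 # site logo, large font
    Block("Flood closes bridge", 100, 200, 24),    # article title
    Block("2012-03-04 by A. Writer", 100, 240, 12),
    Block("The bridge was closed on Sunday after heavy rain.", 100, 280, 12),
    Block("Copyright 2012-01-01", 100, 900, 10),   # footer with a date
]
result = pick_title_and_date(page)
```

In this toy page both the logo and the headline pass the title test, and both the byline and the footer pass the date test; only the layout relationship between the attributes separates the intended pair from the others, which is the point the abstract makes about dependent visual features.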