Statistical observations show that Web information about objects of the same type exhibits regular spatial distribution characteristics across different Web sites. In particular, the spatial distance between components within one object is always less than the distance between different objects. We present a novel method that extracts objects from the World Wide Web based on the spatial configuration of Web documents. The method is a fully automatic, bottom-up process of object extraction. It relies primarily on the distribution characteristics of Web information and is independent of the underlying document representation, such as HTML code. Experiments show that the proposed method works well even when the HTML structure differs greatly from the layout structure, and the results are encouraging.
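The spatial observation above (components of one object lie closer together than components of different objects) can be illustrated with a minimal sketch. This is not the method proposed in the paper, whose details the abstract does not give. It assumes that each rendered page element is available as an axis-aligned bounding box, and it swaps in plain single-linkage grouping with a fixed distance threshold. The names `Box`, `gap`, `group_boxes`, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered element: label plus bounding box in pixels."""
    label: str
    x: float
    y: float
    w: float
    h: float


def gap(a: Box, b: Box) -> float:
    """Euclidean distance between the nearest edges of two boxes (0 if they overlap)."""
    dx = max(0.0, max(a.x, b.x) - min(a.x + a.w, b.x + b.w))
    dy = max(0.0, max(a.y, b.y) - min(a.y + a.h, b.y + b.h))
    return (dx * dx + dy * dy) ** 0.5


def group_boxes(boxes: list[Box], threshold: float) -> list[list[str]]:
    """Bottom-up grouping: merge any two boxes whose gap is at most `threshold`."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if gap(boxes[i], boxes[j]) <= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box.label)
    return sorted(groups.values())


# Assumed layout: two product entries stacked vertically, with a 5 px gap
# inside each entry and a 40 px gap between entries.
page = [
    Box("title1", 0, 0, 200, 20),
    Box("price1", 0, 25, 80, 15),
    Box("title2", 0, 80, 200, 20),
    Box("price2", 0, 105, 80, 15),
]
objects = group_boxes(page, threshold=10)
```

With the assumed layout, the two entries come out as separate groups because the 40 px gap between them exceeds the threshold, while the 5 px gaps inside each entry do not. The grouping uses only box geometry, so it does not depend on how the elements are nested in the HTML.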