Extracting content structure for web pages based on visual representation

Authors:
Deng Cai;Shipeng Yu;Ji-Rong Wen;Wei-Ying Ma
Affiliations:
Microsoft Research Asia, Tsinghua University, Beijing, P. R. China;Microsoft Research Asia, Peking University, Beijing, P. R. China;Microsoft Research Asia;Microsoft Research Asia
Venue:
APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications
Year:
2003

Abstract

This paper proposes a new web content structure based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands the layout structure of a web page through visual perception. Compared with existing techniques, our approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure. Experiments show satisfactory results.
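
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a top-down, layout-driven segmentation could look like, not the authors' method. It assumes each rendered element is already available as a pixel bounding box (for example, from a browser's layout engine) and recursively splits the page wherever an empty band of whitespace separates groups of boxes, without consulting the tag tree. The names `Box`, `Block`, `split_by_gaps`, `segment`, and the `min_gap` threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical input: (left, top, right, bottom) of a rendered element, in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Block:
    """A node of the visual content structure: a region and its sub-regions."""
    boxes: List[Box]
    children: List["Block"] = field(default_factory=list)


def split_by_gaps(boxes: List[Box], axis: int, min_gap: int) -> List[List[Box]]:
    """Group boxes separated by empty bands of at least min_gap pixels.

    axis 0 scans left to right (vertical separators),
    axis 1 scans top to bottom (horizontal separators).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # farthest right/bottom edge seen so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])      # whitespace band found: start a new group
        else:
            groups[-1].append(box)    # overlaps or sits close: same group
        reach = max(reach, box[axis + 2])
    return groups


def segment(boxes: List[Box], min_gap: int = 20) -> Block:
    """Recursively split a region top-down using only visual gaps."""
    block = Block(boxes)
    if len(boxes) < 2:
        return block
    for axis in (1, 0):  # try horizontal separators first, then vertical ones
        groups = split_by_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            block.children = [segment(g, min_gap) for g in groups]
            break
    return block


# Illustrative page: a header, a sidebar, and two closely spaced paragraphs.
header = (0, 0, 800, 80)
sidebar = (0, 120, 180, 600)
para1 = (220, 120, 800, 300)
para2 = (220, 310, 800, 600)
root = segment([header, sidebar, para1, para2])
```

On this sample page the first split separates the header from the rest, the second separates the sidebar from the main column, and the two paragraphs stay in one block because the 10-pixel gap between them is below `min_gap`. A real system would need richer visual cues (fonts, colors, separator lines) than whitespace alone.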