CETR: content extraction via tag ratios

Authors:
Tim Weninger;William H. Hsu;Jiawei Han
Affiliations:
University of Illinois, Urbana, IL, USA;Kansas State University, Manhattan, KS, USA;University of Illinois, Urbana, IL, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

We present Content Extraction via Tag Ratios (CETR), a method for extracting content text from diverse webpages using the tag ratios of the HTML document. We describe how to compute tag ratios line by line and then cluster the resulting histogram into content and non-content areas. We first find that the tag ratio histogram is not easily clustered because it is one-dimensional, so we extend the original approach to model the data in two dimensions. We then present a tailored clustering technique that operates on the two-dimensional model, and evaluate our approach against a large set of alternative methods using standard accuracy, precision, and recall metrics on a large and varied Web corpus. Finally, we show that in most cases CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages, and styles.