Extracting article text from the web with maximum subsequence segmentation

Authors:
Jeff Pasternack; Dan Roth
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 18th international conference on World wide web
Year:
2009



Abstract

Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1-score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task with surprisingly good cross-task performance cleaning general web pages, exceeding the top "text-only" score (based on Levenshtein distance), 87.8% versus 84.1%.
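
The core idea named in the abstract, choosing one contiguous run of tokens that maximizes the sum of per-token classifier scores, can be illustrated with a minimal sketch. The abstract does not give the scoring details, so everything specific here is an assumption made for illustration: the function names `max_subsequence` and `extract_article`, the conversion of each token's predicted probability into a score by subtracting a threshold of 0.5, and the example probabilities. The scan itself is the standard single-pass maximum-sum contiguous subsequence algorithm, which runs in linear time.

```python
from typing import List, Sequence, Tuple


def max_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive. If no run has a positive sum, the empty run
    (0, 0, 0.0) is returned.
    """
    best_sum = 0.0
    best_start = best_end = 0
    cur_sum = 0.0
    cur_start = 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running total can never help; restart here.
            cur_sum = s
            cur_start = i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i + 1
    return best_start, best_end, best_sum


def extract_article(tokens: List[str], probs: List[float],
                    threshold: float = 0.5) -> List[str]:
    """Keep the token span whose summed (probability - threshold) is maximal.

    `probs` stands in for the output of a token-level local classifier;
    the 0.5 threshold is an illustrative assumption, not a value taken
    from the abstract.
    """
    scores = [p - threshold for p in probs]
    start, end, _ = max_subsequence(scores)
    return tokens[start:end]


if __name__ == "__main__":
    # Hypothetical tokens and classifier probabilities.
    tokens = ["Home", "Login", "The", "quick", "fox", "ad", "jumps", "Share"]
    probs = [0.1, 0.2, 0.9, 0.8, 0.9, 0.3, 0.9, 0.1]
    print(extract_article(tokens, probs))
```

In this toy input the token "ad" has a low local score but is still kept, because dropping it would split a span whose total score is higher when left whole. That is the sense in which the segmentation is a global optimization over local predictions rather than a per-token decision.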