Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1-score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task with surprisingly good cross-task performance cleaning general web pages, exceeding the top "text-only" score (based on Levenshtein distance), 87.8% versus 84.1%.
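The core idea of a global optimization over token-level scores can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a local classifier has already assigned each token a probability of belonging to the article, and that each probability is turned into a signed score by subtracting 0.5 (an illustrative choice, not stated above). The function names `max_subsequence` and `extract_article` and the sample data are hypothetical. The single pass over the scores shows why prediction can run in linear time.

```python
from typing import List, Sequence, Tuple


def max_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Find the contiguous run scores[start:end] with the largest sum.

    Returns (start, end, total). If no run has a positive sum, the empty
    run (0, 0, 0.0) is returned. Runs in O(n) time (Kadane's algorithm).
    """
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        if cur_sum <= 0.0:
            # A non-positive prefix can never help; restart the run here.
            cur_start, cur_sum = i, score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


def extract_article(tokens: List[str], probs: List[float]) -> List[str]:
    """Pick the token span that maximizes the summed signed scores.

    Assumption: probs[i] is a local classifier's estimate that tokens[i]
    is article text; subtracting 0.5 makes likely-article tokens positive
    and likely-boilerplate tokens negative.
    """
    scores = [p - 0.5 for p in probs]
    start, end, _ = max_subsequence(scores)
    return tokens[start:end]


if __name__ == "__main__":
    # Hypothetical page: two navigation tokens, four article tokens, one ad.
    tokens = ["Home", "Login", "Storm", "hits", "coast", "today", "Ads"]
    probs = [0.1, 0.2, 0.9, 0.4, 0.8, 0.9, 0.1]
    print(extract_article(tokens, probs))
    # prints ['Storm', 'hits', 'coast', 'today']
```

Note that the token "hits" is kept even though its local probability is below 0.5: the global optimum tolerates an occasional low-scoring token inside an otherwise high-scoring span, which is the advantage of optimizing over the whole sequence rather than thresholding each token independently.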