Automatic web news extraction using tree edit distance

Authors:
D. C. Reis; P. B. Golgher; A. S. Silva; A. F. Laender
Affiliations:
Federal University of Minas Gerais, Belo Horizonte, Brazil; Akwan Information Technologies, Belo Horizonte, Brazil; Federal University of Amazonas, Manaus, Brazil; Federal University of Minas Gerais, Belo Horizonte, Brazil
Venue:
Proceedings of the 13th international conference on World Wide Web
Year:
2004

Abstract

The Web is the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results.

In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
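The abstract does not spell out the tree structure analysis, so the following is only an illustrative sketch of the general idea named in the title: measuring how far apart two page trees are by edit distance. It assumes ordered, labeled trees, unit costs, and a simple top-down variant in which inserting or deleting a node inserts or deletes its whole subtree. The names `Node`, `size`, and `top_down_distance` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled, ordered tree node (for example, an HTML tag)."""
    label: str
    children: list = field(default_factory=list)


def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a, b):
    """Unit-cost top-down edit distance between two ordered trees.

    Assumption: a node can only be inserted or deleted together with
    its whole subtree, so the cost of such an operation is the subtree size.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two toy page trees that differ by one extra paragraph.
page_a = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("div", [Node("ul")]),
])])
page_b = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p"), Node("p")]),
    Node("div", [Node("ul")]),
])])

print(top_down_distance(page_a, page_b))  # prints 1
```

A small distance suggests that two pages share a template, which is the kind of structural signal a tree-based extractor could exploit. The recursion here recomputes subtree sizes and is meant for readability, not efficiency.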