The Web presents itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository. Although several techniques have been developed for the problem of Web data extraction, they are still not widely used, mostly because they require substantial human intervention and yield low-quality extraction results.

In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We tested our approach on several important Brazilian online news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.