This paper studies the problem of structured data extraction from arbitrary Web pages. The objective is to automatically segment data records in a page, extract data items (fields) from these records, and store the extracted data in a database. Existing methods fall into three categories. Methods in the first category provide languages that facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (data extraction programs) from human-labeled examples; manual labeling is time-consuming and hard to scale to a large number of Web sites. Methods in the third category are based on automatic pattern discovery, but they usually need multiple pages that conform to a common schema as input. In this paper, we propose a novel and effective technique, called DEPTA, to perform Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. It aligns only those data items in a pair of records that can be aligned with certainty, and makes no commitment on the remaining items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
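To make the tree matching in step 1 concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not say which tree matching algorithm is used, so the sketch assumes a simple top-down, order-preserving matching computed by dynamic programming. The `Node` type, the function names, and the normalization in `similarity` are all hypothetical, and the visual information mentioned for step 1 is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node (hypothetical type): a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum top-down matching between trees a and b.

    Assumed rules: roots match only if their tags are equal, and matched
    children must keep their left-to-right order.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized to [0, 1] (an assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two candidate records as tag trees; the second lacks the link cell.
rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td")])
rec2 = Node("tr", [Node("td", [Node("img")]), Node("td")])
print(simple_tree_matching(rec1, rec2), similarity(rec1, rec2))
```

Under these assumptions, adjacent subtrees whose similarity exceeds a chosen threshold would be treated as candidate data records. Choosing that threshold, and combining the score with visual cues, is left open here.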