Extracting structured data from Web pages

Authors:
Arvind Arasu; Hector Garcia-Molina
Affiliations:
Stanford University, Palo Alto, CA; Stanford University, Palo Alto, CA
Venue:
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Year:
2003

Citing 14
Cited 196

The TSIMMIS Approach to Mediation: Data Models and Languages

Journal of Intelligent Information Systems - Special issue: next generation information technologies and systems
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
XTRACT: a system for extracting document type descriptors from XML documents

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
Foundations of Databases: The Logical Level

Foundations of Databases: The Logical Level
A brief survey of web data extraction tools

ACM SIGMOD Record
Information Integration Using Logical Views

ICDT '97 Proceedings of the 6th International Conference on Database Theory
Optimizing Queries Across Diverse Data Sources

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Querying Heterogeneous Information Sources Using Source Descriptions

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Extracting Patterns and Relations from the World Wide Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Inductive Inference, DFAs, and Computational Complexity

AII '89 Proceedings of the International Workshop on Analogical and Inductive Inference
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources

ICDE '00 Proceedings of the 16th International Conference on Data Engineering

On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Schema-guided wrapper maintenance for web-data extraction

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Fine-grain web site structure discovery

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
On the complexity of schema inference from web pages in the presence of nullable data attributes

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Extracting unstructured data from template generated web documents

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Towards building logical views of websites

Data & Knowledge Engineering - Special issue: WIDM 2002
Understanding Web query interfaces: best-effort parsing with hidden syntax

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Information extraction using two-phase pattern discovery

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
OntoMiner: bootstrapping ontologies from overlapping domain specific web sites

Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Tree-Structured Template Generation for Web Pages

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
A two-phase sampling technique for information extraction from hidden web databases

Proceedings of the 6th annual ACM international workshop on Web information and data management
OLERA: Semisupervised Web-Data Extraction with Visual Support

IEEE Intelligent Systems
Editorial: special issue on web content mining

ACM SIGKDD Explorations Newsletter
Extracting relational data from HTML repositories

ACM SIGKDD Explorations Newsletter
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
Browsing fatigue in handhelds: semantic bookmarking spells relief

WWW '05 Proceedings of the 14th international conference on World Wide Web
An information extraction engine for web discussion forums

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Automatic wrapper maintenance for semi-structured web sources using results from previous queries

Proceedings of the 2005 ACM symposium on Applied computing
QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web

IEEE Transactions on Knowledge and Data Engineering
The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents

VLDB '05 Proceedings of the 31st international conference on Very large data bases
AutoFeed: an unsupervised learning system for generating webfeeds

Proceedings of the 3rd international conference on Knowledge capture
Web data extraction based on structural similarity

Knowledge and Information Systems
Automatic Discovery and Inferencing of Complex Bioinformatics Web Interfaces

World Wide Web
Learning Object Models from Semistructured Web Documents

IEEE Transactions on Knowledge and Data Engineering
Adaptive web information extraction

Communications of the ACM - Two decades of the language-action perspective
OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites

IEEE Intelligent Systems
Template detection for large scale search engines

Proceedings of the 2006 ACM symposium on Applied computing
L-tree match: a new data extraction model and algorithm for huge text stream with noises

Journal of Computer Science and Technology
Structure-driven crawler generation by example

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
A two-phase rule generation and optimization approach for wrapper generation

ADC '06 Proceedings of the 17th Australasian Database Conference - Volume 49
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Automatic extraction of dynamic record sections from search engine result pages

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Structured Data Extraction from the Web Based on Partial Tree Alignment

IEEE Transactions on Knowledge and Data Engineering
Logical structure analysis: From HTML to XML

Computer Standards & Interfaces
An agent- and ontology-based system for integrating public gene, protein, and disease databases

Journal of Biomedical Informatics
Sampling, information extraction and summarisation of hidden web databases

Data & Knowledge Engineering - Special issue: WIDM 2004
Automatically maintaining wrappers for semi-structured web sources

Data & Knowledge Engineering
Information categorization in web pages and sites

Web Intelligence and Agent Systems
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Semantic deep web: automatic attribute extraction from the deep web data sources

Proceedings of the 2007 ACM symposium on Applied computing
Interactive Tuples Extraction from Semi-Structured Data

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Protection Techniques from Information Extraction

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Enabling Information Integration and Workflows in a Grid Environment with Automatic Wrapper Generation

GRID '05 Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing
FLUX-CIM: flexible unsupervised extraction of citation metadata

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

World Wide Web
Extracting Web Data Using Instance-Based Learning

World Wide Web
Extraction of flat and nested data records from web pages

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
Mining templates from search result records of search engines

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Joint optimization of wrapper generation and template detection

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Webpage understanding: an integrated approach

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Elimination of junk document surrogate candidates through pattern recognition

Proceedings of the 2007 ACM symposium on Document engineering
Automatically maintaining navigation sequences for querying semi-structured web sources

Data & Knowledge Engineering
Enhancing enterprise knowledge processes via cross-media extraction

Proceedings of the 4th international conference on Knowledge capture
Routing Queries through a Peer-to-Peer InfoBeacons Network Using Information Retrieval Techniques

IEEE Transactions on Parallel and Distributed Systems
Instance-based schema matching for web databases by domain-specific query probing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
An automatic data grabber for large web sites

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Context-aware wrapping: synchronized data extraction

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Measuring the structural similarity of semistructured documents using entropy

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Extracting lists of data records from semi-structured web pages

Data & Knowledge Engineering
From dirt to shovels: fully automatic tool generation from ad hoc data

Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Flint: Google-basing the Web

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
CCReSD: concept-based categorisation of Hidden Web databases

International Journal of High Performance Computing and Networking
OntoMiner: automated metadata and instance mining from news websites

International Journal of Web and Grid Services
LearnPADS: automatic tool generation from ad hoc data

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pictor: an interactive system for importing data from a website

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Caravela: Semantic Content Management with Automatic Information Integration and Categorization (System Description)

ESWC '07 Proceedings of the 4th European conference on The Semantic Web: Research and Applications
A Workflow-Based Approach for Creating Complex Web Wrappers

WISE '08 Proceedings of the 9th international conference on Web Information Systems Engineering
Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

The Journal of Machine Learning Research
Wrapper inference for ambiguous web pages

Applied Artificial Intelligence
Automated Semantic Analysis of Schematic Data

World Wide Web
Integrating web query results: holistic schema matching

Proceedings of the 17th ACM conference on Information and knowledge management
Supporting the automatic construction of entity aware search engines

Proceedings of the 10th ACM workshop on Web information and data management
Information Extraction

Foundations and Trends in Databases
Ad Hoc Data and the Token Ambiguity Problem

PADL '09 Proceedings of the 11th International Symposium on Practical Aspects of Declarative Languages
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web
Grubber: Allowing End-Users to Develop XML-Based Wrappers for Web Data Sources

APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
ODE: Ontology-assisted data extraction

ACM Transactions on Database Systems (TODS)
Automatic hidden-web table interpretation, conceptualization, and semantic annotation

Data & Knowledge Engineering
Can we learn a template-independent wrapper for news article extraction from a single training site?

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting structured information from user queries with semi-supervised conditional random fields

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Personal News RSS Feeds Generation Using Existing News Feeds

ICWE '09 Proceedings of the 9th International Conference on Web Engineering
Profile-based focused crawling for social media-sharing websites

Journal on Image and Video Processing
Overview of autofeed: an unsupervised learning system for generating webfeeds

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 2
Automatic wrapper generation using tree matching and partial tree alignment

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 2
Deriving image-text document surrogates to optimize cognition

Proceedings of the 9th ACM symposium on Document engineering
Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns

World Wide Web
Automated document metadata extraction

Journal of Information Science
Template-independent news extraction based on visual consistency

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Managing knowledge on the Web - Extracting ontology from HTML Web

Decision Support Systems
Constructing Event Templates from Written News

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Efficient record-level wrapper induction

Proceedings of the 18th ACM conference on Information and knowledge management
Automatic web data extraction using tree alignment

Proceedings of the 18th ACM conference on Information and knowledge management
Web news categorization using a cross-media document graph

Proceedings of the ACM International Conference on Image and Video Retrieval
Information extraction for search engines using fast heuristic techniques

Data & Knowledge Engineering
Harvesting relational tables from lists on the web

Proceedings of the VLDB Endowment
FastWrap: an efficient wrapper for tabular data extraction from the web

IRI'09 Proceedings of the 10th IEEE international conference on Information Reuse & Integration
Finding and Extracting Data Records from Web Pages

Journal of Signal Processing Systems
Automatic extraction of clickable structured web contents for name entity queries

Proceedings of the 19th international conference on World wide web
Not so creepy crawler: easy crawler generation with standard xml queries

Proceedings of the 19th international conference on World wide web
Wikipedia driven autonomous label assignment in wrapper induced tables with missing column names

Proceedings of the 2010 ACM Symposium on Applied Computing
Building a scalable web query system

DNIS'07 Proceedings of the 5th international conference on Databases in networked information systems
From database to semantic web ontology: an overview

OTM'07 Proceedings of the 2007 OTM Confederated international conference on On the move to meaningful internet systems - Volume Part II
Finding and extracting data records from web pages

EUC'07 Proceedings of the 2007 international conference on Embedded and ubiquitous computing
Querying capability modeling and construction of deep web sources

WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Using clustering and edit distance techniques for automatic web data extraction

WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Automatic hidden-web table interpretation by sibling page comparison

ER'07 Proceedings of the 26th international conference on Conceptual modeling
Labeling data extracted from the web

OTM'07 Proceedings of the 2007 OTM Confederated international conference on On the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part I
An effective method supporting data extraction and schema recognition on deep web

APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Web data extraction system based on label library

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
No Code Required: Giving Users Tools to Transform the Web

No Code Required: Giving Users Tools to Transform the Web
A context-free markup language for semi-structured text

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Web page DOM node characterization and its application to page segmentation

IMSAA'09 Proceedings of the 3rd IEEE international conference on Internet multimedia services architecture and applications
Tag tree template for Web information and schema extraction

Expert Systems with Applications: An International Journal
Redundancy-driven web data extraction and integration

Proceedings of the 13th International Workshop on the Web and Databases
Automatic extraction of web data records containing user-generated content

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A unified approach for extracting multiple news attributes from news pages

PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
ObjectRunner: lightweight, targeted extraction and querying of structured web data

Proceedings of the VLDB Endowment
Collective extraction from heterogeneous web lists

Proceedings of the fourth ACM international conference on Web search and data mining
Automatic wrappers for large scale web extraction

Proceedings of the VLDB Endowment
Find this for me: mobile information retrieval on the open web

Proceedings of the 16th international conference on Intelligent user interfaces
SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement

Proceedings of the 20th international conference on World wide web
Harvesting relational tables from lists on the web

The VLDB Journal — The International Journal on Very Large Data Bases
Federated Search

Foundations and Trends in Information Retrieval
A Bayesian network modeling approach for cross media analysis

Image Communication
Wrangler: interactive visual specification of data transformation scripts

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
A framework for automatic annotation of web pages using the Google rich snippets vocabulary

Proceedings of the 2011 ACM Symposium on Applied Computing
An approach to assess the quality of web pages in the deep web

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
From one tree to a forest: a unified solution for structured web data extraction

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Accelerating dynamic web content delivery using keyword-based fragment detection

Journal of Web Engineering
Unsupervised user-generated content extraction by dependency relationships

WISE'11 Proceedings of the 12th international conference on Web information system engineering
Wrapper Generation for Overlapping Web Sources

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Towards a unified solution: data record region detection and segmentation

Proceedings of the 20th ACM international conference on Information and knowledge management
Extract knowledge from semi-structured websites for search task simplification

Proceedings of the 20th ACM international conference on Information and knowledge management
Exploiting attribute redundancy for web entity data extraction

ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation
A tool for link-based web page classification

CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence
Extracting data records from query result pages based on visual features

BNCOD'11 Proceedings of the 28th British national conference on Advances in databases
Hybrid method for automated news content extraction from the web

WISE'06 Proceedings of the 7th international conference on Web Information Systems
A query rewriting system for enhancing the queriability of form-based interface

ICADL'04 Proceedings of the 7th international Conference on Digital Libraries: international collaboration and cross-fertilization
Clustering-based schema matching of web data for constructing digital library

ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part II
PIES: a web information extraction system using ontology and tag patterns

WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
Ontology-based HTML to XML conversion

WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
Decomposition-Based optimization of reload strategies in the world wide web

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Improving web data annotations with spreading activation

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Semantic partitioning of web pages

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Extracting web data using instance-based learning

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
NET – a system for extracting web data from flat and nested data records

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Image description mining and hierarchical clustering on data records using HR-Tree

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Information extraction from semi-structured web documents

KSEM'06 Proceedings of the First international conference on Knowledge Science, Engineering and Management
RecipeCrawler: collecting recipe data from WWW incrementally

WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
CCWrapper: adaptive predefined schema guided web extraction

WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
WDEE: web data extraction by example

DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
Automatic data extraction from data-rich web pages

DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
A semantic enrichment of data tables applied to food risk assessment

DS'05 Proceedings of the 8th international conference on Discovery Science
Learning layouts of biological datasets semi-automatically

DILS'05 Proceedings of the Second international conference on Data Integration in the Life Sciences
Preloading browsers for optimizing automatic access to hidden web: a ranking-based repository solution

ADBIS'06 Proceedings of the 10th East European conference on Advances in Databases and Information Systems
Optimization of automatic navigation to hidden web pages by ranking-based browser preloading

DEECS'06 Proceedings of the Second international conference on Data Engineering Issues in E-Commerce and Services
Maintaining web navigation flows for wrappers

DEECS'06 Proceedings of the Second international conference on Data Engineering Issues in E-Commerce and Services
Chapter 6: web data extraction for service creation

Search Computing
An analysis of structured data on the web

Proceedings of the VLDB Endowment
LearnPADS++: incremental inference of ad hoc data formats

PADL'12 Proceedings of the 14th international conference on Practical Aspects of Declarative Languages
Data extraction from web pages based on structural-semantic entropy

Proceedings of the 21st international conference companion on World Wide Web
AMBER: turning annotations into knowledge

Proceedings of the 21st international conference companion on World Wide Web
Intelligent web navigation

FDIA'09 Proceedings of the Third BCS-IRSG conference on Future Directions in Information Access
Retrieving informative content from web pages with conditional learning of support vector machines and semantic analysis

ICAISC'12 Proceedings of the 11th international conference on Artificial Intelligence and Soft Computing - Volume Part II
Self-supervised learning approach for extracting citation information on the web

APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Research directions in data wrangling: visualizations and transformations for usable and credible data

Information Visualization - Special issue on State of the Field and New Research Directions
LIEGE: link entities in web lists with knowledge base

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Peer matrix alignment: a new algorithm

PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part II
A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Measuring structural similarity of semistructured data based on information-theoretic approaches

The VLDB Journal — The International Journal on Very Large Data Bases
Learning to perceive two-dimensional displays using probabilistic grammars

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II
TEX: An efficient and effective unsupervised Web information extractor

Knowledge-Based Systems
An unsupervised technique to extract information from semi-structured web pages

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Towards discovering ontological models from big RDF data

ER'12 Proceedings of the 2012 international conference on Advances in Conceptual Modeling
Towards discovering conceptual models behind web sites

ER'12 Proceedings of the 31st international conference on Conceptual Modeling
Assessing relevance and trust of the deep web sources and results based on inter-source agreement

ACM Transactions on the Web (TWEB)
Unsupervised wrapper induction using linked data

Proceedings of the seventh international conference on Knowledge capture
Discovering interesting information with advances in web technology

ACM SIGKDD Explorations Newsletter
Visually extracting data records from the deep web

Proceedings of the 22nd international conference on World Wide Web companion
A framework for learning web wrappers from the crowd

Proceedings of the 22nd international conference on World Wide Web
Web news extraction via path ratios

Proceedings of the 22nd ACM international conference on Information & knowledge management
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)
Exploiting a proximity-based positional model to improve the quality of information extraction by text segmentation

ADC '13 Proceedings of the Twenty-Fourth Australasian Database Conference - Volume 137
Discovering implicit schemas in JSON data

ICWE'13 Proceedings of the 13th international conference on Web Engineering
Extraction and integration of partially overlapping web sources

Proceedings of the VLDB Endowment
Scalable and noise tolerant web knowledge extraction for search task simplification

Decision Support Systems
Leveraging spatial join for robust tuple extraction from web pages

Information Sciences: an International Journal
CALA: An unsupervised URL-based web page classification system

Knowledge-Based Systems
Agreement based source selection for the multi-topic deep web integration

Proceedings of the 17th International Conference on Management of Data

Abstract

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
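
To make the input/output contract described in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: as a deliberately simplified stand-in, it assumes the template is the sequence of tokens shared by every page (a longest common subsequence folded across pages) and treats whatever falls between template tokens as extracted values. The function names (`tokenize`, `infer_template`, `extract`) and the sample pages are hypothetical, chosen only for illustration.

```python
import re


def tokenize(page):
    # Split a page into HTML tags and whitespace-delimited words.
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def lcs(a, b):
    # Longest common subsequence of two token lists (dynamic programming).
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def infer_template(pages):
    # ASSUMPTION (toy model): tokens common to all pages, in order, form the template.
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = lcs(template, tokens)
    return template


def extract(template, page):
    # Walk the page; tokens that do not match the next template token are data.
    values, gap, t = [], [], 0
    for tok in tokenize(page):
        if t < len(template) and tok == template[t]:
            if gap:
                values.append(" ".join(gap))
            gap = []
            t += 1
        else:
            gap.append(tok)
    if gap:
        values.append(" ".join(gap))
    return values


# Hypothetical sample pages sharing one layout.
pages = [
    "<html><h1>Title:</h1> <b>Dune</b> <i>Frank Herbert</i></html>",
    "<html><h1>Title:</h1> <b>Emma</b> <i>Jane Austen</i></html>",
]
template = infer_template(pages)
rows = [extract(template, p) for p in pages]
print(rows)  # [['Dune', 'Frank Herbert'], ['Emma', 'Jane Austen']]
```

This toy version breaks as soon as a data value happens to appear on every page or repeats a template token, and it handles neither optional nor repeated fields; it only shows the shape of the problem: pages in, template and value tuples out, with no labeled examples.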