Record-boundary discovery in Web documents

Authors:
D. W. Embley; Y. Jiang; Y.-K. Ng
Affiliations:
Dept. of Computer Science, Brigham Young University, Provo, Utah (all three authors)
Venue:
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Year:
1999

Citing 10
Cited 88

Artificial intelligence (2nd ed.): structures and strategies for complex problem-solving

Artificial intelligence (2nd ed.): structures and strategies for complex problem-solving
Cut and paste

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
A scalable comparison-shopping agent for the World-Wide Web

AGENTS '97 Proceedings of the first international conference on Autonomous agents
Wrapper generation for semi-structured Internet sources

ACM SIGMOD Record
Virtual database technology

ACM SIGMOD Record
NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Ontology-based extraction and structuring of information from data-rich unstructured documents

Proceedings of the seventh international conference on Information and knowledge management
Adding Structure to Unstructured Data

ICDT '97 Proceedings of the 6th International Conference on Database Theory
Semi-Automatic Wrapper Generation for Internet Information Sources

COOPIS '97 Proceedings of the Second IFCIS International Conference on Cooperative Information Systems
A Conceptual-Modeling Approach to Extracting Data from the Web

ER '98 Proceedings of the 17th International Conference on Conceptual Modeling

Learning to extract hierarchical information from semi-structured documents

Proceedings of the ninth international conference on Information and knowledge management
Function-based object model towards website adaptation

Proceedings of the 10th international conference on World Wide Web
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
Automatic segmentation of text into structured records

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
A brief survey of web data extraction tools

ACM SIGMOD Record
DEByE - Data extraction by example

Data & Knowledge Engineering
Automatic information extraction from semi-structured Web pages by pattern discovery

Decision Support Systems - Web retrieval and mining
Extracting Information from Semi-structured Web Documents

OOIS '02 Proceedings of the Workshops on Advances in Object-Oriented Information Systems
Toward Learning Based Web Query Processing

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Extracting Information from Semistructured Data

WAIM '02 Proceedings of the Third International Conference on Advances in Web-Age Information Management
Recognition of Common Areas in a Web Page Using a Visualization Approach

AIMSA '02 Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications
Applying Pattern Mining to Web Information Extraction

PAKDD '01 Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining
Recognizing Ontology-Applicable Multiple-Record Web Documents

ER '01 Proceedings of the 20th International Conference on Conceptual Modeling: Conceptual Modeling
Visual Based Content Understanding towards Web Adaptation

AH '02 Proceedings of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems
Automatic Wrapper Generation for Multilingual Web Resources

DS '02 Proceedings of the 5th International Conference on Discovery Science
SCOOP: A Record Extractor without Knowledge on Input

DS '01 Proceedings of the 4th International Conference on Discovery Science
Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

DS '01 Proceedings of the 4th International Conference on Discovery Science
Information Extraction - Tree Alignment Approach to Pattern Discovery in Web Documents

DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
Improving pseudo-relevance feedback in web information retrieval using web page segmentation

WWW '03 Proceedings of the 12th international conference on World Wide Web
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
Document classification via structure synopses

ADC '03 Proceedings of the 14th Australasian database conference - Volume 17
Performing Binary-Categorization on Multiple-Record Web Documents Using Information Retrieval Models and Application Ontologies

World Wide Web
A Fully Automated Object Extraction System for the World Wide Web

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
An Ontology-Based Binary-Categorization Approach for Recognizing Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Information Retrieval
On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Hearsay: enabling audio browsing on hypertext content

Proceedings of the 13th international conference on World Wide Web
Learning effective ranking functions for newsgroup search

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Block-based web search

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Toward semantic understanding: an approach based on information extraction ontologies

ADC '04 Proceedings of the 15th Australasian database conference - Volume 27
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic composite wrapper generation for semi-structured biological data based on table structure identification

ACM SIGMOD Record
Tree-Structured Template Generation for Web Pages

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Structured databases on the web: observations and implications

ACM SIGMOD Record
Mining Web Pages for Data Records

IEEE Intelligent Systems
Editorial: special issue on web content mining

ACM SIGKDD Explorations Newsletter
WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model

IEEE Transactions on Knowledge and Data Engineering
Bootstrapping Semantic Annotation for Content-Rich HTML Documents

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
A study on combination of block importance and relevance to estimate page relevance

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Automating the extraction of data from HTML tables with unknown structure

Data & Knowledge Engineering - Special issue: ER 2002
STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques

IEEE Transactions on Knowledge and Data Engineering
Adaptive web information extraction

Communications of the ACM - Two decades of the language-action perspective
A web browsing system based on adaptive presentation of web contents for cellular phones

W4A '06 Proceedings of the 2006 international cross-disciplinary workshop on Web accessibility (W4A): Building the mobile web: rediscovering accessibility?
L-tree match: a new data extraction model and algorithm for huge text stream with noises

Journal of Computer Science and Technology
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic extraction of dynamic record sections from search engine result pages

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Structured Data Extraction from the Web Based on Partial Tree Alignment

IEEE Transactions on Knowledge and Data Engineering
Exploiting types for improved schema mapping

Proceedings of the 2007 ACM symposium on Applied computing
Extracting Web Data Using Instance-Based Learning

World Wide Web
Extraction of flat and nested data records from web pages

AusDM '06 Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61
Mining templates from search result records of search engines

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Webpage understanding: an integrated approach

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
OPA browser: a web browser for cellular phone users

Proceedings of the 20th annual ACM symposium on User interface software and technology
Information overload in non-visual web transaction: context analysis spells relief

Proceedings of the 9th international ACM SIGACCESS conference on Computers and accessibility
Geo-tagging for imprecise regions of different sizes

Proceedings of the 4th ACM workshop on Geographical information retrieval
A methodical approach to extracting interesting objects from dynamic web pages

International Journal of Web and Grid Services
Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

The Journal of Machine Learning Research
Wrapper inference for ambiguous Web pages

Applied Artificial Intelligence
Automated Semantic Analysis of Schematic Data

World Wide Web
Closing the loop in webpage understanding

Proceedings of the 17th ACM conference on Information and knowledge management
Categorisation of web documents using extraction ontologies

International Journal of Metadata, Semantics and Ontologies
Spatial Relation Based Object Extraction from the World Wide Web

WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Information Extraction

Foundations and Trends in Databases
Automatic wrapper generation using tree matching and partial tree alignment

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 2
Learning field compatibilities to extract database records from unstructured text

EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Enhanced Gestalt Theory Guided Web Page Segmentation for Mobile Browsing

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Harvesting relational tables from lists on the web

Proceedings of the VLDB Endowment
Reusing ontologies and language components for ontology generation

Data & Knowledge Engineering
Extracting content structure for web pages based on visual representation

APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications
Using structured tokens to identify webpages for data extraction

APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Tag tree template for Web information and schema extraction

Expert Systems with Applications: An International Journal
An agent-based system framework for multi-slot web information extraction

CAR'10 Proceedings of the 2nd international Asia conference on Informatics in control, automation and robotics - Volume 3
Automatic extraction of web data records containing user-generated content

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Harvesting relational tables from lists on the web

The VLDB Journal — The International Journal on Very Large Data Bases
Accessibility summarization & simplification in a template-based web transcoder

Journal of Web Engineering
An indent shape based approach for web lists mining

WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Towards a unified solution: data record region detection and segmentation

Proceedings of the 20th ACM international conference on Information and knowledge management
Block-based language modeling approach towards web search

APWeb'05 Proceedings of the 7th Asia-Pacific web conference on Web Technologies Research and Development
Extracting web data using instance-based learning

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Ontology-driven automatic entity disambiguation in unstructured text

ISWC'06 Proceedings of the 5th international conference on The Semantic Web
The HiLeX system for semantic information extraction

Transactions on Large-Scale Data- and Knowledge-Centered Systems V
A Web-based resource model for scholarship 2.0: object reuse & exchange

Concurrency and Computation: Practice & Experience
Annotation and Auto-Scrolling for Web Page Overview in Mobile Web Browsing

International Journal of Handheld Computing Research
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)
Locating Discharge Medications in Natural Language Summaries

Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics

Abstract

Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By “record” we mean a group of information relevant to some entity.) Unless documents that contain multiple records are first chunked along record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
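
The pipeline summarized in the abstract (tag tree, record subtree, candidate separator tags, consensus) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the record subtree is taken to be the node with the most child elements, only two stand-in heuristics are used (how often a tag occurs, and how evenly it splits the text) rather than the paper's five, and the combined heuristic is replaced by a plain rank sum. All class and function names are hypothetical.

```python
from html.parser import HTMLParser
from statistics import pstdev

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag


class Node:
    def __init__(self, tag, parent=None, text_len=0):
        self.tag, self.parent, self.text_len = tag, parent, text_len
        self.children = []

    def elements(self):
        # Child tags only; text chunks are stored as "#text" nodes.
        return [c for c in self.children if c.tag != "#text"]

    def text_size(self):
        if self.tag == "#text":
            return self.text_len
        return sum(c.text_size() for c in self.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested HTML tags from a document."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            text = Node("#text", self.current, len(data.strip()))
            self.current.children.append(text)


def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)


def record_subtree(root):
    # Assumption: the records live under the node with the most child tags.
    return max(all_nodes(root), key=lambda n: len(n.elements()))


def interval_sizes(subtree, tag):
    # Text size from each occurrence of `tag` up to the next one (or the end).
    sizes = []
    for child in subtree.children:
        if child.tag == tag:
            sizes.append(0)
        if sizes:
            sizes[-1] += child.text_size()
    return sizes


def ranks(scores):
    # Lower score is better; tied tags share a rank.
    return {t: 1 + sum(o < s for o in scores.values()) for t, s in scores.items()}


def consensus_separator(subtree):
    counts = {}
    for child in subtree.elements():
        counts[child.tag] = counts.get(child.tag, 0) + 1
    # Assumption: a separator must occur at least twice.
    candidates = [t for t, c in counts.items() if c >= 2]
    by_count = ranks({t: -counts[t] for t in candidates})
    by_spread = ranks({t: pstdev(interval_sizes(subtree, t)) for t in candidates})
    # Stand-in for the combined heuristic: smallest rank sum wins.
    return min(candidates, key=lambda t: by_count[t] + by_spread[t], default=None)


html = (
    "<html><body><h1>Notices</h1><div>"
    "<hr><b>Ann Lee</b> died May 1. Services Friday."
    "<hr>Bob Ray died May 2.<br>Services Sunday."
    "<hr><b>Cy Poe</b> died May 3.<br>Services Monday."
    "<hr><b>Di Orr</b> died May 4. Services Tuesday."
    "</div></body></html>"
)
builder = TagTreeBuilder()
builder.feed(html)
builder.close()
subtree = record_subtree(builder.root)
separator = consensus_separator(subtree)
record_sizes = interval_sizes(subtree, separator)
print(subtree.tag, separator, record_sizes)
```

In this toy document the four notices are delimited by `hr`. The `b` and `br` tags also repeat inside the subtree, but they split the text less evenly, which is the kind of disagreement among individual heuristics that a consensus step is meant to resolve.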