Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By “record” we mean a group of information relevant to some entity.) Unless documents that contain multiple records are first chunked along record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
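The pipeline described above (tag tree, record subtree, candidate separators, consensus) can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not name the five heuristics, so the sketch substitutes two illustrative ones (tag frequency and a fixed preference list) and combines them with a simple rank sum. It also assumes the record subtree is the node with the most direct children. All names here (`Node`, `TagTreeBuilder`, `PREFERRED`, `find_separator`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Hypothetical preference order for separator tags (an assumption, not from the paper).
PREFERRED = ["hr", "tr", "li", "p", "br", "div"]


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested tags; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)


def highest_fanout(root):
    # Assumption: the records of interest sit under the node with most children.
    return max(all_nodes(root), key=lambda n: len(n.children))


def rank_by_count(tags):
    # Illustrative heuristic 1: more frequent child tags rank higher.
    return [tag for tag, _ in Counter(tags).most_common()]


def rank_by_preference(tags):
    # Illustrative heuristic 2: tags on the fixed preference list rank first.
    present = set(tags)
    known = [tag for tag in PREFERRED if tag in present]
    return known + sorted(present - set(known))


def consensus_separator(subtree):
    # Rank-sum combination (a stand-in for the paper's combined heuristic).
    tags = [child.tag for child in subtree.children]
    score = Counter()
    for ranking in (rank_by_count(tags), rank_by_preference(tags)):
        for position, tag in enumerate(ranking):
            score[tag] += position
    return min(score, key=lambda tag: (score[tag], tag))


def find_separator(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    subtree = highest_fanout(builder.root)
    return subtree.tag, consensus_separator(subtree)


if __name__ == "__main__":
    page = ("<html><body><h1>Listings</h1><div>"
            "<hr><b>A</b><p>x</p><hr><b>B</b><p>y</p><hr><b>C</b><p>z</p><hr>"
            "</div></body></html>")
    print(find_separator(page))
```

Each step makes a single pass over the tag tree or over the children of the chosen subtree, so the sketch as a whole runs in time linear in the number of tags, in the spirit of the running-time claim above.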