Mining data records in Web pages

Authors:
Bing Liu; Robert Grossman; Yanhong Zhai
Affiliations:
University of Illinois at Chicago, Chicago, IL (all authors)
Venue:
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2003

Abstract

A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. Mining these records makes it possible to extract their information and provide value-added services. Existing automatic techniques are unsatisfactory because of their poor accuracy. In this paper, we propose a more effective technique for the task. The technique is based on two observations about data records on the Web and a string matching algorithm. It can mine both contiguous and non-contiguous data records. Our experimental results show that the proposed technique substantially outperforms existing techniques.
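
The abstract does not say which string matching algorithm is used or what the two observations are. As a rough illustration only, the sketch below assumes that each candidate region of a page has been reduced to a list of HTML tag names, and that similarity is measured with a normalized edit distance. The function names, the 0.3 threshold, and the restriction to adjacent (contiguous) siblings are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (here: tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def similar_runs(tag_strings: List[List[str]],
                 threshold: float = 0.3) -> List[List[int]]:
    """Return runs of adjacent sibling indices whose tag strings are similar.

    The threshold is a hypothetical value chosen for this sketch. Only
    contiguous runs are found; non-contiguous records are out of scope here.
    """
    runs: List[List[int]] = []
    current = [0] if tag_strings else []
    for i in range(1, len(tag_strings)):
        if normalized_distance(tag_strings[i - 1], tag_strings[i]) <= threshold:
            current.append(i)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [i]
    if len(current) > 1:
        runs.append(current)
    return runs
```

Given sibling tag strings such as a heading, three near-identical table rows, and a trailing paragraph, `similar_runs` would group the three rows into one candidate run. Comparing tag names rather than text is a design choice of this sketch: it lets records with different content but the same markup pattern match each other.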