Normalized Cuts and Image Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
A brief survey of web data extraction tools
ACM SIGMOD Record
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases
WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Fully Automated Object Extraction System for the World Wide Web
ICDCS '01 Proceedings of the The 21st International Conference on Distributed Computing Systems
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Using the structure of Web sites for automatic segmentation of tables
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Testbed for information extraction from deep web
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Structured databases on the web: observations and implications
ACM SIGMOD Record
Fully automatic wrapper generation for search engines
WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
ViPER: augmenting automatic information extraction with visual perceptions
Proceedings of the 14th ACM international conference on Information and knowledge management
Simultaneous record detection and attribute labeling in web data extraction
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Mining templates from search result records of search engines
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
UQBE: uncertain query by example for web service mashup
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
NET – a system for extracting web data from flat and nested data records
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Information extraction for search engines using fast heuristic techniques
Data & Knowledge Engineering
Automatic extraction of clickable structured web contents for name entity queries
Proceedings of the 19th international conference on World wide web
Entity relation discovery from web tables and links
Proceedings of the 19th international conference on World wide web
Automatic extraction of web data records containing user-generated content
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Exploiting content redundancy for web information extraction
Proceedings of the VLDB Endowment
A novel method for bilingual web page acquisition from search engine web records
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Lightweight collaboration management
Proceedings of the 3rd and 4th International Workshop on Web APIs and Services Mashups
Semi-supervised truth discovery
Proceedings of the 20th international conference on World wide web
FACTO: a fact lookup engine based on web tables
Proceedings of the 20th international conference on World wide web
Unexpected results in automatic list extraction on the web
ACM SIGKDD Explorations Newsletter
Harvesting relational tables from lists on the web
The VLDB Journal — The International Journal on Very Large Data Bases
A local information passing clustering algorithm for tagging systems
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Web information extraction using markov logic networks
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting general lists from web documents: a hybrid approach
IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
Concluding pattern of web page based on string pattern matching
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
News information extraction based on adaptive weighting using unsupervised Bayesian algorithm
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Semi-supervised multi-task learning of structured prediction models for web information extraction
Proceedings of the 20th ACM international conference on Information and knowledge management
Towards a unified solution: data record region detection and segmentation
Proceedings of the 20th ACM international conference on Information and knowledge management
Exploiting attribute redundancy for web entity data extraction
ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation
Extracting data records from query result pages based on visual features
BNCOD'11 Proceedings of the 28th British national conference on Advances in databases
Data extraction for search engine using safe matching
AI'11 Proceedings of the 24th international conference on Advances in Artificial Intelligence
APPECT: an approximate backbone-based clustering algorithm for tags
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part I
Building enriched web page representations using link paths
Proceedings of the 23rd ACM conference on Hypertext and social media
A system for extracting top-K lists from the web
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting data records from web using suffix tree
Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Clustering visually similar web page elements for structured web data extraction
ICWE'12 Proceedings of the 12th international conference on Web Engineering
Web-based closed-domain data extraction on online advertisements
Information Systems
Multiple sections extraction using visual cue
ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part V
Towards web-scale structured web data extraction
Proceedings of the sixth ACM international conference on Web search and data mining
Visually extracting data records from the deep web
Proceedings of the 22nd international conference on World Wide Web companion
Structured positional entity language model for enterprise entity retrieval
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model
ACM Transactions on the Web (TWEB)
The parallel path framework for entity discovery on the web
ACM Transactions on the Web (TWEB)
Linkage of compound objects for supporting maintenance of large-scale web sites
Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
Scalable and noise tolerant web knowledge extraction for search task simplification
Decision Support Systems
Hi-index | 0.00 |
Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation -- their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.