Recent trends in hierarchic document clustering: a critical review
Information Processing and Management: an International Journal
A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
A flexible learning system for wrapping tables and lists in HTML documents
Proceedings of the 11th international conference on World Wide Web
A brief survey of web data extraction tools
ACM SIGMOD Record
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases
WWW '03 Proceedings of the 12th international conference on World Wide Web
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
Tree-Structured Template Generation for Web Pages
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Fully automatic wrapper generation for search engines
WWW '05 Proceedings of the 14th international conference on World Wide Web
Thresher: automating the unwrapping of semantic content from the World Wide Web
WWW '05 Proceedings of the 14th international conference on World Wide Web
Web wrapper induction: a brief survey
AI Communications
Joint optimization of wrapper generation and template detection
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Pictor: an interactive system for importing data from a website
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic extraction of web data records containing user-generated content
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Unsupervised user-generated content extraction by dependency relationships
WISE'11 Proceedings of the 12th international conference on Web information system engineering
Towards a unified solution: data record region detection and segmentation
Proceedings of the 20th ACM international conference on Information and knowledge management
Extracting data records from web using suffix tree
Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Turn the page: automated traversal of paginated websites
ICWE'12 Proceedings of the 12th international conference on Web Engineering
Towards web-scale structured web data extraction
Proceedings of the sixth ACM international conference on Web search and data mining
SearchResultFinder: federated search made easy
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model
ACM Transactions on the Web (TWEB)
Accelerating Structured Web Crawling without Losing Data
Proceedings of International Conference on Information Integration and Web-based Applications & Services
Hi-index | 0.00 |
Web information is often presented in the form of record, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, how to identify relevant records as well as their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have issues dealing with web records since they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel ``broom'' structure to represent both records and generated wrappers. With such representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99\% extraction accuracy in terms of F1-Value.