A result page returned by a search engine may organize its search results into multiple dynamically generated sections in response to a user query. Such a page often also contains information irrelevant to the query, such as information about the site hosting the search engine. In this paper, we present a method for automatically generating wrappers that extract search result records from all dynamic sections on result pages returned by search engines. The method has two novel features: (1) it aims to explicitly identify all dynamic sections, including those that do not appear on the sample result pages used to generate the wrapper, and (2) it addresses the problem of correctly differentiating sections from records. Experimental results indicate that the method is very promising. Automatic search result record extraction is critical for applications that need to interact with search engines, such as the automatic construction and maintenance of metasearch engines and deep Web crawling.