SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text
Machine Learning - Special issue on natural language learning
Relational learning of pattern-match rules for information extraction
AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference
Extracting semi-structured data through examples
Proceedings of the eighth international conference on Information and knowledge management
Conceptual-model-based data extraction from multiple-record Web pages
Data & Knowledge Engineering
Machine Learning for Information Extraction in Informal Domains
Machine Learning - Special issue on information retrieval
Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
Building intelligent web applications using lightweight wrappers
Data & Knowledge Engineering - Special issue on heterogeneous information resources need semantic access
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
A brief survey of web data extraction tools
ACM SIGMOD Record
DEByE - Data extraction by example
Data & Knowledge Engineering
Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems
WebOQL: Restructuring Documents, Databases, and Webs
ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
A Fully Automated Object Extraction System for the World Wide Web
ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
PEWeb: Product Extraction from the Web Based on Entropy Estimation
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Semistructured data: the TSIMMIS experience
ADBIS'97 Proceedings of the First East-European conference on Advances in Databases and Information systems
With the explosive growth of commercial websites and Internet-based services, efficient search services specialized for product information have become crucial. We share the observation made in PEWeb [24] that products are almost always displayed as a series of similar-looking information blocks showing features and prices for customers to choose from, so the DOM tree of the webpage contains similar subtrees in the parts corresponding to the product display areas. We propose using a special hash function, Simhash [18], to identify these product regions: subtrees of the DOM tree with similar structures have similar Simhash fingerprints, differing in only a few bits. To eliminate false detections from this first, Simhash-based phase, we combine it with a decision tree approach, which gives us more flexibility, especially for product websites developed by Vietnamese companies, which favor certain display formats that are not common worldwide. Compared to PEWeb, our scheme is more refined and flexible because it offers more options for adjustment. The resulting improvement in precision is strongly supported by experimental results.