We have developed an unsupervised framework for simultaneously extracting and normalizing product attributes from multiple Web pages originating from different sites. The framework is built on a probabilistic graphical model that captures both the page-independent content information and the page-dependent layout information of the text fragments in Web pages. One characteristic of the framework is that previously unseen attributes can be discovered from clues contained in the layout format of the text fragments. The framework tackles the extraction and normalization tasks together by jointly considering the relationship between content and layout information. A Dirichlet process prior is employed, which has the further advantage that the number of discovered product attributes need not be fixed in advance and is unbounded. An unsupervised inference algorithm based on a variational method is presented. The semantics of the normalized attributes can be visualized by examining the term weights in the model. The framework can be applied to a wide range of Web mining applications, such as product matching and retrieval. We have conducted extensive experiments on four different domains, covering over 300 Web pages from over 150 different Web sites, which demonstrate the robustness and effectiveness of the framework.
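To make the role of the Dirichlet process prior more concrete, the following is a minimal sketch of the standard stick-breaking construction, which shows why the number of components (here, attributes) does not have to be chosen up front. It is not the model or the variational inference algorithm described in the abstract. The function name `stick_breaking_weights`, the concentration value `alpha=2.0`, and the truncation thresholds are all assumptions made for illustration.

```python
import random


def stick_breaking_weights(alpha, rng, min_remaining=1e-3, max_components=1000):
    """Draw mixture weights from a truncated stick-breaking process.

    Hypothetical helper for illustration only: each step breaks off a
    Beta(1, alpha) fraction of the stick that is still unassigned, so new
    components keep appearing until the leftover mass is negligible.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any component
    while remaining > min_remaining and len(weights) < max_components:
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights, leftover = stick_breaking_weights(alpha=2.0, rng=rng)
print(f"{len(weights)} components, leftover mass {leftover:.5f}")
```

A larger `alpha` tends to spread the mass over more components, while a smaller one concentrates it on a few. In either case the count of components comes from the draw itself rather than from a preset parameter, which is the property the abstract refers to.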