The World Wide Web provides continuous sources of information with similar semantic structure, such as news feeds, user reviews, and user comments on various topics. These sources are essential for online opinion mining. This paper proposes a computationally efficient algorithm for structured information extraction from web pages. The algorithm combines analysis of structured data with natural language processing of text content. It maps HTML pages containing news, reviews, or user comments to a custom-designed, RSS-feed-like structure. The extracted information usually includes the textual opinions and factual information such as publication date, product price, author name, and influence. Because the data sources are real-time, the computational complexity of such a solution should be linear or close to linear. The computational complexity of the proposed algorithm is linear, whereas similar previously published approaches have complexity of at least O(n²). Experiments with real-world data achieve extraction accuracy of 84% to 92%, which is comparable to recent results in this field. Finally, the paper discusses the experimental results and shares experience that can be useful for applying the algorithm in other domains.
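The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of the general idea of mapping an HTML page to RSS-like records in a single pass, not the paper's method. It assumes well-formed nesting and uses hypothetical class names (`comment`, `author`, `date`) to mark records and fields; a real page would need its own markers. Because each parser event is handled once with constant work, the running time grows linearly with the size of the page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Hypothetical class names marking factual fields inside a record.
FIELD_CLASSES = {"author", "date"}


class FeedItemExtractor(HTMLParser):
    """Collects RSS-like records in one pass over the HTML token stream."""

    def __init__(self, item_class="comment"):
        super().__init__(convert_charrefs=True)
        self.item_class = item_class  # hypothetical marker for one record
        self.items = []
        self._current = None  # record being filled, or None when outside one
        self._stack = []      # target field for each open element in the record

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Treat a line break or image as a word separator.
            if self._current is not None:
                self._current[self._stack[-1]] += " "
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self._current is None:
            if self.item_class in classes:
                self._current = {"author": "", "date": "", "text": ""}
                self._stack = ["text"]
            return
        named = FIELD_CLASSES & classes
        # A child inherits its parent's field unless it names its own.
        self._stack.append(named.pop() if named else self._stack[-1])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._stack.pop()
        if not self._stack:  # the record's root element has closed
            record = {k: " ".join(v.split()) for k, v in self._current.items()}
            self.items.append(record)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current[self._stack[-1]] += data


def extract_items(html, item_class="comment"):
    """Return a list of dicts with 'author', 'date', and 'text' keys."""
    parser = FeedItemExtractor(item_class)
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_items(page_html)` returns one dictionary per record, which could then be serialized as feed entries. The sketch covers only the structural half of the pipeline; the natural language processing of the extracted text that the abstract mentions is out of scope here.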