Forums (or discussion boards) are large information collections organized into boards, threads, and posts. The basic information entity of a forum is the post, which carries the author, the date and time of posting, the actual content, and so on. This information is valuable for applications such as gathering market intelligence and analyzing customer perceptions. However, automatically extracting it from a forum is an extremely challenging task. Several customized parsers exist for extracting information from a particular forum platform with a specific template (e.g. SMF or phpBB), but they depend on the forum platform and the template used, which makes them impractical in real-world settings. In this paper we therefore propose a semi-automatic, rule-based solution for extracting forum post information and inserting it into a database for analysis. The key challenge with this solution is identifying the extraction rules, which are normally specific to the forum platform and template. To address this, we analyzed 72 forums to derive these rules and to test the performance of the algorithm. The results indicate that we were able to extract all the required information from forums running on the SMF and phpBB platforms, which represent the majority of forums on the web.
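To make the idea of platform-specific extraction rules concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes that each platform's rule is simply a set of CSS class names marking the post container and its author, date, and content fields. The platform names, class names, table schema, and the helpers `extract_posts` and `store_posts` are all hypothetical; real SMF or phpBB templates use different markup.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical rules: these class names are illustrative placeholders,
# not the actual markup produced by SMF or phpBB templates.
RULES = {
    "platform_a": {"post": "post", "author": "author",
                   "date": "postdate", "content": "content"},
    "platform_b": {"post": "msg", "author": "poster",
                   "date": "when", "content": "body"},
}

# Tags without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collects author, date, and content text for each post container."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.field_by_class = {rule[f]: f for f in ("author", "date", "content")}
        self.posts = []
        self.current = None  # post record currently being filled
        self.field = None    # field currently being captured
        self.depth = 0       # nesting depth inside the captured field

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "br" and self.field is not None:
                self.current[self.field] += " "  # treat a line break as a space
            return
        if self.field is not None:
            self.depth += 1  # nested element inside a captured field
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.rule["post"] in classes:
            # A new post container starts a new record.
            self.current = {"author": "", "date": "", "content": ""}
            self.posts.append(self.current)
            return
        if self.current is not None:
            for cls in classes:
                if cls in self.field_by_class:
                    self.field = self.field_by_class[cls]
                    self.depth = 1
                    return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.field is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.field = None  # the captured field's element has closed

    def handle_data(self, data):
        if self.field is not None:
            self.current[self.field] += data


def extract_posts(html, platform):
    """Apply the rule set for `platform` and return a list of post dicts."""
    parser = PostExtractor(RULES[platform])
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace in every extracted field.
    return [{k: " ".join(v.split()) for k, v in p.items()} for p in parser.posts]


def store_posts(conn, posts):
    """Insert extracted posts into a (hypothetical) `posts` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, date TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (:author, :date, :content)", posts
    )
    conn.commit()


SAMPLE = """
<div class="post">
  <span class="author">alice</span>
  <span class="postdate">2012-01-05 10:30</span>
  <div class="content">First <b>post</b> here.<br>Second line.</div>
</div>
<div class="post">
  <span class="author">bob</span>
  <span class="postdate">2012-01-05 11:02</span>
  <div class="content">A reply.</div>
</div>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE, "platform_a"))
    for row in conn.execute("SELECT author, date, content FROM posts"):
        print(row)
```

Keeping the rules in a plain dictionary means that supporting another platform or template only requires adding an entry rather than writing a new parser, which is the kind of dependence on platform and template that the abstract identifies as the main obstacle.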