Semi-automatic information extraction from discussion boards with applications for anti-spam technology

Authors:
Saeed Sarencheh;Vidyasagar Potdar;Elham Afsari Yeganeh;Nazanin Firoozeh
Affiliations:
Institute for Advanced Studies in Basic Sciences, IASBS, Zanjan, Iran;Anti-Spam Research Lab (ASRL) Digital Ecosystems and Business Intelligence Institute, Curtin University, Perth, Australia;Institute for Advanced Studies in Basic Sciences, IASBS, Zanjan, Iran;Institute for Advanced Studies in Basic Sciences, IASBS, Zanjan, Iran
Venue:
ICCSA'10 Proceedings of the 2010 international conference on Computational Science and Its Applications - Volume Part II
Year:
2010


Abstract

Forums (or discussion boards) hold a huge collection of information organized into boards, threads, and posts. The basic information entity of a forum is the post, which records the author, the date and time of posting, the actual content, and so on. This information is valuable for applications such as gathering market intelligence and analyzing customer perceptions. However, automatically extracting it from a forum is an extremely challenging task. Several customized parsers exist that extract information from a particular forum platform with a specific template (e.g., SMF or phpBB), but because they depend on both the platform and the template, they are unrealistic to use in practical situations. Hence, in this paper we propose a semi-automatic, rule-based solution that extracts forum post information and inserts it into a database for analysis. The key challenge with this solution is identifying the extraction rules, which are normally specific to the forum platform and template. We therefore analyzed 72 forums to derive these rules and to test the performance of the algorithm. The results indicate that we were able to extract all the required information from the SMF and phpBB forum platforms, which represent the majority of forums on the web.
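
The abstract does not give the extraction rules themselves, so the following is only a minimal sketch of what a rule-based post extractor feeding a database could look like. It assumes that each rule maps a post field (author, date, content) to a CSS class in the forum template, and that a person supplies one rule set per platform, which is one possible reading of "semi-automatic". The platform names, class names, sample markup, and table schema are all hypothetical and are not taken from the paper or from the real SMF or phpBB templates.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical rules: for each platform, map a post field to the CSS class
# that marks it. These names are illustrative, not the paper's actual rules.
RULES = {
    "platform_a": {"author": "post-author", "date": "post-date", "content": "post-body"},
    "platform_b": {"author": "poster", "date": "posted-on", "content": "message"},
}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collects the text inside elements whose class matches a rule."""

    def __init__(self, rule):
        super().__init__()
        self.class_to_field = {cls: field for field, cls in rule.items()}
        self.posts = []     # completed posts
        self.current = {}   # fields of the post being assembled
        self.field = None   # field currently being captured, if any
        self.depth = 0      # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.field is not None:
            self.depth += 1  # nested element inside a captured field
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self.field = self.class_to_field[cls]
                self.depth = 1
                self.current[self.field] = ""
                break

    def handle_endtag(self, tag):
        if self.field is None or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current[self.field] = self.current[self.field].strip()
            self.field = None
            # A post is treated as complete once every field has been seen.
            if len(self.current) == len(self.class_to_field):
                self.posts.append(self.current)
                self.current = {}

    def handle_data(self, data):
        if self.field is not None:
            self.current[self.field] += data


def extract_posts(html, platform):
    """Apply the rule set for `platform` to one page of forum HTML."""
    parser = PostExtractor(RULES[platform])
    parser.feed(html)
    parser.close()
    return parser.posts


def store_posts(conn, posts):
    """Insert extracted posts into a (hypothetical) `posts` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (author TEXT, date TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (author, date, content) VALUES (:author, :date, :content)",
        posts,
    )
    conn.commit()


# Made-up sample page in the hypothetical "platform_a" template.
SAMPLE = """
<div class="post">
  <span class="post-author">alice</span>
  <span class="post-date">2010-03-23 10:15</span>
  <div class="post-body">Has anyone tried <b>this</b> parser?</div>
</div>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_posts(conn, extract_posts(SAMPLE, "platform_a"))
    print(conn.execute("SELECT author, date, content FROM posts").fetchall())
```

Keeping the rules in a plain dictionary separates the platform-specific part from the parsing and storage code, so supporting another template in this sketch would mean adding one entry to `RULES` rather than writing a new parser.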