Exploration of document relation quality with consideration of term representation basis, term weighting and association measure

  • Authors:
  • Nichnan Kittiphattanabawon;Thanaruk Theeramunkong;Ekawit Nantajeewarawat

  • Affiliations:
  • Sirindhorn International Institute of Technology, Thammasat University, Thailand;Sirindhorn International Institute of Technology, Thammasat University, Thailand;Sirindhorn International Institute of Technology, Thammasat University, Thailand

  • Venue:
  • PAISI'10 Proceedings of the 2010 Pacific Asia conference on Intelligence and Security Informatics
  • Year:
  • 2010

Abstract

Tracking and relating news articles from several sources can help counter misinformation from deceptive news stories, since a single source cannot establish whether a piece of information is true. Preventing misinformation in computer systems is an interesting research problem in intelligence and security informatics. Association rule mining has recently been applied to this task because of its performance and scalability. This paper explores how term representation basis, term weighting, and association measure affect the quality of relations discovered among news articles from several sources. Twenty-four combinations, formed from two term representation bases, four term weightings, and three association measures, are explored and their results compared with human judgement. Several evaluations compare the performance of each combination against the others with regard to top-k ranks. The experimental results indicate that the combination of bigram (BG), term frequency with inverse document frequency (TFIDF), and confidence (CONF), as well as the combination of BG, TFIDF, and conviction (CONV), performs best at finding related documents, placing them in the upper ranks with a 0.41% rank-order mismatch on the top-50 mined relations. However, the combination of unigram (UG), TFIDF, and lift (LIFT) performs best at placing irrelevant relations in the lower ranks (top-1100), with a rank-order mismatch of 9.63%.
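
For readers unfamiliar with the three association measures named in the abstract, the following is a minimal sketch of their textbook definitions on unweighted (binary) transactions. The abstract does not give the paper's exact formulation, so several things here are assumptions made purely for illustration: the toy data, the choice to treat each term as a transaction holding the documents it occurs in, and the helper names `support` and `association_measures`. Term weighting such as TFIDF is not modelled.

```python
from math import inf

# Hypothetical toy data: each transaction stands for one term and holds
# the set of documents in which that term occurs (an assumed encoding).
TERM_TRANSACTIONS = [
    {"d1", "d2"},
    {"d1", "d2", "d3"},
    {"d1"},
    {"d2", "d3"},
    {"d3"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def association_measures(transactions, x, y):
    """Textbook confidence, lift and conviction for the rule x -> y."""
    supp_x = support(transactions, [x])
    supp_y = support(transactions, [y])
    supp_xy = support(transactions, [x, y])

    # confidence(x -> y) = supp(x, y) / supp(x)
    conf = supp_xy / supp_x if supp_x else 0.0
    # lift(x -> y) = confidence(x -> y) / supp(y)
    lift = conf / supp_y if supp_y else 0.0
    # conviction(x -> y) = (1 - supp(y)) / (1 - confidence(x -> y)),
    # conventionally infinite when the rule never fails.
    conv = inf if conf == 1.0 else (1.0 - supp_y) / (1.0 - conf)

    return {"CONF": conf, "LIFT": lift, "CONV": conv}


if __name__ == "__main__":
    print(association_measures(TERM_TRANSACTIONS, "d1", "d2"))
```

On this toy data the rule d1 -> d2 has a confidence of about 0.67, a lift of about 1.11, and a conviction of about 1.2. Ranking candidate document pairs by one of these scores gives the kind of top-k list that the abstract compares against human judgement.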