Sifting micro-blogging stream for events of user interest

  • Authors:
  • Maxim Grinev;Maria Grineva;Alexander Boldakov;Leonid Novak;Andrey Syssoev;Dmitry Lizorkin

  • Affiliations:
  • Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed. (all authors)

  • Venue:
  • Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2009

Abstract

Micro-blogging is a new form of social communication that encourages users to share information about anything they are seeing or doing, a motivation facilitated by the ability to post brief text messages through a variety of devices. Twitter, the most popular micro-blogging tool, is exhibiting rapid growth [3]: the share of online Americans using Twitter had risen to 11% by December 2008, compared to 6% in May 2008. Due to its nature, the micro-blogosphere has unique features: (i) it is a source of extremely up-to-date information about what is happening in the world; (ii) it captures the wisdom of millions of people and covers a broad range of domains. These features make the micro-blogosphere more than a popular medium of social communication: we believe that it has additionally become a valuable source of extremely up-to-date news on virtually any subject of user interest. Using the micro-blogosphere in this new role poses the following challenges: (A) Since any given subject is generally mentioned in the micro-blogging stream on a continuous basis, a method is needed for locating periods of news on this subject. (B) Even within such periods, the stream must be filtered to remove noise and to extract the messages that best describe the news. To address these challenges we make and exploit the following observations: (A) For an arbitrary subject, events that catch user interest gain distinguishably more attention than the average mentioning of the subject, resulting in bursts of message activity for it. (B) Most of the messages in an activity burst describe a common event in close variations, either rephrased or "retweeted" between users. We demonstrate TweetSieve, a system that allows obtaining news on any given subject by sifting the Twitter stream. Our work is related to frequency-based analysis applied to blogs [1], but the higher latency and lower coverage of blogs make that analysis less effective than in the case of micro-blogs.
In the TweetSieve demo, the user expresses the subject of her interest as an arbitrary search string. The system shows the periods in which events occur for the subject and outputs the tweets that best describe each of the events. Figure 1 shows a screenshot of the system for "Semantic search" as a sample subject. The underlying process consists of two steps:
  • Identifying activity bursts. A frequency curve is constructed by counting the messages in the stream that match the search string over time. Activity bursts are identified as the periods in which the frequency exceeds the average by more than the standard deviation.
  • Selecting messages that best describe news events. To the set of all messages matching the search string in an activity burst, we apply the message-granular variation of our keyphrase extraction algorithm [2], which is specifically suited to efficiently filtering noisy data. The algorithm clusters messages by their similarity to each other and chooses central messages from the densest clusters. As the similarity measure we use the Jaccard coefficient over the "bag of words" representation of messages.
The demonstration illustrates the potential of our approach in bringing news acquisition to a new level of promptness and coverage range.
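The two steps described above can be illustrated with a minimal Python sketch. It is not the TweetSieve implementation: the function names (find_bursts, jaccard, select_representatives) and the parameters (k, min_sim, top) are hypothetical. The burst threshold of "average plus k standard deviations" is one reading of the rule stated above. The greedy density-based selection is a simple stand-in for the keyphrase extraction algorithm [2], whose details are not given here. Only the Jaccard coefficient over bags of words is taken directly from the text.

```python
from statistics import mean, pstdev


def find_bursts(counts, k=1.0):
    """Return (start, end) index pairs of consecutive time buckets whose
    message count exceeds mean + k * standard deviation (an assumed rule)."""
    if not counts:
        return []
    threshold = mean(counts) + k * pstdev(counts)
    bursts, start = [], None
    for i, count in enumerate(counts):
        if count > threshold:
            if start is None:
                start = i  # a new burst begins here
        elif start is not None:
            bursts.append((start, i - 1))  # the burst ended at the previous bucket
            start = None
    if start is not None:  # the burst runs to the end of the series
        bursts.append((start, len(counts) - 1))
    return bursts


def jaccard(a, b):
    """Jaccard coefficient of two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_representatives(messages, min_sim=0.5, top=1):
    """Greedy stand-in for the clustering step: repeatedly pick the message
    with the most neighbours (similarity >= min_sim), then drop that message
    and its neighbours so the next pick comes from a different group."""
    bags = [set(m.lower().split()) for m in messages]
    remaining = set(range(len(messages)))
    chosen = []

    def neighbours(i):
        return {j for j in remaining if jaccard(bags[i], bags[j]) >= min_sim}

    while remaining and len(chosen) < top:
        # sorted() makes tie-breaking deterministic: the earliest message wins
        best = max(sorted(remaining), key=lambda i: len(neighbours(i)))
        chosen.append(messages[best])
        remaining -= neighbours(best) | {best}
    return chosen
```

In use, hourly counts of tweets matching the search string would be passed to find_bursts, and the tweets falling inside each returned period would be passed to select_representatives. Lowering min_sim merges looser rephrasings into one group, while raising it keeps only near-duplicates such as retweets together.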