Similarity measures for tracking information flow

Authors:
Donald Metzler;Yaniv Bernstein;W. Bruce Croft;Alistair Moffat;Justin Zobel
Affiliations:
University of Massachusetts, Amherst, MA;RMIT University, Melbourne, Australia;University of Massachusetts, Amherst, MA;University of Melbourne, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Citing 9
Cited 44

Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval as statistical translation

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Retrieval and novelty detection at the sentence level

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
The recap system for identifying information flow

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Next steps in near-duplicate detection for eRulemaking

dg.o '06 Proceedings of the 2006 international conference on Digital government research
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
An approach to evaluate policy similarity

Proceedings of the 12th ACM symposium on Access control models and technologies
A comparison of sentence retrieval techniques

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Novelty detection for cross-lingual news stories with visual duplicates and speech transcripts

Proceedings of the 15th international conference on Multimedia
Overview and semantic issues of text mining

ACM SIGMOD Record
Measuring novelty and redundancy with multiple modalities in cross-lingual broadcast news

Computer Vision and Image Understanding
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Identifying Quotations in Reference Works and Primary Materials

ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
The Evaluation of Sentence Similarity Measures

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Utilizing Semantic, Syntactic, and Question Category Information for Automated Digital Reference Services

ICADL '08 Proceedings of the 11th International Conference on Asian Digital Libraries: Universal and Ubiquitous Access to Information
Finding text reuse on the web

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Efficient overlap and content reuse detection in blogs and online news articles

Proceedings of the 18th international conference on World wide web
Addressing the Variability of Natural Language Expression in Sentence Similarity with Semantic Structure of the Sentences

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Syntactic Query Models for Restatement Retrieval

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Automatically selecting answer templates to respond to customer emails

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Organization and Tagging of Blog and News Entries Based on Content Reuse

Journal of Signal Processing Systems
Similarity measures for short segments of text

ECIR'07 Proceedings of the 29th European conference on IR research
Semantic similarity measures for Malay sentences

ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Web news summarization via soft clustering algorithm

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
Estimation of statistical translation models based on mutual information for ad hoc information retrieval

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Evaluating text reuse discovery on the web

Proceedings of the third symposium on Information interaction in context
An improved web information summarization based on SSSC

CAR'10 Proceedings of the 2nd international Asia conference on Informatics in control, automation and robotics - Volume 3
Tracking information flow between primary and secondary news sources

WSA '10 Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media
German encyclopedia alignment based on information retrieval techniques

ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
Automatic detection of local reuse

EC-TEL'10 Proceedings of the 5th European conference on Technology enhanced learning conference on Sustaining TEL: from innovation to learning and practice
Linking online news and social media

Proceedings of the fourth ACM international conference on Web search and data mining
Fixing the threshold for effective detection of near duplicate web documents in web crawling

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
SimPaD: A word-similarity sentence-based plagiarism detection tool on Web documents

Web Intelligence and Agent Systems
An effective approach for searching closest sentence translations from the web

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
Plagiarism detection based on structural information

Proceedings of the 20th ACM international conference on Information and knowledge management
The case of the duplicate documents: measurement, search, and science

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Noise robust detection of the emergence and spread of topics on the web

Proceedings of the 2nd Temporal Web Analytics Workshop
Word length n-grams for text re-use detection

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Recognising sentence similarity using similitude and dissimilarity features

International Journal of Advanced Intelligence Paradigms
Language intent models for inferring user browsing behavior

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Learning hash codes for efficient content reuse detection

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Measuring semantic relatedness using multilingual representations

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
MIRACLE experiments in QA@CLEF 2006 in Spanish: main task, real-time QA and exploratory QA using wikipedia (WiQA)

CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
Position-Aligned translation model for citation recommendation

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Folktale classification using learning to rank

ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
Unsupervised latent concept modeling to identify query facets

Proceedings of the 10th Conference on Open Research Areas in Information Retrieval
Enhancing sentence-level clustering with ranking-based clustering framework for theme-based summarization

Information Sciences: an International Journal

Abstract

Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.