To identify plagiarized passages in large document collections, we present retrieval strategies that rely on stochastic sampling and chunk indexes. Using the entire Wikipedia corpus, we compile n-gram indexes and compare them to a new kind of fingerprint index in a plagiarism analysis use case. Our index speeds up the analysis by a factor of 1.5 and is an order of magnitude smaller, while being equivalent in terms of precision and recall.
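
To make the contrast between an n-gram index and a fingerprint index concrete, here is a minimal Python sketch. It is not the method evaluated in the paper. It assumes word trigrams as chunks and a truncated MD5 digest as a stand-in fingerprint, and every function name in it (`word_ngrams`, `fingerprint`, `build_index`, `candidates`) is hypothetical. The only point illustrated is that an index can be keyed either by the chunk text itself or by a short hash of it, which is one reason a hashed index can be smaller.

```python
import hashlib
from collections import defaultdict


def word_ngrams(text, n=3):
    """Split text into overlapping word n-grams (the 'chunks')."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def fingerprint(chunk, bits=32):
    """Illustrative fingerprint: the first bits/8 bytes of an MD5 digest.

    This is an assumption made for the sketch, not the fingerprint
    construction used in the paper.
    """
    digest = hashlib.md5(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def build_index(docs, n=3, key=lambda chunk: chunk):
    """Map each chunk key to the set of document ids that contain it.

    With the default key, the index stores the n-gram strings themselves;
    with key=fingerprint it stores short integers instead.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in word_ngrams(text, n):
            index[key(chunk)].add(doc_id)
    return index


def candidates(index, suspicious, n=3, key=lambda chunk: chunk):
    """Count, per indexed document, the distinct chunks shared with the query."""
    hits = defaultdict(int)
    for chunk in set(word_ngrams(suspicious, n)):
        for doc_id in index.get(key(chunk), ()):
            hits[doc_id] += 1
    return dict(hits)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "an entirely different sentence about retrieval systems",
}
query = "a quick brown fox jumps over a fence"

ngram_index = build_index(docs)
fp_index = build_index(docs, key=fingerprint)

print(candidates(ngram_index, query))
print(candidates(fp_index, query, key=fingerprint))
```

Both lookups rank the same candidate document here. In a hashed index, distinct chunks can collide on the same key, so a real system would have to check how such collisions affect precision.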