Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Subquadratic approximation algorithms for clustering problems in high dimensional spaces
STOC '99 Proceedings of the thirty-first annual ACM symposium on Theory of computing
Indexing and retrieval of scientific literature
Proceedings of the eighth international conference on Information and knowledge management
Clustering transactions using large items
Proceedings of the eighth international conference on Information and knowledge management
Proceedings of the 5th international conference on Intelligent user interfaces
Finding replicated Web collections
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Min-Wise versus linear independence (extended abstract)
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Agglomerative clustering of a search engine query log
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Information retrieval on the web
ACM Computing Surveys (CSUR)
An adaptive model for optimizing performance of an incremental web crawler
Proceedings of the 10th international conference on World Wide Web
Efficient and tunable similar set retrieval
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Creating a Web community chart for navigating related communities
Proceedings of the 12th ACM conference on Hypertext and Hypermedia
Polynomial-time approximation schemes for geometric min-sum median clustering
Journal of the ACM (JACM)
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Aliasing on the world wide web: prevalence and performance implications
Proceedings of the 11th international conference on World Wide Web
Evaluating strategies for similarity search on the web
Proceedings of the 11th international conference on World Wide Web
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries
Signature extraction for overlap detection in documents
ACSC '02 Proceedings of the twenty-fifth Australasian conference on Computer science - Volume 4
Topic-oriented collaborative crawling
Proceedings of the eleventh international conference on Information and knowledge management
Detecting similar documents using salient terms
Proceedings of the eleventh international conference on Information and knowledge management
Evaluating contents-link coupled web page clustering for web search results
Proceedings of the eleventh international conference on Information and knowledge management
Entropy-based link analysis for mining web informative structures
Proceedings of the eleventh international conference on Information and knowledge management
ACM Computing Surveys (CSUR)
Text Retrieval Systems for the Web
Programming and Computing Software
On computing the diameter of a point set in high dimensional Euclidean space
Theoretical Computer Science
Comparison of Overlap Detection Techniques
ICCS '02 Proceedings of the International Conference on Computational Science-Part I
Future Directions of Communities on the Web
Proceedings of the Joint JSAI 2001 Workshop on New Frontiers in Artificial Intelligence
Parallel and Distributed Document Overlap Detection on the Web
PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
Computing Iceberg Queries Efficiently
VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Link Based Clustering of Web Search Results
WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
Extracting Large-Scale Knowledge Bases from the Web
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Discovery of Emerging Topics between Communities on WWW
WI '01 Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development
Scalable Hierarchical Clustering Method for Sequences of Categorical Values
PAKDD '01 Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining
ECDL '01 Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries
Estimating Resemblance of MIDI Documents
ALENEX '01 Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation
A Derandomization Using Min-Wise Independent Permutations
RANDOM '98 Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science
Identifying and Filtering Near-Duplicate Documents
CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Discovery of Web Communities Based on the Co-Occurrence of References
DS '00 Proceedings of the Third International Conference on Discovery Science
Web Information Retrieval - an Algorithmic Perspective
ESA '00 Proceedings of the 8th Annual European Symposium on Algorithms
SainSE: An Intelligent Search Engine Based on WWW Structure Analysis
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
On Combining Link and Contents Information for Web Page Clustering
DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
WWW '03 Proceedings of the 12th international conference on World Wide Web
A large-scale study of the evolution of web pages
WWW '03 Proceedings of the 12th international conference on World Wide Web
Algorithmic aspects of information retrieval on the web
Handbook of massive data sets
Searching large text collections
Handbook of massive data sets
Clustering in massive data sets
Handbook of massive data sets
Approximation schemes for clustering problems
Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
On finding common neighborhoods in massive graphs
Theoretical Computer Science
An Approximate L1-Difference Algorithm for Massive Data Streams
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Pastiche: making backup cheap and easy
ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Index Compression through Document Reordering
DCC '02 Proceedings of the Data Compression Conference
TreeJuxtaposer: scalable tree comparison using Focus+Context with guaranteed visibility
ACM SIGGRAPH 2003 Papers
Searching the hypermedia web: improved topic distillation through network analytic relevance ranking
The New Review of Hypermedia and Multimedia - Hypermedia and the world wide web
Improving web search by the identification of contextual information
Intelligent exploration of the web
A derandomization using min-wise independent permutations
Journal of Discrete Algorithms
Mining Web Informative Structures and Contents Based on Entropy Analysis
IEEE Transactions on Knowledge and Data Engineering
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Online duplicate document detection: signature reliability in a dynamic retrieval environment
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Java and information retrieval from the Internet
PPPJ '03 Proceedings of the 2nd international conference on Principles and practice of programming in Java
Computational Linguistics - Special issue on web as corpus
Category cluster discovery from distributed WWW directories
Information Sciences—Informatics and Computer Science: An International Journal - special issue: Knowledge discovery from distributed information sources
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web
Sic transit gloria telae: towards an understanding of the web's decay
Proceedings of the 13th international conference on World Wide Web
Automatic detection of fragments in dynamically generated web pages
Proceedings of the 13th international conference on World Wide Web
A large-scale study of the evolution of web pages
Software—Practice & Experience - Special issue: Web technologies
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Scaling IR-system evaluation using term relevance sets
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Constructing a text corpus for inexact duplicate detection
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Improved robustness of signature-based near-replica detection via lexicon randomization
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Image similarity search with compact data structures
Proceedings of the thirteenth ACM international conference on Information and knowledge management
WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model
IEEE Transactions on Knowledge and Data Engineering
Deep Store: An Archival Storage System Architecture
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Pastiche: making backup cheap and easy
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Low distortion embeddings for edit distance
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
WWW '05 Proceedings of the 14th international conference on World Wide Web
Scaling link-based similarity search
WWW '05 Proceedings of the 14th international conference on World Wide Web
LSH forest: self-tuning indexes for similarity search
WWW '05 Proceedings of the 14th international conference on World Wide Web
Server-friendly delta compression for efficient web access
Web content caching and distribution
Identifying link farm spam pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Trend detection through temporal link analysis
Journal of the American Society for Information Science and Technology - Special issue: Webometrics
Near-duplicate detection for eRulemaking
dg.o '05 Proceedings of the 2005 national conference on Digital government research
Downloading textual hidden web content through keyword queries
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching
IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web
IEEE Transactions on Knowledge and Data Engineering
Detecting malicious network traffic using inverse distributions of packet contents
Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data
Finding similar files in large document repositories
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery and data mining
Comparison of texts streams in the presence of mild adversaries
ACSW Frontiers '05 Proceedings of the 2005 Australasian workshop on Grid computing and e-research - Volume 44
Discovering large dense subgraphs in massive graphs
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Characterizing a national community web
ACM Transactions on Internet Technology (TOIT)
RaceTrack: efficient detection of data race conditions via adaptive tracking
Proceedings of the twentieth ACM symposium on Operating systems principles
Similarity measures for tracking information flow
Proceedings of the 14th ACM international conference on Information and knowledge management
Redundant documents and search effectiveness
Proceedings of the 14th ACM international conference on Information and knowledge management
Automatic construction of multifaceted browsing interfaces
Proceedings of the 14th ACM international conference on Information and knowledge management
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
A linear time algorithm for approximate 2-means clustering
Computational Geometry: Theory and Applications
From words to corpora: recognizing translation
EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents
Journal of the American Society for Information Science and Technology - Research Articles
Site level noise removal for search engines
Proceedings of the 15th international conference on World Wide Web
The web beyond popularity: a really simple system for web scale RSS
Proceedings of the 15th international conference on World Wide Web
What's really new on the web?: identifying new pages from a series of unstable web snapshots
Proceedings of the 15th international conference on World Wide Web
Computer Networks: The International Journal of Computer and Telecommunications Networking
Undue influence: eliminating the impact of link plagiarism on web search rankings
Proceedings of the 2006 ACM symposium on Applied computing
Next steps in near-duplicate detection for eRulemaking
dg.o '06 Proceedings of the 2006 international conference on Digital government research
Stable distributions, pseudorandom generators, embeddings, and data stream computation
Journal of the ACM (JACM)
Dynamic test collections: measuring search effectiveness on the live web
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Core algorithms in the CLEVER system
ACM Transactions on Internet Technology (TOIT)
Evaluation of crawling policies for a web-repository crawler
Proceedings of the seventeenth conference on Hypertext and hypermedia
Lazy preservation: reconstructing websites by crawling the crawlers
WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
The query-vector document model
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Duplicate Record Detection: A Survey
IEEE Transactions on Knowledge and Data Engineering
Improving duplicate elimination in storage systems
ACM Transactions on Storage (TOS)
Ferret: a toolkit for content-based similarity search of feature-rich data
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Clustering e-commerce search engines based on their search interface pages using WISE-cluster
Data & Knowledge Engineering - Special issue: WIDM 2004
Accurate discovery of co-derivative documents via duplicate text detection
Information Systems
Efficient plagiarism detection for large code repositories
Software—Practice & Experience
Page-level template detection via isotonic smoothing
Proceedings of the 16th international conference on World Wide Web
Do not crawl in the dust: different urls with similar text
Proceedings of the 16th international conference on World Wide Web
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Detectives: detecting coalition hit inflation attacks in advertising networks streams
Proceedings of the 16th international conference on World Wide Web
Consistency-preserving caching of dynamic database content
Proceedings of the 16th international conference on World Wide Web
Efficient search in large textual collections with redundancy
Proceedings of the 16th international conference on World Wide Web
Extraction and classification of dense communities in the web
Proceedings of the 16th international conference on World Wide Web
Computational Linguistics
Improving mobile database access over wide-area networks without degrading consistency
Proceedings of the 5th international conference on Mobile systems, applications and services
Design, implementation, and evaluation of duplicate transfer detection in HTTP
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Factors affecting website reconstruction from the web infrastructure
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Rate of change and other metrics: a live study of the world wide web
USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems
Deducing similarities in Java sources from bytecodes
ATEC '98 Proceedings of the USENIX Annual Technical Conference
Distributed text retrieval from overlapping collections
ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Essential deduplication functions for transactional databases in law firms
Proceedings of the 11th international conference on Artificial intelligence and law
Multiple-signal duplicate detection for search evaluation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Principles of hash-based text retrieval
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Combinatorial algorithms for web search engines: three success stories
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Low distortion embeddings for edit distance
Journal of the ACM (JACM)
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations
Computational Linguistics
Clustering near-duplicate images in large collections
Proceedings of the international workshop on multimedia information retrieval
Large data methods for multimedia
Proceedings of the 15th international conference on Multimedia
Foundations and Trends in Web Science
High performance index build algorithms for intranet search engines
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Tracking Web spam with HTML style similarities
ACM Transactions on the Web (TWEB)
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
A scalable pattern mining approach to web graph compression with communities
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Near-replicas of web pages detection efficient algorithm based on single MD5 fingerprint
ICAI'07 Proceedings of the 8th WSEAS International Conference on Automation and Information - Volume 8
Hyperspaces for object clustering and approximate matching in peer-to-peer overlays
HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
Implementation and performance evaluation of fuzzy file block matching
ATC'07 Proceedings of the 2007 USENIX Annual Technical Conference
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
A graph-theoretic approach to webpage segmentation
Proceedings of the 17th international conference on World Wide Web
Recrawl scheduling based on information longevity
Proceedings of the 17th international conference on World Wide Web
iRobot: an intelligent crawler for web forums
Proceedings of the 17th international conference on World Wide Web
Web graph similarity for anomaly detection (poster)
Proceedings of the 17th international conference on World Wide Web
Sketching in adversarial environments
STOC '08 Proceedings of the fortieth annual ACM symposium on Theory of computing
Characterizing botnets from email spam records
LEET'08 Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats
Towards breaking the quality curse: a web-querying approach to web people search
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Exploring traversal strategy for web forum crawling
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Improving web information indexing and retrieval based on center block duplication detection
International Journal of Innovative Computing and Applications
Lexicon randomization for near-duplicate detection with I-Match
The Journal of Supercomputing
Efficient semi-streaming algorithms for local triangle counting in massive graphs
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
De-duping URLs via rewrite rules
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
An efficient parallel approach for identifying protein families in large-scale metagenomic data sets
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Simple Algorithms for Predicate Suggestions Using Similarity and Co-occurrence
ESWC '07 Proceedings of the 4th European conference on The Semantic Web: Research and Applications
Generating Fuzzy Equivalence Classes on RSS News Articles for Retrieving Correlated Information
ICCSA '08 Proceedings of the international conference on Computational Science and Its Applications, Part II
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
Proceedings of the VLDB Endowment
Achieving both high precision and high recall in near-duplicate detection
Proceedings of the 17th ACM conference on Information and knowledge management
Winnowing-based text clustering
Proceedings of the 17th ACM conference on Information and knowledge management
Sindice.com: a document-oriented lookup index for open linked data
International Journal of Metadata, Semantics and Ontologies
Do not crawl in the DUST: Different URLs with similar text
ACM Transactions on the Web (TWEB)
Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design
SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Search personalization through query and page topical analysis
User Modeling and User-Adapted Interaction
A novel efficient classification algorithm for search engines
AIC'08 Proceedings of the 8th conference on Applied informatics and communications
Extraction and classification of dense implicit communities in the Web graph
ACM Transactions on the Web (TWEB)
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
IEICE - Transactions on Information and Systems
Detecting the origin of text segments efficiently
Proceedings of the 18th international conference on World wide web
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
The design of a similarity based deduplication system
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
IRLbot: Scaling to 6 billion pages and beyond
ACM Transactions on the Web (TWEB)
Botnet spam campaigns can be long lasting: evidence, implications, and analysis
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Applying syntactic similarity algorithms for enterprise information management
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Bringing your dead links back to life: a comprehensive approach and lessons learned
Proceedings of the 20th ACM conference on Hypertext and hypermedia
Web History Tools and Revisitation Support: A Survey of Existing Approaches and Directions
Foundations and Trends in Human-Computer Interaction
A method for measuring the evolution of a topic on the Web: The case of “informetrics”
Journal of the American Society for Information Science and Technology
Large linguistically-processed web corpora for multiple languages
EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations
Combinatorial Framework for Similarity Search
SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
Automatic retrieval of similar content using search engine query interface
Proceedings of the 18th ACM conference on Information and knowledge management
Computer Networks: The International Journal of Computer and Telecommunications Networking
Linear-time approximation schemes for clustering problems in any dimensions
Journal of the ACM (JACM)
Exploiting Sentence-Level Features for Near-Duplicate Document Detection
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Granular Computing for Text Mining: New Research Challenges and Opportunities
RSFDGrC '09 Proceedings of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing
Anchor text extraction for academic search
NLPIR4DL '09 Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries
Electronic Notes in Theoretical Computer Science (ENTCS)
An incremental clustering scheme for data de-duplication
Data Mining and Knowledge Discovery
Bridging the gap: from multi document Template Detection to single document Content Extraction
EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
Leveraging temporal dynamics of document content in relevance ranking
Proceedings of the third ACM international conference on Web search and data mining
On compressing the textual web
Proceedings of the third ACM international conference on Web search and data mining
A sketch-based distance oracle for web-scale graphs
Proceedings of the third ACM international conference on Web search and data mining
Teaching web information retrieval to undergraduates
Proceedings of the 41st ACM technical symposium on Computer science education
Foundations and Trends in Information Retrieval
ACM Transactions on Information Systems (TOIS)
Detecting visually similar Web pages: Application to phishing detection
ACM Transactions on Internet Technology (TOIT)
Efficient indexing of versioned document sequences
ECIR'07 Proceedings of the 29th European conference on IR research
Fast approximate duplicate detection for 2D-NMR spectra
DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
An exhaustive and edge-removal algorithm to find cores in implicit communities
APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
A pattern tree-based approach to learning URL normalization rules
Proceedings of the 19th international conference on World wide web
Proceedings of the 19th international conference on World wide web
Improving the efficiency of dynamic malware analysis
Proceedings of the 2010 ACM Symposium on Applied Computing
How should we solve search problems privately?
CRYPTO'07 Proceedings of the 27th annual international cryptology conference on Advances in cryptology
Organizing news archives by near-duplicate copy detection in digital libraries
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Understanding content reuse on the web: static and dynamic analyses
WebKDD'06 Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis
Clustering template based web documents
ECIR'08 Proceedings of the 30th European conference on IR research: Advances in information retrieval
Mining Query Logs: Turning Search Usage Data into Knowledge
Foundations and Trends in Information Retrieval
Weighted shingling: an adaptation of shingling for weighted shingles
IIT'09 Proceedings of the 6th international conference on Innovations in information technology
Sampling dirty data for matching attributes
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Proceedings of the 21st ACM conference on Hypertext and hypermedia
Distributed discovery of large near-cliques
DISC'09 Proceedings of the 23rd international conference on Distributed computing
Wiki trust metrics based on phrasal analysis
WikiSym '08 Proceedings of the 4th International Symposium on Wikis
Efficient similarity estimation for systems exploiting data redundancy
INFOCOM'10 Proceedings of the 29th conference on Information communications
Caching search engine results over incremental indices
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Adaptive near-duplicate detection via similarity learning
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
A data-parallel toolkit for information retrieval
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient algorithms for large-scale local triangle counting
ACM Transactions on Knowledge Discovery from Data (TKDD)
A Framework for Large-Scale Detection of Web Site Defacements
ACM Transactions on Internet Technology (TOIT)
Estimating set intersection using small samples
ACSC '10 Proceedings of the Thirty-Third Australasian Conference on Computer Science - Volume 102
On locality-sensitive indexing in generic metric spaces
Proceedings of the Third International Conference on SImilarity Search and APplications
A lightweight privacy preserving SMS-based recommendation system for mobile users
Proceedings of the fourth ACM conference on Recommender systems
Scalable and systematic detection of buggy inconsistencies in source code
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Hierarchical service analytics for improving productivity in an enterprise service center
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
On the k-independence required by linear probing and minwise independence
ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
A hierarchical adaptive probabilistic approach for zero hour phish detection
ESORICS'10 Proceedings of the 15th European conference on Research in computer security
Automatic detection of local reuse
EC-TEL'10 Proceedings of the 5th European conference on Technology enhanced learning conference on Sustaining TEL: from innovation to learning and practice
Scalable discovery of best clusters on large graphs
Proceedings of the VLDB Endowment
Graph homomorphism revisited for graph matching
Proceedings of the VLDB Endowment
Detecting duplicate web documents using clickthrough data
Proceedings of the fourth ACM international conference on Web search and data mining
Learning website hierarchies for keyword enrichment in contextual advertising
Proceedings of the fourth ACM international conference on Web search and data mining
Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Exponential time improvement for min-wise based algorithms
Information and Computation
Filtering artificial texts with statistical machine learning techniques
Language Resources and Evaluation
Counting triangles and the curse of the last reducer
Proceedings of the 20th international conference on World wide web
Spam detection in online classified advertisements
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Clustering with internal connectedness
WALCOM'11 Proceedings of the 5th international conference on WALCOM: algorithms and computation
Foundations and Trends in Information Retrieval
Foundations and Trends in Information Retrieval
PRESIDIO: A Framework for Efficient Archival Data Storage
ACM Transactions on Storage (TOS)
Exploiting similarity for multi-source downloads using file handprints
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Theory and applications of b-bit minwise hashing
Communications of the ACM
Which version is this?: improving the desktop experience within a copy-aware computing ecosystem
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Omnify: investigating the visibility and effectiveness of copyright monitors
PAM'11 Proceedings of the 12th international conference on Passive and active measurement
Efficient exact edit similarity query processing with the asymmetric signature scheme
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Nova: continuous Pig/Hadoop workflows
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A unified representation of web logs for mining applications
Information Retrieval
Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)
Duplicate page detection algorithm based on the field characteristic clustering
ICWL'10 Proceedings of the 2010 international conference on New horizons in web-based learning
Query by document via a decomposition-based two-level retrieval approach
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Ontology-driven personalized query refinement
Journal of Web Engineering
On the evolution of clusters of near-duplicate web pages
Journal of Web Engineering
Retrieving similar documents from the web
Journal of Web Engineering
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Fast locality-sensitive hashing
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient centrality monitoring for time-evolving graphs
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
Towards the effective temporal association mining of spam blacklists
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Efficient duplicate detection on cloud using a new signature scheme
WAIM'11 Proceedings of the 12th international conference on Web-age information management
Web text data mining for building large scale language modelling corpus
TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Partial duplicate detection for large book collections
Proceedings of the 20th ACM international conference on Information and knowledge management
Content-driven detection of campaigns in social media
Proceedings of the 20th ACM international conference on Information and knowledge management
Probabilistic near-duplicate detection using simhash
Proceedings of the 20th ACM international conference on Information and knowledge management
CoDet: sentence-based containment detection in news corpora
Proceedings of the 20th ACM international conference on Information and knowledge management
Mining relational structure from millions of books: position paper
Proceedings of the 4th ACM workshop on Online books, complementary social media and crowdsourcing
Detection of near-duplicate user generated contents: the SMS spam collection
Proceedings of the 3rd international workshop on Search and mining user-generated contents
An evaluation of provenance-based near-duplicates detection
International Journal of Knowledge and Web Intelligence
Blind publication: a copyright library without publication or trust
Proceedings of the 11th international conference on Security Protocols
Discovery of image versions in large collections
MMM'07 Proceedings of the 13th International conference on Multimedia Modeling - Volume Part II
Measuring redundancy level on the web
AINTEC '11 Proceedings of the 7th Asian Internet Engineering Conference
An OpenMP algorithm and implementation for clustering biological graphs
Proceedings of the first workshop on Irregular applications: architectures and algorithms
LSH-preserving functions and their applications
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
A precise metric for measuring how much web pages change
DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
A systematic study of parameter correlations in large scale duplicate document detection
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Enhancing duplicate collection detection through replica boundary discovery
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Linear time algorithms for clustering problems in any dimensions
ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming
Optimizing K2 trees: A case for validating the maturity of network of practices
Computers & Mathematics with Applications
Compact features for detection of near-duplicates in distributed retrieval
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Classifying web data in directory structures
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Effective criteria for web page changes
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Indexing shared content in information retrieval systems
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Overcoming browser cookie churn with clustering
Proceedings of the fifth ACM international conference on Web search and data mining
IR system evaluation using nugget-based test collections
Proceedings of the fifth ACM international conference on Web search and data mining
How user behavior is related to social affinity
Proceedings of the fifth ACM international conference on Web search and data mining
Temporal shingling for version identification in web archives
ECIR'10 Proceedings of the 32nd European conference on Advances in Information Retrieval
Web directory construction using lexical chains
NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
Exponential time improvement for min-wise based algorithms
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
A web document classification approach based on fuzzy association concept
FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part I
Bayesian locality sensitive hashing for fast similarity search
Proceedings of the VLDB Endowment
The compass filter: search engine result personalization using web communities
ITWP'03 Proceedings of the 2003 international conference on Intelligent Techniques for Web Personalization
Factors affecting web page similarity
ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
On approximation algorithms for data mining applications
Efficient Approximation and Online Algorithms
Privacy-sensitive VM retrospection
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
An effective searching method using the example-based query
ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Clustering near-identical sequences for fast homology search
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
ICE - Intelligent Clustering Engine: A clustering gadget for Google Desktop
Expert Systems with Applications: An International Journal
Fast discovery of similar sequences in large genomic collections
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
A fusion of algorithms in near duplicate document detection
PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
Clustering and load balancing optimization for redundant content removal
Proceedings of the 21st international conference companion on World Wide Web
GPU-based minwise hashing
Proceedings of the 21st international conference companion on World Wide Web
Survey on web spam detection: principles and algorithms
ACM SIGKDD Explorations Newsletter
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors
Proceedings of the VLDB Endowment
Exploring temporal evidence in web information retrieval
FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
Image and video searching on the world wide web
IM'99 Proceedings of the 1999 international conference on Challenge of Image Retrieval
CRSI: a compact randomized similarity index for set-valued features
Proceedings of the 15th International Conference on Extending Database Technology
Replica-aware caching for Web proxies
Computer Communications
Survey: Urban pervasive applications: Challenges, scenarios and case studies
Computer Science Review
Multi-resolution similarity hashing
Digital Investigation: The International Journal of Digital Forensics & Incident Response
Revisiting reverts: accurate revert detection in wikipedia
Proceedings of the 23rd ACM conference on Hypertext and social media
Making a scene: alignment of complete sets of clips based on pairwise audio match
Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
Sketching in Adversarial Environments
SIAM Journal on Computing
Index maintenance for time-travel text search
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Detecting quilted web pages at scale
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Finding translations in scanned book collections
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Measuring semantic relatedness using multilingual representations
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
A model of uncertainty for near-duplicates in document reference networks
ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
Constructing test collections by inferring document relevance via extracted relevant information
Proceedings of the 21st ACM international conference on Information and knowledge management
Learning to rank duplicate bug reports
Proceedings of the 21st ACM international conference on Information and knowledge management
Fast near neighbor search in high-dimensional binary data
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
PaperVis: literature review made easy
EuroVis'11 Proceedings of the 13th Eurographics / IEEE - VGTC conference on Visualization
Crawling deep web entity pages
Proceedings of the sixth ACM international conference on Web search and data mining
Robust plagiary detection using semantic compression augmented SHAPD
ICCCI'12 Proceedings of the 4th international conference on Computational Collective Intelligence: technologies and applications - Volume Part I
String similarity measures and joins with synonyms
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
International Journal of Applied Cryptography
Reducing information redundancy in search results
Proceedings of the 28th Annual ACM Symposium on Applied Computing
HmSearch: an efficient hamming distance query processing algorithm
Proceedings of the 25th International Conference on Scientific and Statistical Database Management
An analysis of socware cascades in online social networks
Proceedings of the 22nd international conference on World Wide Web
Groundhog day: near-duplicate detection on Twitter
Proceedings of the 22nd international conference on World Wide Web
Bottom-k and priority sampling, set similarity and subset sums with minimal independence
Proceedings of the forty-fifth annual ACM symposium on Theory of computing
A distributed framework for scaling Up LSH-based computations in privacy preserving record linkage
Proceedings of the 6th Balkan Conference in Informatics
Server interface descriptions for automated testing of JavaScript web applications
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering
Near duplicate detection in an academic digital library
Proceedings of the 2013 ACM symposium on Document engineering
A structure free self-adaptive piecewise hashing algorithm for spam filtering
Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service
Spatial min-Hash for similar image search
Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service
Searching similar segments over textual event sequences
Proceedings of the 22nd ACM international conference on Information & knowledge management
Asymmetric signature schemes for efficient exact edit similarity query processing
ACM Transactions on Database Systems (TODS)
b-bit minwise hashing in practice
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Campaign extraction from social media
ACM Transactions on Intelligent Systems and Technology (TIST) - Special Section on Intelligent Mobile Knowledge Discovery and Management Systems and Special Issue on Social Web Mining
Dimension independent similarity computation
The Journal of Machine Learning Research
XXXtortion?: inferring registration intent in the .XXX TLD
Proceedings of the 23rd international conference on World wide web
Efficient estimation for high similarities using odd sketches
Proceedings of the 23rd international conference on World wide web