Syntactic clustering of the Web

Authors:
Andrei Z. Broder;Steven C. Glassman;Mark S. Manasse;Geoffrey Zweig
Affiliations:
-;-;-;-
Venue:
Selected papers from the sixth international conference on World Wide Web
Year:
1997

Citing 0
Cited 366

Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Subquadratic approximation algorithms for clustering problems in high dimensional spaces

STOC '99 Proceedings of the thirty-first annual ACM symposium on Theory of computing
Indexing and retrieval of scientific literature

Proceedings of the eighth international conference on Information and knowledge management
Clustering transactions using large items

Proceedings of the eighth international conference on Information and knowledge management
Enhancing information retrieval by automatic acquisition of textual relations using genetic programming

Proceedings of the 5th international conference on Intelligent user interfaces
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Min-Wise versus linear independence (extended abstract)

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Agglomerative clustering of a search engine query log

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Information retrieval on the web

ACM Computing Surveys (CSUR)
An adaptive model for optimizing performance of an incremental web crawler

Proceedings of the 10th international conference on World Wide Web
Efficient and tunable similar set retrieval

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Creating a Web community chart for navigating related communities

Proceedings of the 12th ACM conference on Hypertext and Hypermedia
Polynomial-time approximation schemes for geometric min-sum median clustering

Journal of the ACM (JACM)
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Aliasing on the world wide web: prevalence and performance implications

Proceedings of the 11th international conference on World Wide Web
Evaluating strategies for similarity search on the web

Proceedings of the 11th international conference on World Wide Web
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
Collection synthesis

Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries
Signature extraction for overlap detection in documents

ACSC '02 Proceedings of the twenty-fifth Australasian conference on Computer science - Volume 4
Topic-oriented collaborative crawling

Proceedings of the eleventh international conference on Information and knowledge management
Detecting similar documents using salient terms

Proceedings of the eleventh international conference on Information and knowledge management
Evaluating contents-link coupled web page clustering for web search results

Proceedings of the eleventh international conference on Information and knowledge management
Entropy-based link analysis for mining web informative structures

Proceedings of the eleventh international conference on Information and knowledge management
A survey of Web metrics

ACM Computing Surveys (CSUR)
Text Retrieval Systems for the Web

Programming and Computer Software
On computing the diameter of a point set in high dimensional Euclidean space

Theoretical Computer Science
Comparison of Overlap Detection Techniques

ICCS '02 Proceedings of the International Conference on Computational Science-Part I
Future Directions of Communities on the Web

Proceedings of the Joint JSAI 2001 Workshop on New Frontiers in Artificial Intelligence
Parallel and Distributed Document Overlap Detection on the Web

PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
Computing Iceberg Queries Efficiently

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Link Based Clustering of Web Search Results

WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
Extracting Large-Scale Knowledge Bases from the Web

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Discovery of Emerging Topics between Communities on WWW

WI '01 Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development
Scalable Hierarchical Clustering Method for Sequences of Categorical Values

PAKDD '01 Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining
Using Copy-Detection and Text Comparison Algorithms for Cross-Referencing Multiple Editions of Literary Works

ECDL '01 Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries
Estimating Resemblance of MIDI Documents

ALENEX '01 Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation
A Derandomization Using Min-Wise Independent Permutations

RANDOM '98 Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science
Identifying and Filtering Near-Duplicate Documents

CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Discovery of Web Communities Based on the Co-Occurrence of References

DS '00 Proceedings of the Third International Conference on Discovery Science
Web Information Retrieval - an Algorithmic Perspective

ESA '00 Proceedings of the 8th Annual European Symposium on Algorithms
SainSE: An Intelligent Search Engine Based on WWW Structure Analysis

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
On Combining Link and Contents Information for Web Page Clustering

DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Searching the workplace web

WWW '03 Proceedings of the 12th international conference on World Wide Web
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
Algorithmic aspects of information retrieval on the web

Handbook of massive data sets
Searching large text collections

Handbook of massive data sets
Clustering in massive data sets

Handbook of massive data sets
Approximation schemes for clustering problems

Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
On finding common neighborhoods in massive graphs

Theoretical Computer Science
An Approximate L1-Difference Algorithm for Massive Data Streams

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Pastiche: making backup cheap and easy

ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
Winnowing: local algorithms for document fingerprinting

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Index Compression through Document Reordering

DCC '02 Proceedings of the Data Compression Conference
TreeJuxtaposer: scalable tree comparison using Focus+Context with guaranteed visibility

ACM SIGGRAPH 2003 Papers
Searching the hypermedia web: improved topic distillation through network analytic relevance ranking

The New Review of Hypermedia and Multimedia - Hypermedia and the world wide web
Improving web search by the identification of contextual information

Intelligent exploration of the web
A derandomization using min-wise independent permutations

Journal of Discrete Algorithms
Mining Web Informative Structures and Contents Based on Entropy Analysis

IEEE Transactions on Knowledge and Data Engineering
Eliminating noisy information in Web pages for data mining

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Java and information retrieval from the Internet

PPPJ '03 Proceedings of the 2nd international conference on Principles and practice of programming in Java
The Web as a parallel corpus

Computational Linguistics - Special issue on web as corpus
Category cluster discovery from distributed WWW directories

Information Sciences—Informatics and Computer Science: An International Journal - special issue: Knowledge discovery from distributed information sources
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
What's new on the web?: the evolution of the web from a search engine perspective

Proceedings of the 13th international conference on World Wide Web
Sic transit gloria telae: towards an understanding of the web's decay

Proceedings of the 13th international conference on World Wide Web
Automatic detection of fragments in dynamically generated web pages

Proceedings of the 13th international conference on World Wide Web
A large-scale study of the evolution of web pages

Software—Practice & Experience - Special issue: Web technologies
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Subquadratic Approximation Algorithms for Clustering Problems in High Dimensional Spaces

Machine Learning
Scaling IR-system evaluation using term relevance sets

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Constructing a text corpus for inexact duplicate detection

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Image similarity search with compact data structures

Proceedings of the thirteenth ACM international conference on Information and knowledge management
WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model

IEEE Transactions on Knowledge and Data Engineering
Deep Store: An Archival Storage System Architecture

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Pastiche: making backup cheap and easy

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Low distortion embeddings for edit distance

Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
User-centric Web crawling

WWW '05 Proceedings of the 14th international conference on World Wide Web
Scaling link-based similarity search

WWW '05 Proceedings of the 14th international conference on World Wide Web
LSH forest: self-tuning indexes for similarity search

WWW '05 Proceedings of the 14th international conference on World Wide Web
Server-friendly delta compression for efficient web access

Web content caching and distribution
Identifying link farm spam pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
The volume and evolution of web page templates

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Trend detection through temporal link analysis

Journal of the American Society for Information Science and Technology - Special issue: Webometrics
Near-duplicate detection for eRulemaking

dg.o '05 Proceedings of the 2005 national conference on Digital government research
Downloading textual hidden web content through keyword queries

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching

IEEE Transactions on Knowledge and Data Engineering
Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources

IEEE Transactions on Knowledge and Data Engineering
Analysis of source identified text corpora: exploring the statistics of the reused text and authorship

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web

IEEE Transactions on Knowledge and Data Engineering
Detecting malicious network traffic using inverse distributions of packet contents

Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data
Finding similar files in large document repositories

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Comparison of texts streams in the presence of mild adversaries

ACSW Frontiers '05 Proceedings of the 2005 Australasian workshop on Grid computing and e-research - Volume 44
Discovering large dense subgraphs in massive graphs

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Characterizing a national community web

ACM Transactions on Internet Technology (TOIT)
RaceTrack: efficient detection of data race conditions via adaptive tracking

Proceedings of the twentieth ACM symposium on Operating systems principles
Similarity measures for tracking information flow

Proceedings of the 14th ACM international conference on Information and knowledge management
Redundant documents and search effectiveness

Proceedings of the 14th ACM international conference on Information and knowledge management
Automatic construction of multifaceted browsing interfaces

Proceedings of the 14th ACM international conference on Information and knowledge management
Phishing Webpage Detection

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
A linear time algorithm for approximate 2-means clustering

Computational Geometry: Theory and Applications
From words to corpora: recognizing translation

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Journal of the American Society for Information Science and Technology - Research Articles
Site level noise removal for search engines

Proceedings of the 15th international conference on World Wide Web
The web beyond popularity: a really simple system for web scale RSS

Proceedings of the 15th international conference on World Wide Web
What's really new on the web?: identifying new pages from a series of unstable web snapshots

Proceedings of the 15th international conference on World Wide Web
A short walk in the Blogistan

Computer Networks: The International Journal of Computer and Telecommunications Networking
Undue influence: eliminating the impact of link plagiarism on web search rankings

Proceedings of the 2006 ACM symposium on Applied computing
Next steps in near-duplicate detection for eRulemaking

dg.o '06 Proceedings of the 2006 international conference on Digital government research
Stable distributions, pseudorandom generators, embeddings, and data stream computation

Journal of the ACM (JACM)
Dynamic test collections: measuring search effectiveness on the live web

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Core algorithms in the CLEVER system

ACM Transactions on Internet Technology (TOIT)
Evaluation of crawling policies for a web-repository crawler

Proceedings of the seventeenth conference on Hypertext and hypermedia
Lazy preservation: reconstructing websites by crawling the crawlers

WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
The query-vector document model

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Improving duplicate elimination in storage systems

ACM Transactions on Storage (TOS)
Ferret: a toolkit for content-based similarity search of feature-rich data

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Clustering e-commerce search engines based on their search interface pages using WISE-cluster

Data & Knowledge Engineering - Special issue: WIDM 2004
Accurate discovery of co-derivative documents via duplicate text detection

Information Systems
Efficient plagiarism detection for large code repositories

Software—Practice & Experience
Page-level template detection via isotonic smoothing

Proceedings of the 16th international conference on World Wide Web
Do not crawl in the dust: different urls with similar text

Proceedings of the 16th international conference on World Wide Web
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Detectives: detecting coalition hit inflation attacks in advertising networks streams

Proceedings of the 16th international conference on World Wide Web
Consistency-preserving caching of dynamic database content

Proceedings of the 16th international conference on World Wide Web
Efficient search in large textual collections with redundancy

Proceedings of the 16th international conference on World Wide Web
Extraction and classification of dense communities in the web

Proceedings of the 16th international conference on World Wide Web
Googleology is Bad Science

Computational Linguistics
Improving mobile database access over wide-area networks without degrading consistency

Proceedings of the 5th international conference on Mobile systems, applications and services
Design, implementation, and evaluation of duplicate transfer detection in HTTP

NSDI'04 Proceedings of the 1st Symposium on Networked Systems Design and Implementation - Volume 1
Factors affecting website reconstruction from the web infrastructure

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Rate of change and other metrics: a live study of the world wide web

USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems
Deducing similarities in Java sources from bytecodes

ATEC '98 Proceedings of the USENIX Annual Technical Conference
Distributed text retrieval from overlapping collections

ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Essential deduplication functions for transactional databases in law firms

Proceedings of the 11th international conference on Artificial intelligence and law
Multiple-signal duplicate detection for search evaluation

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Principles of hash-based text retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Combinatorial algorithms for web search engines: three success stories

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Low distortion embeddings for edit distance

Journal of the ACM (JACM)
Eliminating fuzzy duplicates in data warehouses

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations

Computational Linguistics
Clustering near-duplicate images in large collections

Proceedings of the international workshop on multimedia information retrieval
Large data methods for multimedia

Proceedings of the 15th international conference on Multimedia
A framework for web science

Foundations and Trends in Web Science
High performance index build algorithms for intranet search engines

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Tracking Web spam with HTML style similarities

ACM Transactions on the Web (TWEB)
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of the ACM - 50th anniversary issue: 1958 - 2008
A scalable pattern mining approach to web graph compression with communities

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Near-replicas of web pages detection efficient algorithm based on single MD5 fingerprint

ICAI'07 Proceedings of the 8th WSEAS International Conference on Automation and Information - Volume 8
Hyperspaces for object clustering and approximate matching in peer-to-peer overlays

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
Implementation and performance evaluation of fuzzy file block matching

ATC'07 Proceedings of the 2007 USENIX Annual Technical Conference
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
A graph-theoretic approach to webpage segmentation

Proceedings of the 17th international conference on World Wide Web
Recrawl scheduling based on information longevity

Proceedings of the 17th international conference on World Wide Web
iRobot: an intelligent crawler for web forums

Proceedings of the 17th international conference on World Wide Web
Web graph similarity for anomaly detection (poster)

Proceedings of the 17th international conference on World Wide Web
Sketching in adversarial environments

STOC '08 Proceedings of the fortieth annual ACM symposium on Theory of computing
Characterizing botnets from email spam records

LEET'08 Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats
Towards breaking the quality curse: a web-querying approach to web people search

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Exploring traversal strategy for web forum crawling

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Improving web information indexing and retrieval based on center block duplication detection

International Journal of Innovative Computing and Applications
Lexicon randomization for near-duplicate detection with I-Match

The Journal of Supercomputing
Efficient semi-streaming algorithms for local triangle counting in massive graphs

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
De-duping URLs via rewrite rules

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
An efficient parallel approach for identifying protein families in large-scale metagenomic data sets

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Simple Algorithms for Predicate Suggestions Using Similarity and Co-occurrence

ESWC '07 Proceedings of the 4th European conference on The Semantic Web: Research and Applications
Generating Fuzzy Equivalence Classes on RSS News Articles for Retrieving Correlated Information

ICCSA '08 Proceedings of the international conference on Computational Science and Its Applications, Part II
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Achieving both high precision and high recall in near-duplicate detection

Proceedings of the 17th ACM conference on Information and knowledge management
Winnowing-based text clustering

Proceedings of the 17th ACM conference on Information and knowledge management
Sindice.com: a document-oriented lookup index for open linked data

International Journal of Metadata, Semantics and Ontologies
Do not crawl in the DUST: Different URLs with similar text

ACM Transactions on the Web (TWEB)
Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Search personalization through query and page topical analysis

User Modeling and User-Adapted Interaction
A novel efficient classification algorithm for search engines

AIC'08 Proceedings of the 8th conference on Applied informatics and communications
Extraction and classification of dense implicit communities in the Web graph

ACM Transactions on the Web (TWEB)
Annotate once, appear anywhere: collective foraging for snippets of interest using paragraph fingerprinting

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Monotone Increasing Binary Similarity and Its Application to Automatic Document-Acquisition of a Category

IEICE - Transactions on Information and Systems
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
Substring Statistics

CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Social spam detection

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints

ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
The design of a similarity based deduplication system

SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
IRLbot: Scaling to 6 billion pages and beyond

ACM Transactions on the Web (TWEB)
Botnet spam campaigns can be long lasting: evidence, implications, and analysis

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Applying syntactic similarity algorithms for enterprise information management

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Bringing your dead links back to life: a comprehensive approach and lessons learned

Proceedings of the 20th ACM conference on Hypertext and hypermedia
Web History Tools and Revisitation Support: A Survey of Existing Approaches and Directions

Foundations and Trends in Human-Computer Interaction
A method for measuring the evolution of a topic on the Web: The case of “informetrics”

Journal of the American Society for Information Science and Technology
Large linguistically-processed web corpora for multiple languages

EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations
Combinatorial Framework for Similarity Search

SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
Automatic retrieval of similar content using search engine query interface

Proceedings of the 18th ACM conference on Information and knowledge management
Linear-time approximation schemes for clustering problems in any dimensions

Journal of the ACM (JACM)
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Granular Computing for Text Mining: New Research Challenges and Opportunities

RSFDGrC '09 Proceedings of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing
Anchor text extraction for academic search

NLPIR4DL '09 Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries
How Recent is a Web Document?

Electronic Notes in Theoretical Computer Science (ENTCS)
An incremental clustering scheme for data de-duplication

Data Mining and Knowledge Discovery
Bridging the gap: from multi document Template Detection to single document Content Extraction

EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
Leveraging temporal dynamics of document content in relevance ranking

Proceedings of the third ACM international conference on Web search and data mining
On compressing the textual web

Proceedings of the third ACM international conference on Web search and data mining
A sketch-based distance oracle for web-scale graphs

Proceedings of the third ACM international conference on Web search and data mining
Teaching web information retrieval to undergraduates

Proceedings of the 41st ACM technical symposium on Computer science education
Web Crawling

Foundations and Trends in Information Retrieval
Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load

ACM Transactions on Information Systems (TOIS)
Detecting visually similar Web pages: Application to phishing detection

ACM Transactions on Internet Technology (TOIT)
Efficient indexing of versioned document sequences

ECIR'07 Proceedings of the 29th European conference on IR research
Fast approximate duplicate detection for 2D-NMR spectra

DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
An exhaustive and edge-removal algorithm to find cores in implicit communities

APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
A pattern tree-based approach to learning URL normalization rules

Proceedings of the 19th international conference on World wide web
b-Bit minwise hashing

Proceedings of the 19th international conference on World wide web
Improving the efficiency of dynamic malware analysis

Proceedings of the 2010 ACM Symposium on Applied Computing
How should we solve search problems privately?

CRYPTO'07 Proceedings of the 27th annual international cryptology conference on Advances in cryptology
Organizing news archives by near-duplicate copy detection in digital libraries

ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Understanding content reuse on the web: static and dynamic analyses

WebKDD'06 Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis
Clustering template based web documents

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Mining Query Logs: Turning Search Usage Data into Knowledge

Foundations and Trends in Information Retrieval
Weighted shingling: an adaptation of shingling for weighted shingles

IIT'09 Proceedings of the 6th international conference on Innovations in information technology
Sampling dirty data for matching attributes

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Is this a good title?

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Distributed discovery of large near-cliques

DISC'09 Proceedings of the 23rd international conference on Distributed computing
Wiki trust metrics based on phrasal analysis

WikiSym '08 Proceedings of the 4th International Symposium on Wikis
Efficient similarity estimation for systems exploiting data redundancy

INFOCOM'10 Proceedings of the 29th conference on Information communications
Caching search engine results over incremental indices

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
A data-parallel toolkit for information retrieval

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient algorithms for large-scale local triangle counting

ACM Transactions on Knowledge Discovery from Data (TKDD)
A Framework for Large-Scale Detection of Web Site Defacements

ACM Transactions on Internet Technology (TOIT)
Estimating set intersection using small samples

ACSC '10 Proceedings of the Thirty-Third Australasian Conference on Computer Science - Volume 102
On locality-sensitive indexing in generic metric spaces

Proceedings of the Third International Conference on SImilarity Search and APplications
A lightweight privacy preserving SMS-based recommendation system for mobile users

Proceedings of the fourth ACM conference on Recommender systems
Scalable and systematic detection of buggy inconsistencies in source code

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Hierarchical service analytics for improving productivity in an enterprise service center

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
On the k-independence required by linear probing and minwise independence

ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
A hierarchical adaptive probabilistic approach for zero hour phish detection

ESORICS'10 Proceedings of the 15th European conference on Research in computer security
Automatic detection of local reuse

EC-TEL'10 Proceedings of the 5th European conference on Technology enhanced learning conference on Sustaining TEL: from innovation to learning and practice
Scalable discovery of best clusters on large graphs

Proceedings of the VLDB Endowment
Graph homomorphism revisited for graph matching

Proceedings of the VLDB Endowment
Detecting duplicate web documents using clickthrough data

Proceedings of the fourth ACM international conference on Web search and data mining
Learning website hierarchies for keyword enrichment in contextual advertising

Proceedings of the fourth ACM international conference on Web search and data mining
Fixing the threshold for effective detection of near duplicate web documents in web crawling

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Exponential time improvement for min-wise based algorithms

Information and Computation
Filtering artificial texts with statistical machine learning techniques

Language Resources and Evaluation
Counting triangles and the curse of the last reducer

Proceedings of the 20th international conference on World wide web
Spam detection in online classified advertisements

Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Clustering with internal connectedness

WALCOM'11 Proceedings of the 5th international conference on WALCOM: algorithms and computation
Adversarial Web Search

Foundations and Trends in Information Retrieval
Federated Search

Foundations and Trends in Information Retrieval
PRESIDIO: A Framework for Efficient Archival Data Storage

ACM Transactions on Storage (TOS)
Exploiting similarity for multi-source downloads using file handprints

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Theory and applications of b-bit minwise hashing

Communications of the ACM
Which version is this?: improving the desktop experience within a copy-aware computing ecosystem

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Omnify: investigating the visibility and effectiveness of copyright monitors

PAM'11 Proceedings of the 12th international conference on Passive and active measurement
Efficient exact edit similarity query processing with the asymmetric signature scheme

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Nova: continuous Pig/Hadoop workflows

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A unified representation of web logs for mining applications

Information Retrieval
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
Duplicate page detection algorithm based on the field characteristic clustering

ICWL'10 Proceedings of the 2010 international conference on New horizons in web-based learning
Query by document via a decomposition-based two-level retrieval approach

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Ontology-driven personalized query refinement

Journal of Web Engineering
On the evolution of clusters of near-duplicate web pages

Journal of Web Engineering
Retrieving similar documents from the web

Journal of Web Engineering
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Fast locality-sensitive hashing

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient centrality monitoring for time-evolving graphs

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
Towards the effective temporal association mining of spam blacklists

Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Efficient duplicate detection on cloud using a new signature scheme

WAIM'11 Proceedings of the 12th international conference on Web-age information management
Web text data mining for building large scale language modelling corpus

TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Partial duplicate detection for large book collections

Proceedings of the 20th ACM international conference on Information and knowledge management
Content-driven detection of campaigns in social media

Proceedings of the 20th ACM international conference on Information and knowledge management
Probabilistic near-duplicate detection using simhash

Proceedings of the 20th ACM international conference on Information and knowledge management
CoDet: sentence-based containment detection in news corpora

Proceedings of the 20th ACM international conference on Information and knowledge management
Mining relational structure from millions of books: position paper

Proceedings of the 4th ACM workshop on Online books, complementary social media and crowdsourcing
Detection of near-duplicate user generated contents: the SMS spam collection

Proceedings of the 3rd international workshop on Search and mining user-generated contents
An evaluation of provenance-based near-duplicates detection

International Journal of Knowledge and Web Intelligence
Blind publication: a copyright library without publication or trust

Proceedings of the 11th international conference on Security Protocols
Discovery of image versions in large collections

MMM'07 Proceedings of the 13th International conference on Multimedia Modeling - Volume Part II
Measuring redundancy level on the web

AINTEC '11 Proceedings of the 7th Asian Internet Engineering Conference
An OpenMP algorithm and implementation for clustering biological graphs

Proceedings of the first workshop on Irregular applications: architectures and algorithms
LSH-preserving functions and their applications

Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
A precise metric for measuring how much web pages change

DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
A systematic study of parameter correlations in large scale duplicate document detection

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Enhancing duplicate collection detection through replica boundary discovery

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Linear time algorithms for clustering problems in any dimensions

ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming
Optimizing K2 trees: A case for validating the maturity of network of practices

Computers & Mathematics with Applications
Compact features for detection of near-duplicates in distributed retrieval

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Classifying web data in directory structures

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Effective criteria for web page changes

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Indexing shared content in information retrieval systems

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Overcoming browser cookie churn with clustering

Proceedings of the fifth ACM international conference on Web search and data mining
IR system evaluation using nugget-based test collections

Proceedings of the fifth ACM international conference on Web search and data mining
How user behavior is related to social affinity

Proceedings of the fifth ACM international conference on Web search and data mining
Temporal shingling for version identification in web archives

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Web directory construction using lexical chains

NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
Exponential time improvement for min-wise based algorithms

Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
A web document classification approach based on fuzzy association concept

FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part I
Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment
The compass filter: search engine result personalization using web communities

ITWP'03 Proceedings of the 2003 international conference on Intelligent Techniques for Web Personalization
Factors affecting web page similarity

ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
On approximation algorithms for data mining applications

Efficient Approximation and Online Algorithms
Privacy-sensitive VM retrospection

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
An effective searching method using the example-based query

ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Clustering near-identical sequences for fast homology search

RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
ICE - Intelligent Clustering Engine: A clustering gadget for Google Desktop

Expert Systems with Applications: An International Journal
Fast discovery of similar sequences in large genomic collections

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
A fusion of algorithms in near duplicate document detection

PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
GPU-based minwise hashing

Proceedings of the 21st international conference companion on World Wide Web
Survey on web spam detection: principles and algorithms

ACM SIGKDD Explorations Newsletter
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
Exploring temporal evidence in web information retrieval

FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
Image and video searching on the world wide web

IM'99 Proceedings of the 1999 international conference on Challenge of Image Retrieval
CRSI: a compact randomized similarity index for set-valued features

Proceedings of the 15th International Conference on Extending Database Technology
Replica-aware caching for Web proxies

Computer Communications
Survey: Urban pervasive applications: Challenges, scenarios and case studies

Computer Science Review
Multi-resolution similarity hashing

Digital Investigation: The International Journal of Digital Forensics & Incident Response
Revisiting reverts: accurate revert detection in wikipedia

Proceedings of the 23rd ACM conference on Hypertext and social media
Making a scene: alignment of complete sets of clips based on pairwise audio match

Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
Sketching in Adversarial Environments

SIAM Journal on Computing
Index maintenance for time-travel text search

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Detecting quilted web pages at scale

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Finding translations in scanned book collections

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Measuring semantic relatedness using multilingual representations

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
A model of uncertainty for near-duplicates in document reference networks

ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
Constructing test collections by inferring document relevance via extracted relevant information

Proceedings of the 21st ACM international conference on Information and knowledge management
Learning to rank duplicate bug reports

Proceedings of the 21st ACM international conference on Information and knowledge management
Fast near neighbor search in high-dimensional binary data

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
PaperVis: literature review made easy

EuroVis'11 Proceedings of the 13th Eurographics / IEEE - VGTC conference on Visualization
Crawling deep web entity pages

Proceedings of the sixth ACM international conference on Web search and data mining
Robust plagiary detection using semantic compression augmented SHAPD

ICCCI'12 Proceedings of the 4th international conference on Computational Collective Intelligence: technologies and applications - Volume Part I
String similarity measures and joins with synonyms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Secure computation of functionalities based on Hamming distance and its application to computing document similarity

International Journal of Applied Cryptography
Reducing information redundancy in search results

Proceedings of the 28th Annual ACM Symposium on Applied Computing
HmSearch: an efficient hamming distance query processing algorithm

Proceedings of the 25th International Conference on Scientific and Statistical Database Management
An analysis of socware cascades in online social networks

Proceedings of the 22nd international conference on World Wide Web
Groundhog day: near-duplicate detection on Twitter

Proceedings of the 22nd international conference on World Wide Web
Bottom-k and priority sampling, set similarity and subset sums with minimal independence

Proceedings of the forty-fifth annual ACM symposium on Theory of computing
A distributed framework for scaling Up LSH-based computations in privacy preserving record linkage

Proceedings of the 6th Balkan Conference in Informatics
Server interface descriptions for automated testing of JavaScript web applications

Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering
Near duplicate detection in an academic digital library

Proceedings of the 2013 ACM symposium on Document engineering
A structure free self-adaptive piecewise hashing algorithm for spam filtering

Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service
Spatial min-Hash for similar image search

Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service
Searching similar segments over textual event sequences

Proceedings of the 22nd ACM international conference on information & knowledge management
Asymmetric signature schemes for efficient exact edit similarity query processing

ACM Transactions on Database Systems (TODS)
b-bit minwise hashing in practice

Proceedings of the 5th Asia-Pacific Symposium on Internetware
Campaign extraction from social media

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Section on Intelligent Mobile Knowledge Discovery and Management Systems and Special Issue on Social Web Mining
Dimension independent similarity computation

The Journal of Machine Learning Research
XXXtortion?: inferring registration intent in the .XXX TLD

Proceedings of the 23rd international conference on World wide web
Efficient estimation for high similarities using odd sketches

Proceedings of the 23rd international conference on World wide web

Quantified Score

Hi-index	0.02

Syntactic clustering of the Web

Abstract