We propose link-based techniques for the automatic detection of Web spam, a term referring to pages that use deceptive techniques to obtain undeservedly high scores in search engines. Web spam is widespread and hard to combat, mostly because the sheer size of the Web makes many algorithms infeasible in practice. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page, applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers that consider only the link structure of the Web, regardless of page contents. We then study the performance of each classifier alone, as well as their combined performance, by testing them on a large collection of Web link spam. Under tenfold cross-validation, our best classifiers perform comparably to state-of-the-art spam classifiers that use content attributes, while remaining orthogonal to content-based methods.
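To make the idea of probabilistic counting over a link graph concrete, the following is a minimal sketch, not the method used in the paper. It assumes a Flajolet-Martin style bitmask sketch: every node starts with one random bit set, and in each round a node ORs in the sketches of the pages that link to it. After `max_dist` rounds, the position of the lowest unset bit gives a rough estimate of how many distinct pages can reach the node within that many hops. The function names, the parameters (`num_trials`, `num_bits`, `seed`), and the correction constant 0.77351 from the Flajolet-Martin analysis are choices made for this illustration only.

```python
import random


def lowest_zero_bit(mask):
    """Return the index of the least significant bit of mask that is 0."""
    i = 0
    while mask & (1 << i):
        i += 1
    return i


def estimate_supporters(edges, num_nodes, max_dist,
                        num_trials=64, num_bits=32, seed=0):
    """Estimate, for each node, the number of distinct nodes (itself included)
    that can reach it in at most max_dist hops.

    edges is an iterable of (source, target) pairs over nodes 0..num_nodes-1.
    This is an illustrative Flajolet-Martin style sketch, hypothetical API.
    """
    rng = random.Random(seed)

    # in_links[v] lists every u that has a link u -> v.
    in_links = [[] for _ in range(num_nodes)]
    for u, v in edges:
        in_links[v].append(u)

    def random_bit():
        # Bit i is chosen with probability 2^-(i+1), capped at num_bits - 1.
        i = 0
        while i < num_bits - 1 and rng.random() < 0.5:
            i += 1
        return 1 << i

    # One independent sketch per trial and per node.
    sketches = [[random_bit() for _ in range(num_nodes)]
                for _ in range(num_trials)]

    # Each round extends the covered neighborhood by one hop.
    for _ in range(max_dist):
        for t in range(num_trials):
            old = sketches[t]
            new = list(old)
            for v in range(num_nodes):
                for u in in_links[v]:
                    new[v] |= old[u]
            sketches[t] = new

    # Average the lowest-zero-bit index over trials, then invert.
    estimates = []
    for v in range(num_nodes):
        mean_r = sum(lowest_zero_bit(sketches[t][v])
                     for t in range(num_trials)) / num_trials
        estimates.append(2 ** mean_r / 0.77351)
    return estimates


if __name__ == "__main__":
    # Toy graph: nodes 1..1000 all link to node 0.
    star = [(u, 0) for u in range(1, 1001)]
    print(estimate_supporters(star, 1001, max_dist=1)[0])
```

Each round touches every link once per trial, so the cost grows linearly with the number of links, and each node stores only `num_trials` small bitmasks. That is what makes this style of counting practical on very large graphs. The estimates are noisy, especially for small neighborhoods, so they are better suited as classifier features than as exact counts.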