Statistical cross-language Web content quality assessment

Authors:
Guang-Gang Geng;Li-Ming Wang;Wei Wang;An-Lei Hu;Shuo Shen
Affiliations:
China Internet Network Information Center, Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, PR China (all authors)
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

Cross-language Web content quality assessment plays an important role in many Web content processing applications. In previous research, statistical systems based on natural language processing, heuristic content, and term frequency-inverse document frequency (TF-IDF) features have proven effective for Web content quality assessment. However, these features are language-dependent and therefore unsuitable for cross-language ranking. This paper proposes a cross-language Web content quality assessment method. First, multi-modal language-independent features are extracted: character features, domain registration features, two-layer hyperlink analysis features, and third-party Web service features. All the extracted features are then fused, and feature selection is carried out on the fused features to obtain a new eigenspace. Finally, a cross-language Web content quality model is learned on this eigenspace. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets demonstrate that features at every scale are discriminative, that the different feature modalities are complementary to each other, and that feature selection is effective for statistical-learning-based cross-language Web content quality assessment.
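
The extract, fuse, and select steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the concrete character statistics, the fusion scheme, or the selection criterion, so the three character statistics, the concatenation-based fusion, and the median-split information gain ranking below are all assumptions, and every function name is hypothetical. The final model-learning step is left out.

```python
import math
from collections import Counter


def char_features(text):
    """Language-independent character statistics (assumed examples)."""
    n = len(text) or 1  # avoid division by zero on empty input
    return [
        len(text),                                # raw length in characters
        sum(ch.isdigit() for ch in text) / n,     # fraction of digits
        sum(ch.isspace() for ch in text) / n,     # fraction of whitespace
    ]


def fuse(*modalities):
    """Early fusion (assumed): concatenate per-modality feature vectors."""
    fused = []
    for vec in modalities:
        fused.extend(vec)
    return fused


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(column, labels):
    """Gain from splitting one numeric feature at its median (assumed criterion)."""
    threshold = sorted(column)[len(column) // 2]
    left = [y for x, y in zip(column, labels) if x < threshold]
    right = [y for x, y in zip(column, labels) if x >= threshold]
    if not left or not right:
        return 0.0  # a constant or degenerate feature carries no information
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


def select_features(rows, labels, k):
    """Return the indices of the k fused features with the highest gain."""
    columns = list(zip(*rows))
    gains = [information_gain(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: gains[i], reverse=True)
    return sorted(ranked[:k])
```

In use, each page would contribute one fused row, for example `fuse(char_features(page_text), registration_vec, link_vec)`, where `registration_vec` and `link_vec` are placeholders for the other modalities. The indices returned by `select_features` then define the reduced feature space on which a classifier or ranker would be trained.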