The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
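To make the pairing of stochastic gradient descent and ensemble methods concrete, the following is a minimal sketch in plain Python, not the Pig-based system the abstract describes. It assumes a logistic regression learner updated one example at a time, trains one model per data partition (each partition standing in for the input of one parallel task), and combines the models by averaging their predicted probabilities. All function names, the synthetic data, the learning rate, and the partition count are illustrative assumptions and do not come from the paper.

```python
import math
import random


def sgd_logistic(examples, dims, lr=0.1, epochs=5):
    """Fit logistic regression weights with per-example (online) SGD updates."""
    w = [0.0] * dims
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the log-likelihood for a single example.
            for i in range(dims):
                w[i] += lr * (y - p) * x[i]
    return w


def predict_proba(w, x):
    """Probability of the positive class under weights w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_ensemble(examples, dims, num_partitions):
    """Train one independent model per partition of the data.

    Each partition is a stand-in for the data seen by one parallel task;
    here the partitions are simply processed one after another.
    """
    partitions = [examples[i::num_partitions] for i in range(num_partitions)]
    return [sgd_logistic(part, dims) for part in partitions]


def ensemble_predict(models, x):
    """Combine the models by averaging their predicted probabilities."""
    avg = sum(predict_proba(w, x) for w in models) / len(models)
    return 1 if avg >= 0.5 else 0


def make_data(n, seed):
    """Synthetic, linearly separable data (hypothetical, for illustration only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = 1 if x1 + x2 > 0 else 0
        data.append(([1.0, x1, x2], label))  # leading 1.0 is the bias feature
    return data


train = make_data(600, seed=0)
test = make_data(200, seed=1)
models = train_ensemble(train, dims=3, num_partitions=3)
accuracy = sum(ensemble_predict(models, x) == y for x, y in test) / len(test)
print(f"ensemble of {len(models)} models, test accuracy = {accuracy:.3f}")
```

Because each model only ever reads its own partition, the training calls are independent of one another, which is the property that makes this combination a plausible fit for scaling out. Averaging probabilities is only one possible combination rule, chosen here for brevity.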