Over the past few years, many governmental and non-profit organizations have begun releasing a wide variety of datasets to the public. These datasets, commonly referred to as open data, typically cover subject areas such as transportation, weather, economics, health and the environment. This freely available data provides an excellent opportunity to expose database students to "real-world" data. "Real-world" datasets are rarely pristine: they are frequently poorly formatted, poorly designed and poorly documented, and they may even contain inconsistencies. It is important for database students to be exposed to these types of datasets and to learn how to work with them. The purpose of this paper is to introduce the reader to some useful NYC datasets and to describe how the author used them in an Introduction to Database Systems course.