Instructors: Foster Provost and Brian Dalessandro

Teaching Assistant: TBD

Lectures (two sections): Tuesdays, 6 pm to 9 pm, or Wednesdays, 6 pm to 9 pm; room TBD

NYU Classes: TBD

Office Hours

TBD

The course description below is taken from Fall 2014. It will be updated prior to Fall.

Businesses, governments, and individuals create massive collections of data as a by-product of their day-to-day activity. Increasingly, decision-makers and systems rely on intelligent technology to analyze data systematically in order to improve decision-making. In many cases automating analytical and decision-making processes is necessary because of the volume of data and the speed with which new data are generated.

We will examine how data analysis can be used to improve decision-making. We will study the fundamental principles and techniques of data mining, and we will examine real-world examples and cases to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science. In addition, we will work “hands-on” with the Python programming language and its associated data analysis libraries.

After taking this course you should:

  1. Understand what a data scientist is. The roles of a professional data scientist come in many flavors. After this class you will be able to identify where you fit in the data science spectrum.
  2. Approach applicable problems data-analytically. Think carefully and systematically about whether and how data can improve decision making across a wide set of applications.
  3. Have had hands-on experience mining data. Be prepared to follow up on ideas or opportunities that present themselves using common computer programming tools.

Course Outline

The course will be organized by themes, with lectures presenting the relevant theory and working examples of each topic.

Section 1 – Understanding Data and Data Science

The initial lectures will try to define data science and its relationship to data-driven decision making. We’ll cover the types of roles data scientists play in industrial contexts and the requisite skills for each role. We’ll survey the entire data mining process at a high level, and then focus initially on data management, validation, and exploration. Most industrial data systems are designed for transactional efficiency and not for analysis, and we’ll discuss what this means for the data scientist.

Example section materials:

2014 Intro to DS Module 2 – What is data science?

2014 Intro to DS Module 4 – Data Mining Overview

Section 2 – Learning from Data

Most data has inherent structure, and it is the data scientist’s primary responsibility to learn that structure and use it to guide strategic decision making. In this section we’ll develop a toolbox for learning structure in both supervised and unsupervised contexts. Each tool will be taught with its appropriate theoretical formulation as well as with practical tips on implementation within Python. Each lecture will present illustrated examples of how the given data mining tools can be used in relevant contexts.

Example section materials:

2014 Intro to DS Module 7 – Finding Structure

2014 Intro to DS Module 10 – Decision Trees

Section 3 – Practicing the “Science” of Data Science

No problem has only one possible solution. Every design decision (from the features and algorithms to the evaluation metrics) is a hypothesis with its own costs and benefits. In this set of lectures, having learned the appropriate set of technical tools, we’ll establish a discipline for testing multiple design hypotheses and choosing the one that performs best while satisfying the given constraints. We’ll practice the art of translating problem statements into technical design decisions (such as choosing an evaluation metric that is appropriate given the goal of the task).

Section 4 – Special Topics in Applying Data Science

Once we’ve established protocols for using the tools of data science to solve problems optimally, we’ll cover special topics that often arise in practice. Example topics include how many of the standard approaches to problem solving change in the presence of Big Data, and how we can go beyond establishing correlations within data and begin to understand causal relationships between events. We’ll also give students the opportunity to practice a core skill within data science: written and oral communication.