### DS-GA-1001: Introduction to Data Science

**Course credits:** 3

**Year of the Curriculum:** One

**Semester:** Fall

**Instructor:** Professor Foster, Provost (*Section .001*)

**Instructor:** Professor Perlich, Claudia (*Section .002*)

Introduces students to the fundamental principles of data science that underlie the algorithms, processes, methods, and data-analytic thinking. Introduces students to algorithms and tools based on these principles. Introduces frameworks to support problem-focused data-analytic thinking.

**Course aims and objectives:**

After taking this class, students should:

- Approach business problems data-analytically. Think carefully and systematically about whether and how data can improve a particular application, both to understand a phenomenon better and, especially, to make better-informed decisions and automated decisions.
- Understand fundamental principles of data science, such as using data to get information about an unknown quantity of interest, calculating and using data similarity, fitting models to data, supervised and unsupervised modeling, overfitting and its avoidance, evaluation and model analytics, visualization, predictive modeling, causal inference, the data mining process, problem decomposition, data science strategy, solution deployment, and more.
- Be able to apply the most important data science methods, using open-source tools.

**Prerequisites:**

- Basic probability or statistics (undergraduate level)
- Linear Algebra
- Some experience in programming: Java, C, C++, Python, Perl, or similar languages, equivalent to two introductory courses in programming, such as “Introduction to Programming” and “Data Structures and Algorithms.”

**Co-requisite:**

- “Programming for Data Science” (waived with adequate experience; as decided by MSDS program administration)

### DS-GA-1002: Statistical and Mathematical Methods

**Course credits:** 3

**Year of the Curriculum:** One

**Semester:** Fall

**Instructor:** Professor Varadhan, S.R. Srinivasa

This course briefly introduces basic statistical and mathematical methods needed in the practice of data science. It covers basic methods in probability, statistics, linear algebra, and optimization.

**Course aims and objectives:**

- Teach the basics of statistics and probability.
- Teach basic methods for solving linear systems and eigensystems, and demonstrate their use in regression and data representation.
- Teach basic methods for multivariate function optimization (e.g., gradient descent), and demonstrate their use in non-linear regression.

**Prerequisites:**

- Undergraduate-level course in probability or statistics
- Calculus I
- Linear Algebra
- Some experience in programming: Java, C, C++, Python, R, Lua, Ruby, OCaml or similar languages, equivalent to two introductory courses in programming, such as “Introduction to Programming” and “Data Structures and Algorithms.”
- Some prerequisites may be waived with permission from the instructor.

### DS-GA-1003: Machine Learning and Computational Statistics

**Course credits:** 3

**Year of the Curriculum:** One

**Semester:** Spring

**Instructor:** Professor Sontag, David

The course covers a wide variety of topics in machine learning, pattern recognition, statistical modeling, and neural computation. It covers the mathematical methods and theoretical aspects, but primarily focuses on algorithmic and practical issues.

**Course aims and objectives:**

- Teach intermediate topics in machine learning
- Provide hands-on experience in designing and programming data science algorithms

**Prerequisites:**

- DS-GA-1001: Introduction to Data Science, or an undergraduate course in machine learning.
- Some experience in programming: Java, C, C++, Python, R, Lua, Ruby, OCaml or similar languages, equivalent to two introductory courses in programming, such as “Introduction to Programming” and “Data Structures and Algorithms.”
- Some prerequisites may be waived with permission from the instructor.

### DS-GA-1004: Big Data

**Course credits:** 3

**Year of the Curriculum:** One

**Semester:** Spring

**Instructor:** Professor Freire, Juliana

This course covers methods and tools for automatic knowledge extraction from very large datasets. Methods include on-line learning, feature hashing, class embedding, distributed databases, the map-reduce framework, CUDA GPU programming, and applications.

**Course aims and objectives:**

- Teach techniques and approaches that are relevant when the data sets get very large.
- Give practical, hands-on experience with big data tools, such as map-reduce, parallel programming, on-line learning, and hashing methods.

**Prerequisites:**

- DS-GA-1001: Introduction to Data Science or equivalent undergraduate course
- DS-GA-1002: Statistical and Mathematical Methods
- Some prerequisites may be waived with permission from the instructor.

### DS-GA-1005: Inference and Representation

**Course credits:** 3

**Year of the Curriculum:** Two

**Semester:** Fall

**Instructor:** Professor Sontag, David

This course covers graphical models, causal inference, and advanced topics in statistical machine learning.

**Course aims and objectives:**

- Teach exact and approximate inference methods in graphical models.
- Teach learning techniques for graphical models and structured prediction.
- Teach methods for causal inference.

**Prerequisites:**

- DS-GA-1003: Machine Learning and Computational Statistics

### DS-GA-1006: Capstone Project and Presentation in Data Science

**Course credits:** 3

**Year of the Curriculum:** Two

**Semester:** Fall

The purpose of the capstone project is for students to apply theoretical knowledge acquired during the program to a real project involving actual data in a realistic setting. During the project, students engage in the entire process of solving a real-world data science problem, from collecting and processing actual data to applying a suitable analytic method to the problem. Both the problem statements for the project assignments and the datasets originate from real-world domains similar to those that students might typically encounter within industry, government, NGO, or academic research.

Depending upon a project’s complexity, students work individually or in small teams on a problem statement typically specified by an industry or governmental sponsor, using a data set provided by the sponsor. Academic, governmental, and NGO research groups (both within and external to NYU) may also propose projects. A list of projects will be posted early in the semester so that students can align themselves with problem statements corresponding to their individual interests. As the project and problem statements warrant, students may be permitted to organize into teams of two to three participants. Teams larger than three will be considered for approval on a case-by-case basis. The final problem statements and the composition of the teams will be approved by the Course Director in coordination with any relevant faculty advisor and the sponsor’s assigned representative (i.e., the sponsoring Project Coach).

Each project team will be supervised by the Course Director (in some cases with a relevant faculty advisor) and advised by the Project Coach assigned by the academic, governmental, NGO, or industry sponsor.

**Here are two illustrative projects:**

- A large insurance company has an anonymized dataset of workers’ compensation claimants. The insurance claims dataset incorporates corresponding data, e.g., claimant demographics and claims payments. A team of capstone students, advised by the instructor in conjunction with a technical coach from the company, employs the dataset to develop and implement an analytic solution using software tools studied in previous courses.
- A professor from the Department of Politics has a dataset of tweets from individuals, with some indication of the party affiliation of each individual. Students use text classification methods studied in class, in conjunction with social network tools to sort tweeters, to build a system that can predict party affiliation and voting behavior from tweets.

**Course aims and objectives:**

- Students will demonstrate an ability to handle a problem in data science from the point of problem definition through delivery of a solution. In doing so, they will demonstrate proficiency in collecting and processing real-world data, in designing the best methods to solve the problem, and in implementing a solution.
- Students will demonstrate competence in presenting material by delivering two presentations: a proposal on how to approach the problem and their final solution.
- Students will learn how to work in small teams by working with at least one other student on their project.
- Students will write a report on their project for evaluation by the instructor in consultation with the project advisors. The report will be structured as a typical research paper, and hence will include three main sections: 1. motivation and problem definition, existing approaches to the problem; 2. proposed solution; 3. results, conclusion, and directions for future work.

**Prerequisites:**

- Successful completion of DS-GA-1001: Introduction to Data Science, DS-GA-1002: Statistical and Mathematical Methods, DS-GA-1003: Machine Learning and Computational Statistics, and DS-GA-1004: Big Data; or permission of the instructor, based on having successfully completed similar course work or gained experience in hands-on projects.