DS-GA-1001: Introduction to Data Science

Course credits: 3
Year of the Curriculum:
One
Semester:
Fall 

Introduces students to basic algorithms and software tools, and teaches how to handle and represent data and how to apply sound methodology. Provides hands-on experience using Torch, a software system being developed at NYU and other research centers that has a large data science library.

Course aims and objectives:

After taking this class, students should:

  • Approach business problems data-analytically. Think carefully and systematically about whether and how data can improve a particular application, to understand a phenomenon better and, especially, to make better-informed decisions and automated decisions.
  • Understand fundamental principles of data science, such as using data to get information about an unknown quantity of interest, calculating and using data similarity, fitting models to data, supervised and unsupervised modeling, overfitting and its avoidance, evaluation and model analytics, visualization, predictive modeling, causal inference, the data mining process, problem decomposition, data science strategy, solution deployment, and more.
  • Be able to apply the most important data science methods, using open-source tools.

Prerequisites:

  • Basic probability or statistics (undergraduate level)
  • Linear algebra
  • Some experience in programming: Java, C, C++, Python, Perl, or similar languages, equivalent to two introductory courses in programming, such as “Introduction to Programming” and “Data Structures and Algorithms.”

DS-GA-1002: Probability and Statistics for Data Science (formerly Statistical and Mathematical Methods)

Course credits: 3
Year of the Curriculum:
One
Semester:
Fall

This course introduces fundamental concepts in probability and statistics from a data-science perspective. The aim is for students to become familiar with probabilistic models and statistical methods that are widely used in data analysis.

Prerequisites:

  • Calculus I
  • Linear algebra
  • Basic programming skills

DS-GA-1003: Machine Learning and Computational Statistics

Course credits: 3
Year of the Curriculum:
One
Semester:
Spring

The course covers a wide variety of topics in machine learning, pattern recognition, statistical modeling, and neural computation. It covers the mathematical methods and theoretical aspects, but primarily focuses on algorithmic and practical issues.

Course aims and objectives:

  • Teach intermediate topics in machine learning
  • Provide hands-on experience in designing and programming data science algorithms

Prerequisites:

  • DS-GA-1001: Introduction to Data Science, or an undergraduate course in machine learning.
  • DS-GA-1002: Probability and Statistics for Data Science
  • Some experience in programming: Java, C, C++, Python, R, Lua, Ruby, OCaml or similar languages, equivalent to two introductory courses in programming, such as “Introduction to Programming” and “Data Structures and Algorithms.”
  • Some prerequisites may be waived with permission from the instructor. 

DS-GA-1004: Big Data

Course credits: 3
Year of the Curriculum:
One
Semester:
Spring

This course covers methods and tools for automatic knowledge extraction from very large datasets. Methods include on-line learning, feature hashing, class embedding, distributed databases, the MapReduce framework, and applications.

Prerequisites:

  • DS-GA-1001: Introduction to Data Science or equivalent undergraduate course
  • DS-GA-1002: Probability and Statistics for Data Science
  • Some prerequisites may be waived with permission from the instructor.

DS-GA-1005: Inference and Representation

Course credits: 3
Year of the Curriculum:
Two
Semester:
Fall

This course covers graphical models, causal inference, and advanced topics in statistical machine learning.

Course aims and objectives:

  • Teach exact and approximate inference methods in graphical models.
  • Teach learning techniques for graphical models and structured prediction.
  • Teach methods for causal inference.

Prerequisites:

  • DS-GA-1004: Big Data

DS-GA-1006: Capstone Project and Presentation in Data Science

Course credits: 3
Year of the Curriculum:
Two
Semester:
Fall

The purpose of the capstone project is to make the theoretical knowledge acquired by the students operational in realistic settings. During the project, students work through the entire process of solving a real-world problem: from collecting and processing real-world data, to designing the best method to solve the problem, to implementing a solution. The problems and datasets come from real-world settings identical to what the student would encounter in industry, government, or academic research. Students will work individually or in small groups on a problem that will typically come from industry and involve an industry-sourced dataset, but that could also be provided by academic research groups inside or outside NYU. A list of such problems will be available early in the semester, and students will select a problem aligned with their personal interests. Students with similar interests may form groups of 2 or 3. The selection of problems and the formation of groups must be approved by the course director. Each project team will be supervised by the course instructor and advised by a project advisor from the academic or industry group that originated the project.

Here are two examples of illustrative projects: 

  • A large insurance company has an anonymized dataset of workers' compensation claimants. The insurance claims dataset incorporates corresponding data, such as claimant demographics and claims payments. A team of capstone students, advised by the instructor in conjunction with a technical coach from the company, uses the dataset to develop and implement an analytic solution using software tools studied in previous courses.
  • A professor from the Department of Politics has a dataset about tweets from individuals, with some indication of the party affiliation of each individual. Students use text classification methods studied in class to build a system that can predict party affiliation and voting behavior from tweets, in conjunction with social network tools to sort tweeters.

Course aims and objectives:

  • Students will demonstrate an ability to handle a problem in data science from the point of problem definition through delivery of a solution. In doing so, they will demonstrate proficiency in collecting and processing real-world data, in designing the best methods to solve the problem, and in implementing a solution.
  • Students will demonstrate competence in presenting material by delivering two presentations: a proposal on how to approach the problem and their final solution.
  • Students will learn how to work in small teams by working with at least one other student on their project.
  • Students will write a report on their project for evaluation by the instructor in consultation with the project advisors. The report will be structured as a typical research paper, and hence will include three main sections: 1. motivation and problem definition, existing approaches to the problem; 2. proposed solution; 3. results, conclusion, and directions for future work.

Prerequisites:

  • Successful completion of DS-GA-1001: Introduction to Data Science, DS-GA-1002: Probability and Statistics for Data Science, DS-GA-1003: Machine Learning and Computational Statistics, and DS-GA-1004: Big Data