NYU is excited to announce the continuation and expansion of CURP in 2022! CURP is the *Center for Data Science – Courant Undergraduate Research Program* which was launched in Spring 2021 by the Center for Data Science in partnership with the National Society of Black Physicists. This year the Center for Data Science and the Courant Institute of Mathematical Sciences will be offering more mentors from a broader range of research areas and more computing and research projects in Spring 2022!

**PROGRAM DATES:** JANUARY 24 – MAY 18, 2022

**LOCATION:** RESEARCH PROJECTS WILL TAKE PLACE REMOTELY

**FELLOWSHIP AWARD:** $3,500

**APPLICATION DEADLINE:** EXTENDED TO NOVEMBER 30, 2021

Take a look at last year’s participants from CURP’s inaugural program in Spring 2021.

## About

A strength of New York University is the active engagement of our scholarly community in a global world. As such, NYU’s community is enriched by individuals reflecting diverse sociocultural identities, perspectives, and experiences. NYU recognizes the value of a diverse community in supporting an intellectually challenging and inclusive educational environment.

In Spring 2021, NYU’s Center for Data Science partnered with the National Society of Black Physicists to offer the NYU CDS Undergraduate Research Program (CURP). This fall, CURP is a joint effort between the Center for Data Science and the Courant Institute of Mathematical Sciences, continuing in the same spirit.

CURP is a research mentorship program designed for a diverse group of undergraduate students who have completed at least two years of university-level courses and would like to conduct research in computing and data science. The objectives of CURP are to provide talented students with meaningful research opportunities, to help them develop the skills and knowledge needed to participate in successful research collaborations, and to integrate them into a community of academic peers and world-renowned faculty mentors who can advise, encourage, and support them. This is currently an online program, so students can participate from anywhere in the country.

The program will run for the Spring semester from January 24 – May 18, 2022. A CURP fellowship award of $3,500 will be offered.

*This program was made possible by Capital One, DeepMind Technologies, and Moody’s. We thank them for their support.*

#### Julia Kempe, Director, Center for Data Science

“We are excited to launch this unique undergraduate research program here at NYU’s Center for Data Science in close collaboration with the National Society of Black Physicists (NSBP). We hope that talented undergraduate STEM students take full advantage of this extraordinary opportunity and enjoy the exposure to cutting-edge and exciting data science research projects and mentorship by outstanding faculty in an inclusive environment designed to especially support students of diverse backgrounds.”

#### Stephon Solomon Alexander, Outgoing President of the National Society of Black Physicists

“The National Society of Black Physicists is excited to partner with NYU Center for Data Science (CDS) for the CDS Undergraduate Research Program. This opportunity will give students firsthand experience and training in Big Data and Machine Learning. These skills will definitely open new opportunities in a broad range of scientific fields and technology.”

## Program Description

The Center for Data Science and the Courant Institute of Mathematical Sciences are dedicated to ensuring that their scholarly community and the fields of computing and data science are enriched by individuals who, through their various backgrounds and life experiences, contribute to an intellectually challenging and inclusive educational environment. Our priorities are to maximize opportunities for students to be connected to top-notch faculty, and to learn in an environment that embraces a diversity of perspectives and recognizes the values and unique experiences that those from historically underrepresented communities bring to the table. To that end, CURP is open to students of all backgrounds and especially encourages applications from individuals who come from diverse backgrounds and whose academic and research experiences contribute significantly to the diversity and academic excellence at NYU.

Under the direction of CDS and Courant faculty, undergraduate students will complete a research project, give a presentation, and write a technical report on their research project. Students will be part of a group of peers with common interests in data. They will have the opportunity to attend talks given by leading researchers in their fields, attend workshops aimed at developing skills and techniques needed for research careers in computing and data science, and learn techniques that will prepare them for admission to graduate and doctoral programs as well as for fellowship applications. Students receive a $3,500 fellowship award for their participation in the program.

After the research opportunity, each student will:

- be part of a network of mentors that will provide continuous advice in the long term as the student makes progress in their studies.
- have access to faculty advice and references for graduate applications.
- be part of a network of other students of diverse backgrounds with similar interests.

## How to Apply

### Eligibility

Applicants must be enrolled in a postsecondary institution for the Spring 2022 semester and be a United States citizen or a permanent resident. We are also pursuing opportunities for undocumented and DACA students, and welcome applications from these students.

Priority will be given to current juniors and seniors.

Students should plan to commit approximately 10 hours per week to their research projects.

Participation in our virtual Python Bootcamp is mandatory and students must make themselves available the week prior to the start of the program.

### Deadline

Our application is open until all available opportunities are filled. Applications received on or before **October 31, 2021** will receive full consideration. Applications received after October 31st will be considered on a rolling basis.

### Application Materials

Applications for CURP Spring 2022 must be submitted via the CURP application link, which lists the required application materials.

A complete application consists of four items. These items are:

#### 1. Transcripts

Applicants must include a copy of their transcript showing courses and grades from all postsecondary institutions they have attended. Unofficial copies (as long as they are legible) are acceptable, though we may request official transcripts upon acceptance to the program. Transcripts should be current through Spring 2021. If possible, a list of your Fall 2021 courses should be included.

#### 2. Statement of Interest

Applicants must write a personal statement (500 – 1,000 words) addressing their research interests, particularly with respect to the faculty and topics listed below, and their reasons for wanting to participate in CURP. You may write in any style, but try to address the origins of your interest in science, experiences (school-related and other) that have particularly stimulated you, obstacles you have faced along the way, and future educational and career plans and aspirations. If you are currently attending a two-year institution, provide the name of the four-year institution to which you plan to transfer and the date when you plan to transfer in your statement.

This year’s CURP research projects and their descriptions are listed below:

Many language technology applications today rely on large-scale language models. This project investigates the statistical properties of these models, including their suboptimal features, with the aim of making them more useful across languages.

**Prerequisites: Statistics and Python**

Social media gives us the opportunity to be involved in any discussion about trending topics. This project aims to develop tools that use comments on social media to monitor news-related trends in our daily lives.

**Prerequisites: Programming experience (previous work in web development would be a plus but not a must)**

The project will study some empirical questions related to robustness of deep learning models. Deep networks are known to be highly sensitive to various kinds of perturbations of the input examples (“adversarial examples”) or input distribution (“distribution shift”). It has been shown that adversarial examples can arise even at initialization due to the randomness of the weights, but can also be a result of spurious correlations of certain features with the label. We will study these two factors and assess their prevalence and importance on various datasets.

**Prerequisites: None**

Deep equilibrium (DEQ) models are a new class of architectures that have different properties than the usual feedforward architectures, such as the ability to control the trade-off between accuracy and inference time. Starting from the DETR detection model, we can build the first DEQ model for detection and study its properties.

**Prerequisites: PyTorch and Computer Vision (preferred, not mandatory)**

In this project, we will learn how to numerically compute the trajectories of charged particles in discontinuous magnetic fields, as may occur in certain models used for the description of magnetic confinement fusion experiments. The project will involve learning the equations for the motion of charged particles in a magnetic field, and learning and developing efficient numerical methods to solve these equations with a computer.

**Prerequisites: Calculus I and II required; elementary ordinary differential equations and basic physics would be helpful**

Stepped pressure equilibria have proved to be successful for the study of magnetic confinement fusion experiments. The purpose of the project is to investigate the potential of new algorithms for constructing such equilibria. In order to be able to do so, you will learn the finite element method for solving the partial differential equations appearing in stepped pressure equilibria, and learn schemes for solving the nonlinear equations determining force balance.

**Prerequisites: Calculus I and II, Elementary Physics**

Much like in other domains, such as autonomous driving, Deep Learning models are starting to have an impact in healthcare. Particularly in medical imaging, there has been an explosion of research around developing models that can analyze images to provide better diagnoses and help radiologists improve their diagnostic accuracy. However, in addition to rendering a diagnosis, these models can bring significant benefits to other aspects of radiology, such as optimizing the workflow of radiologists, which in turn can help improve patient care. In this project we will work on developing machine learning models for some of the problems that fall under the umbrella of workflow optimization in medical imaging, such as building AI-based customized hanging protocols of images.

**Prerequisites: Probability and Statistics, Linear Algebra, Calculus, Data Science (optional), Introduction to Machine Learning (optional)**

This project aims to understand the conditions under which the “fair-weather” O(log n) efficiency of the Messaging Layer Security (MLS) protocol can be achieved.

**Prerequisites: Algorithms, Cryptography, advanced mathematics background**

We aim to characterize how humans represent tasks in naturalistic environments. An emerging literature in human reinforcement learning has established that humans represent different tasks using a small number of features. One key insight has been that this set of features changes based on experience with the reward function of the task. This finding raises the intriguing possibility that precise real-time task inference (e.g. by an artificial assistant) may be possible by identifying which features of the world humans focus on at a given point in time. However, this theoretical work has been confined to reduced laboratory settings in which the structure of the environment is artificially defined. We aim to extend this work to more realistic environments in the context of virtual reality, and ultimately to uncover a sufficient state representation that is invariant to fluctuations in reward over the short time scale of everyday tasks.

**Prerequisites: Multivariate data analysis (optional), Linear algebra (optional), Research methods (optional)**

Climate models are used to predict the response of the climate system to anthropogenic (human) influence, for example, global warming caused by our use of fossil fuels. Despite the fact that climate models are run on some of the largest and fastest computers in the world, we lack the computing power to properly simulate all the physical processes that play an important role in the climate system. In this project, we’ll focus on a particular process that is poorly simulated by our best models (gravity waves*) and use machine learning and data science to better represent their impact on the atmospheric circulation. (*Note that these gravity waves are not Einstein’s waves, but waves that are generated by mountains and storms in our atmosphere. We call them gravity waves because gravity is the restoring force!)

**Prerequisites: Multivariate calculus, linear algebra, basic physics; differential equations is recommended but not required**

Students will explore ways to optimize generated text given certain metrics (e.g. sentiment, grammaticality, or quality of the text). Relevant topics include neural language modeling, text generation, and reinforcement learning.

**Prerequisites: Statistics, Linear algebra, Multivariate calculus, (optional but preferred) NLP/Machine Learning/AI**

Not all questions posed to QA models can be answered straightforwardly. For example, the question “How old was Mark Zuckerberg when he founded Google?” cannot be answered with any age even though the question asks for an age (“How old”) because it contains a false background assumption (“Mark Zuckerberg founded Google”). Such questions are known to pose significant challenges to existing QA systems [1,2]. This project focuses on questions that are typically categorized as “unanswerable” in current QA (questions that fail in some way, including the aforementioned cases of assumption failure), aiming to explore ways in which QA models can repair the failures and provide more informative answers than is currently possible. Phu Mon Htut (PhD student at CDS) will also help co-mentor this project.

- Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
- General-Purpose Question-Answering with Macaw

**Prerequisites: Statistics/Linear algebra (preferred, not mandatory)**

Graph Neural Networks (GNNs) are becoming popular in many domains of data science. For example, they have been used for drug discovery and for reasoning on social networks. However, training GNNs on very large-scale graphs remains slow. In this project, we’ll investigate how to accelerate GNN training. One factor affecting speed is the comparatively slow sampling process in which the system samples a batch of subgraphs for mini-batch training. Traditionally, sampling is done on CPUs, and we’ll try to build a GPU-based sampling subsystem.

**Prerequisites: C++ and Python programming experience; basic knowledge of deep neural networks**

Kernel methods are standard and popular methods for data and model fitting. They enjoy substantial mathematical theory, are flexible and can be easily modified, and have direct implementations. However, their computational costs scale unfavorably as the number of data points increases, and large datasets are becoming more common. The purpose of this project is to explore one or several methods for lowering their computational costs, such as sketching and random Fourier features. Through this project, student(s) will gain experience working with real data and learn new computational skills.

**Prerequisites: Linear algebra and calculus. Experience with basic probability/statistics and basic numerical analysis is helpful**

Less than 2% of our genome is protein-coding DNA; the vast expanses of non-coding DNA make up the genome’s “dark matter”, where introns, repetitive and regulatory elements reside. These “dark matter” elements play important roles in both physiological development and disease, yet the underlying regulatory logic which dictates function remains poorly understood. In our research, we use approaches from synthetic biology to generate massive biological datasets that increase both the quantity and quality of data available for analysis. We then design interpretable machine learning to determine how non-coding sequence contributes to function. In addition to bioinformatic and data analysis, students involved in this project will have the opportunity to participate in every step of the data lifecycle, from generating biological data to validating hypotheses. There is also the option of conducting experimental research (molecular biology and next-generation RNA sequencing).

**Prerequisites: None**

The task of sampling from a probability distribution with known density arises almost ubiquitously in the computational sciences, from Bayesian inference to computational chemistry and beyond. The most widely-used approach for this task is Markov-chain Monte Carlo (MCMC), in which a Markov chain is run for many steps to generate a new independent sample from the target density. Although MCMC methods are quite flexible in principle, the mixing time (i.e., the number of iterations needed to generate an independent sample) can be extremely long if the target density is poorly conditioned or multimodal. Recently, an ensemble MCMC approach [arXiv:2106.02686] was introduced to address slow mixing times, but the advantage of this approach disappears in the limit of high dimensions without careful design choices. This project will investigate machine-learning based approaches to design a successful Markov chain to fit into this ensemble framework that is robust to the high-dimensional limit. On the way, the student will learn about MCMC and generative modeling, two important paradigms for modern computational probability.

**Prerequisites: Multivariate calculus; Linear algebra; Probability; some experience with scientific computing in Python, Matlab, or Julia; any background in machine learning and/or Monte Carlo methods will be helpful, but not required**

Many methods have been developed to analyze, solve, and control linear dynamical systems; fewer methods exist to analyze nonlinear dynamical systems. Nonlinear dynamical systems can be used to model network activity in many different fields such as ecology, biology, neuroscience, and sociology. Fortunately, some nonlinear systems can be transformed into linear systems using Koopman operator theory, allowing us to apply linear analysis to systems that were originally nonlinear. We will explore how to transform nonlinear dynamical systems into linear dynamical systems and when this transformation works using a combination of analysis and data-driven discovery methods. We will investigate why certain nonlinear-to-linear transformations exist and attempt to apply such transformations to well-known nonlinear systems in biology or neuroscience.

**Prerequisites: Linear algebra, Ordinary differential equations, some Matlab or Python programming experience**

Video analytics systems are increasingly being deployed, and are used by cities to route traffic, detect crimes, etc. Only a limited number of approaches are available for testing the performance and correctness of these systems, even though their correctness depends on the input videos. This research project aims to design automated systems to generate videos to evaluate these systems.

**Prerequisites: Python, C# and C++**

This project will aim to learn dynamics of physical systems from data to make accurate predictions into the future. The focus will be on an empirical study of locally low-dimensional approximations and on applications from computational fluid dynamics.

**Prerequisites: Linear algebra, Multivariable calculus, Probability, Strong programming experience (Python). Strong math background.**

The MALACH corpus is a set of interviews with Holocaust survivors from Steven Spielberg’s Shoah Foundation, along with a set of metadata to enable research on improving speech recognition for this important corpus. Some of the challenges include accented, emotional speech and a very large number of foreign named entities such as towns, villages, and names. The most recent evaluation of speech recognition took place in 2019, when the best open-source technology available at that time was able to achieve a 20% Word Error Rate (WER). Such a number, while impressive, is not low enough to enable downstream processing using sophisticated natural language processing techniques. Since 2019, there have been major improvements in general speech recognition performance, enabled by end-to-end deep learning systems incorporating new technologies such as transformers and RNN-T architectures, along with a significant number of new open-source datasets that can serve as the basis for transfer learning approaches. The goal of this project is to apply the latest open-source speech technology to this problem and significantly lower the WER for the MALACH corpus. The student who works on this task will become familiar with state-of-the-art, deep-learning-based open-source speech processing technologies and learn how to build speech recognition systems for new and difficult tasks. To achieve these ambitious goals, an appropriate background would include strong computing skills, especially in Python and shell scripting, and coursework in areas such as signal processing, probability, and machine learning.

**Prerequisites: Machine Learning, Signal processing, Probability, Natural Language Processing**

Getting robots to adapt to new environments and effectively interact with new objects is a long-standing challenge in robotics. Work in large-scale data collection and simulation-to-real learning has offered promise in robot generalization by fitting large parametric models on diverse robotic data. However, such methods are only able to solve simple manipulation skills such as pushing and grasping, while more complex skills often require long training times on the order of a few months. In this project we will build on recent research from our lab and use deep learning techniques to train robots to adapt in varying environments.

**Prerequisites: Statistics, Linear Algebra, and Calculus are optional**

In some domains, machine learning models show high performance that matches that of humans. However, matching the performance of humans does not mean that a model makes use of the same information a human would. This project will explore and seek to understand the prevalence and influence of shortcuts in healthcare, with a side goal of understanding how the use of shortcuts relates to the biases that may be encoded in models.

**Prerequisites: Probability and Multivariate Calculus**

The goal of the project is to explore how weakly supervised learning methods can benefit classification and/or segmentation tasks in medical image analysis.

**Prerequisites: Python, PyTorch/TensorFlow, Linear algebra**

In this project, we will work on building a dashboard interface to observe the internal operations of and control a network of programmable nodes. Each node in the network can be interactively programmed through a small program that controls how that node processes packets. We plan to use this dashboard both as a research and a teaching tool.

**Prerequisites: Programming, exposure to computer networks through internships or courses**

This project will involve creating tools to automatically generate new problems for undergraduate networking courses. The key idea in this problem generator will be to use program synthesis to create these problems.

**Prerequisites: Programming, exposure to computer networks through internships or courses**

With the advent of AI driven solutions for healthcare, this project aims to design new programming constructs to build domain specific concordant systems where the machine learning predictions are concordant with the rules and regulations followed by doctors and medical practitioners. Refer to our WSDM 2021 paper on this topic for more details.

**Prerequisites: Statistics, Machine Learning, Programming Languages, Operating Systems, Strong Programming Skills (Python, C, C++)**

Single Cell Genomics projects generate large volumes of genomic data from individual biological samples. This project aims to develop new techniques to analyze these large volumes of data for a variety of statistical properties at high throughput. This project will require the student to be conversant with the basics of bioinformatics and proficient in programming and machine learning techniques.

**Prerequisites: Bioinformatics, Machine Learning, Strong Programming Skills (Python)**

This project aims to leverage NLP techniques to design privacy-aware systems using the principles of contextual integrity, a socio-technical theory for describing privacy norms. The project involves combining NLP techniques with formal logic techniques to convert privacy policy specifications written in English into a formal logic representation and to verify the correctness of these policies against real-world information flows.

**Prerequisites: Machine Learning, NLP, Programming Languages; courses in Crypto, Security, Privacy, Formal Logic, or Advanced Systems are a plus; Strong programming skills (Python + ability to learn a Functional Programming Language)**

Many AI-driven systems are unaware of the underlying specifications of the context that these systems are designed for. This project aims to design a new programming paradigm where administrators can specify causal hypotheses and constraints in their environment, and the underlying AI-driven solutions for the system must obey the specified causal constraints.

**Prerequisites: Statistics, Machine Learning, Strong Programming Skills**

This project studies the presence of chaos in models of networks of interacting neurons. A system is said to be chaotic when different, but very similar, initial configurations of the system quickly evolve into very different states. A characteristic of chaotic systems is their erratic behavior. In the first part of the project we will go through some fundamental concepts from the theory of dynamical systems and chaos, in particular Lyapunov exponents. In the second part, we will look at some models of networks of interacting neurons trying to identify mechanisms that can lead to positive Lyapunov exponents and chaos.

**Prerequisites: Calculus I and II, Linear Algebra, Ordinary Differential Equations**

This project, in collaboration with Prof. Laure Zanna, will develop and pursue Bayesian machine learning approaches, such as Bayesian neural networks and Gaussian processes, to extrapolating ocean levels across time. The project will involve many foundations — linear algebra, probabilistic methods, multivariable calculus, programming proficiency. It will also involve several areas from the data science side, including spatiotemporal modelling, representation learning, and transfer learning. We will provide resources to get up to speed, but a strong background and interest in math and programming will be helpful.

**Prerequisites: Linear algebra, Multivariable calculus, Probability, strong programming experience (Python preferred). Strong math background.**

#### 3. Description of Previous Summer / Research Experiences

Applicants must provide a brief description of all science research or summer programs (high school or college) in which they have participated.

#### 4. Faculty Reference Letter of Recommendation

Applicants must designate one faculty member as a reference; this should be someone from whom you have taken a class or with whom you have done independent scientific work. Letters must be sent directly to curp@nyu.edu. It is the student’s responsibility to make sure the letter is received on time. You may include a second letter if you think it will strengthen your application significantly. The deadline for receiving letters of recommendation is **December 5, 2021**.

For additional information, please contact us at curp@nyu.edu. We look forward to seeing your application!

**Disclaimer:** This webpage is still subject to some change, but it serves as a clear representation of the process as it stands now.