The NYU Center for Data Science is excited to announce the continuation and expansion of CURP in summer 2023!
PROGRAM DATES: JUNE 12 – AUGUST 18, 2023
LOCATION: RESEARCH PROJECTS WILL TAKE PLACE REMOTELY
TIME COMMITMENT: 35 HOURS/WEEK
FELLOWSHIP AWARD: $8,000
APPLICATION DEADLINE: MONDAY, JANUARY 16, 2023
CURP, the Center for Data Science Undergraduate Research Program, was launched in spring 2021 in partnership with the National Society of Black Physicists with the following objectives in mind:
- to provide opportunities for students to be connected to top-notch faculty
- to provide talented undergraduate students with a rigorous research environment that embraces a diversity of perspectives and unique experiences of those from underrepresented and under-served communities
To this end, CURP is open to all students but especially encourages individuals from diverse backgrounds to apply.
During the program, students will work on assigned research projects under the supervision of CDS faculty. They will have the opportunity to attend talks given by leading researchers in their fields and workshops to develop skills needed for research careers in data science. They will also learn techniques to prepare for admission to graduate programs and to apply for fellowships.
After the research opportunity, each student will:
- be part of a network of mentors who will provide continuous advice in the long term as the student makes progress in their studies.
- be part of a network of students from diverse backgrounds with similar interests.
How to Apply
- Applicants must be enrolled in a postsecondary institution or be recent graduates, and must be United States citizens or permanent residents. We are also pursuing opportunities for undocumented and DACA students and welcome applications from these students.
- Priority will be given to rising juniors and seniors.
- Applicants must be able to commit approximately 35 hours per week (equivalent to full time) to their research projects.
- Participation in our virtual Python Bootcamp is mandatory, and students must make themselves available the week before the start of the program.
Applications were due on Monday, January 16, 2023.
Applications for CURP summer 2023 must be submitted via the CURP application link and contain the required application materials.
A complete application consists of the following items:
1. Transcript
Applicants must include a copy of their transcript showing courses and grades from all postsecondary institutions they have attended. Unofficial copies (as long as they are legible) are acceptable, though we may request official transcripts upon acceptance to the program. Transcripts should be current through fall 2022.
2. Statement of Interest
Applicants *must* write a personal statement addressing: 1) why you would like to participate in CURP, 2) what you could bring to the CURP Program in terms of your unique experience, perspectives and background that could contribute to diversity/inclusion. You *may* also address the origins of your interest in science, experiences (school-related and other) that have particularly stimulated you, obstacles you have faced along the way and how you overcame them, and future educational and career plans and aspirations. If you are currently attending a two-year institution, provide the name of the four-year institution to which you plan to transfer and the date when you plan to transfer in your statement. The personal statement should be no longer than 2 pages.
3. Concise Research Interest Statement
Applicants *must* submit a concise research interest statement addressing: 1) the data science research areas they are interested in and 2) the skills and educational background that prepare them to conduct research in those areas. This should not be more than one page.
4. Faculty Reference Contact Information
Applicants must designate one faculty member as a reference. The recommender should be someone from whom you have taken a class or with whom you have done independent scientific work. They are not required to submit a recommendation letter, but they should be available to serve as your reference if needed.
If you have any questions, please contact firstname.lastname@example.org.
Jacopo Cirrone: Artificial Intelligence and Machine Learning Approaches to Understand Autoimmunity
Artificial intelligence has the potential to revolutionize how healthcare is delivered. It can improve patient outcomes overall through the enhancement of preventive care and quality of life, as well as by providing accurate diagnoses and treatment strategies. One of the most popular healthcare datasets is based on Electronic Health Records (EHRs), which contain important information about patients’ health conditions and the care they receive. This research project will focus on EHR data related to autoimmune disease, a condition that occurs when the immune system mistakenly attacks the body. The student who works on this project will mine clinical data to predict which patients with a particular autoimmune disease will respond to a specific type of treatment. Moreover, the focus will be on building machine learning algorithms that will attempt to predict how a patient will respond to a given treatment. The model will first prove itself by independently arriving at conclusions that match the clinical outcomes in the EHR; once validated, the model will examine the most influential factors that predict a patient’s response to a given drug.
George Wood: Discrimination in Policing
This project will investigate racial discrimination in police behavior. Existing research shows racial disparities in arrests, stop and search, and the use of force, with Black civilians subjected to these actions at considerably higher rates. Discrimination has a direct adverse impact on the people and communities subjected to higher levels of police interference and is likely to impact distrust in the police, which is widespread and consequential for public safety. This project will examine discrimination using a massive dataset on officer deployments, stop and searches, arrests, tickets, and the use of force in Chicago. The project will focus on documenting racial disparities in police-civilian interactions using a rigorous, data-driven approach. One particular emphasis of the project will be examining whether the extent of discrimination has changed in recent years.
Brian McFee: Reconstructing audio signals from useful representations
This project will investigate and quantify the potential for audio signals to be reconstructed from lossy representations used in modern machine learning architectures for audio analysis. The eventual goal is to determine if we can find a representation that works well for a given downstream analysis task—e.g., sound event detection or music instrument recognition—without completely divulging the content of the original audio signal. The results will have implications for privacy preservation, copyright protection, and distribution of open-access audio datasets.
SueYeon Chung: On the Inductive Bias of Gradient Descent and Representation Geometry
Recent theoretical results suggest that gradient descent learning results in an implicit inductive bias in the weights of the network, such that the norms of the weights are minimized. Meanwhile, recent empirical findings suggest that hierarchical processing stages in deep networks geometrically transform feature representations such that “manifolds” corresponding to different categories become more separable. This project aims to connect these findings by exploring how the changes in the weight norms during learning contribute to representation untangling, if at all.
Zhengyuan Zhou: Learning to Recommend Academic Articles
In this project, you will help build a recommendation engine that provides personalized article recommendations. The content to be recommended comes exclusively from academic articles (including archived academic papers or blog articles on towardsdatascience, as a few examples) and is geared towards people in the academic community. The project has several components, including crawling content from article sources, content understanding, and interacting with users to adaptively improve personalized recommendations.
Dennis Shasha: Automated public health decision tree
We have developed a tool that takes a text input and converts it into a dynamic decision tree. The candidate would use that tool to guide normal citizens to improve their health or to navigate the health bureaucracy. The main job of the student would be to assemble content in the proper form and to test the resulting dynamic decision tree with normal people. Some experience with data preparation in Python or another programming language would be helpful.
Todd Gureckis: Teaching and learning from others in humans and machines
Reinforcement learning (RL) typically concerns how humans and machines learn through trial-and-error interaction with their environment. However, a lot of what humans learn is instead transmitted through language: we give and receive instructions, explanations, and hints that can enable us to perform a task well even on our first try. Standard RL models have no way of explaining how they perform RL tasks (i.e., summarizing their action policies in language), nor can they adjust their behavior based on others’ instructions. In this project, we plan to test how humans achieve these feats and develop new models with more human-like capacities.
Duygu Ataman, Courant Computer Science: Language Modeling
Many language technology applications today rely on large-scale language models. This project investigates statistical properties of these models and suboptimal features to make them more useful in various languages.
Duygu Ataman, Courant Computer Science: Trend Monitoring on Social Media
Social media gives us the opportunity to be involved in any discussion about trending topics. This project aims to develop tools to monitor such trends in our daily lives related to news from comments on social media.
Alberto Bietti, Center for Data Science: Robustness of Deep Learning Models
The project will study some empirical questions related to robustness of deep learning models. Deep networks are known to be highly sensitive to various kinds of perturbations of the input examples (“adversarial examples”) or input distribution (“distribution shift”). It has been shown that adversarial examples can arise even at initialization due to the randomness of the weights, but can also be a result of spurious correlations of certain features with the label. We will study these two factors and assess their prevalence and importance on various datasets.
Nicolas Carion, Courant Computer Science: Deep Equilibrium Models for Object Detection
Deep equilibrium (DEQ) models are a new class of architectures that have different properties than usual feed forward architectures, such as the ability to control the trade-off between accuracy and inference time. Starting from the DETR detection model, we can build the first DEQ model for detection and study its properties.
Antoine Cerfon, Courant Mathematics #1: Charged Particle Dynamics in Discontinuous Magnetic Fields
In this project, we will learn how to numerically compute the trajectories of charged particles in discontinuous magnetic fields, as may occur in certain models used for the description of magnetic confinement fusion experiments. The project will involve learning the equations for the motion of charged particles in a magnetic field, and learning and developing efficient numerical methods to solve these equations with a computer.
Antoine Cerfon, Courant Mathematics #2: Numerical Construction of Stepped Pressure Plasma Equilibria
Stepped pressure equilibria have proved to be successful for the study of magnetic confinement fusion experiments. The purpose of the project is to investigate the potential of new algorithms for constructing such equilibria. In order to be able to do so, you will learn the finite element method for solving the partial differential equations appearing in stepped pressure equilibria, and learn schemes for solving the nonlinear equations determining force balance.
Sumit Chopra, Courant Computer Science: Deep Learning Models for Classifying and Hanging Radiology Images
Much like in other domains, such as autonomous driving, deep learning models are starting to have an impact in healthcare. Particularly in medical imaging, there has been an explosion of research around developing models that can analyze images to provide better diagnoses and help radiologists improve their diagnostic accuracy. However, in addition to rendering a diagnosis, these models can bring significant benefits to other aspects of radiology, such as optimizing the workflow of radiologists, which in turn can help improve patient care. In this project we will work on developing machine learning models for some of the problems that fall under the umbrella of workflow optimization in medical imaging, such as building AI-based customized hanging protocols for images.
Yevgeniy Dodis, Courant Computer Science: Average Case Efficiency of MLS
Understand conditions under which “fair-weather” O(log n) efficiency of MLS can be achieved: The Messaging Layer Security (MLS) Protocol
Kara Emery, Center for Data Science: Task Representation in Virtual Reality
We aim to characterize how humans represent tasks in naturalistic environments. An emerging literature in human reinforcement learning has established that humans represent different tasks using a small number of features. One key insight has been that this set of features changes based on experience with the reward function of the task. This finding raises the intriguing possibility that precise real-time task inference (e.g. by an artificial assistant) may be possible by identifying which features of the world humans focus on at a given point in time. However, this theoretical work has been confined to reduced laboratory settings in which the structure of the environment is artificially defined. We aim to extend this work to more realistic environments in the context of virtual reality, and ultimately to uncover a sufficient state representation that is invariant to fluctuations in reward over the short time scale of everyday tasks.
Ed Gerber, Joint CDS-CS: Improving Climate Models with Data Science
Climate models are used to predict the response of the climate system to anthropogenic (human) influence, for example, global warming caused by our use of fossil fuels. Despite the fact that climate models are run on some of the largest and fastest computers in the world, we lack the computing power to properly simulate all the physical processes that play an important role in the climate system. In this project, we’ll focus on a particular process that is poorly simulated by our best models (gravity waves*) and use machine learning and data science to better represent their impact on the atmospheric circulation. (*Note that these gravity waves are not Einstein’s waves, but waves that are generated by mountains and storms in our atmosphere. We call them gravity waves because gravity is the restoring force!)
He He, Joint CDS-CS: Controllable Text Generation
Students will explore ways to optimize generated text given certain metrics (e.g. sentiment, grammaticality, or quality of the text). Relevant topics include neural language modeling, text generation, and reinforcement learning.
Najoung Kim, Center for Data Science: Improving the Treatment of Unanswerable Questions and Their Repair Strategies in Question-Answering
Not all questions posed to QA models can be answered straightforwardly. For example, the question “How old was Mark Zuckerberg when he founded Google?” cannot be answered with any age even though the question asks for an age (“How old”) because it contains a false background assumption (“Mark Zuckerberg founded Google”). Such questions are known to pose significant challenges to existing QA systems [1,2]. This project focuses on questions that are typically categorized as “unanswerable” in current QA (questions that fail in some ways, including the aforementioned cases of assumption failure), aiming to explore ways in which QA models can repair the failures and provide more informative answers than is currently possible. Phu Mon Htut (PhD student at CDS) will also help co-mentor this project.
[1] Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
[2] General-Purpose Question-Answering with Macaw
Jinyang Li, Courant Computer Science: Accelerating Graph Neural Network Training
Graph Neural Networks (GNNs) are becoming popular in many domains of data science. For example, they have been used for drug discovery and for reasoning on social networks. However, training GNNs on very large-scale graphs remains slow. In this project, we’ll investigate how to accelerate GNN training. One factor affecting speed is the comparatively slow sampling process in which the system samples a batch of subgraphs for mini-batch training. Traditionally, sampling is done on CPUs, and we’ll try to build a GPU-based sampling subsystem.
Weilin Li, Courant Mathematics: Speeding Up Kernel Methods for Large Scale Computation
Kernel methods are standard and popular methods for data and model fitting. They enjoy substantial mathematical theory, are flexible and can be easily modified, and have direct implementations. However, their computational costs scale unfavorably as the number of data points increases, and large datasets are becoming more common. The purpose of this project is to explore one or several methods for lowering their computational costs, such as sketching and random Fourier features. Through this project, student(s) will gain experience working with real data and learn new computational skills.
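As a rough illustration of one of the techniques named above, the following Python sketch approximates a Gaussian (RBF) kernel with random Fourier features. It is a toy example under stated assumptions, not the project's code: the function names `rbf_kernel` and `make_rff_map`, the bandwidth `sigma`, the feature count, and the sample points are all chosen here for illustration.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff_map(dim, n_features, sigma=1.0, seed=0):
    """Return a random feature map z with z(x) . z(y) ~ rbf_kernel(x, y)."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/sigma^2); phases from U[0, 2*pi).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


feature_map = make_rff_map(dim=3, n_features=2000)
x = [0.1, -0.4, 0.7]
y = [0.3, 0.2, 0.5]
exact = rbf_kernel(x, y)
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
print(f"exact = {exact:.4f}, approx = {approx:.4f}")
```

Once every data point has been mapped to its feature vector, a linear model on those features stands in for the kernel method, so the n-by-n kernel matrix never has to be formed.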
Susan Liao, Courant Computer Science: Harnessing Machine Learning to Decode the Dark Matter of the Genome
Less than 2% of our genome is protein-coding DNA; the vast expanses of non-coding DNA make up the genome’s “dark matter”, where introns, repetitive and regulatory elements reside. These “dark matter” elements play important roles in both physiological development and disease, yet the underlying regulatory logic which dictates function remains poorly understood. In our research, we use approaches from synthetic biology to generate massive biological datasets that increase both the quantity and quality of data available for analysis. We then design interpretable machine learning to determine how non-coding sequence contributes to function. In addition to bioinformatic and data analysis, students involved in this project will have the opportunity to participate in every step of the data lifecycle, from generating biological data to validating hypotheses. There is also the option of conducting experimental research (molecular biology and next-generation RNA sequencing).
Michael Lindsey, Courant Mathematics: Learning Ensemble Samplers for High-Dimensional Probability Distributions
The task of sampling from a probability distribution with known density arises almost ubiquitously in the computational sciences, from Bayesian inference to computational chemistry and beyond. The most widely-used approach for this task is Markov-chain Monte Carlo (MCMC), in which a Markov chain is run for many steps to generate a new independent sample from the target density. Although MCMC methods are quite flexible in principle, the mixing time (i.e., the number of iterations needed to generate an independent sample) can be extremely long if the target density is poorly conditioned or multimodal. Recently, an ensemble MCMC approach [arXiv:2106.02686] was introduced to address slow mixing times, but the advantage of this approach disappears in the limit of high dimensions without careful design choices. This project will investigate machine-learning based approaches to design a successful Markov chain to fit into this ensemble framework that is robust to the high-dimensional limit. On the way, the student will learn about MCMC and generative modeling, two important paradigms for modern computational probability.
Megan Morrison, Courant Mathematics: Solving Dynamical Systems Using Koopman Operator Theory
Many methods have been developed to analyze, solve, and control linear dynamical systems; fewer methods exist to analyze nonlinear dynamical systems. Nonlinear dynamical systems can be used to model network activity in many different fields such as ecology, biology, neuroscience, and sociology. Fortunately, some nonlinear systems can be transformed into linear systems using Koopman operator theory, allowing us to apply linear analysis to systems that were originally nonlinear. We will explore how to transform nonlinear dynamical systems into linear dynamical systems and when this transformation works using a combination of analysis and data-driven discovery methods. We will investigate why certain nonlinear-to-linear transformations exist and attempt to apply such transformations to well-known nonlinear systems in biology or neuroscience.
Aurojit Panda, Courant Computer Science: Fuzz Testing Video Analytics
Video analytics systems are increasingly being deployed, and are used by cities to route traffic, detect crimes, etc. The correctness of these systems depends on the input videos, yet only a limited number of approaches are available for testing their performance and correctness. This research project aims to design automated systems to generate videos to evaluate these systems.
Benjamin Peherstorfer, Courant Computer Science: Learning How to Simulate Time-Dependent Physical Systems
This project will aim to learn dynamics of physical systems from data to make accurate predictions into the future. The focus will be on an empirical study of locally low-dimensional approximations and on applications from computational fluid dynamics.
Michael Picheny, Joint CDS-CS: Improving Speech Recognition Performance for Interviews with Holocaust Survivors
The MALACH corpus is a set of interviews with Holocaust Survivors from Steven Spielberg’s Shoah Foundation along with a set of metadata to enable research on improving speech recognition for this important corpus. Some of the challenges include accented, emotional speech and a very large number of foreign named entities such as towns, villages, and names. The most recent evaluation of speech recognition took place in 2019, where the best open-source technology available at that time was able to achieve a 20% Word Error Rate (WER). Such a number, while impressive, is not low enough to enable downstream processing using sophisticated Natural Language Processing Techniques. Since 2019, there have been major improvements in general speech recognition performance, enabled by end to end deep learning systems incorporating new technologies such as transformers and RNN-T architectures, along with significant amounts of new open source datasets that can serve as the basis for transfer learning approaches. The goal of this project is to apply the latest open source speech technology to this problem and significantly lower the WER for the MALACH corpus. The student who works on this task will become familiar with deep-learning based state of the art open-source speech processing technologies and learn how to build speech recognition systems for new and difficult tasks. To achieve these ambitious goals, an appropriate background would be a strong computing background especially in Python and shell scripting, and coursework in areas such as signal processing, probability, and machine learning.
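For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between the recognizer output and a reference transcript (substitutions, deletions, and insertions), divided by the number of words in the reference. The Python sketch below illustrates that definition only; it is not the evaluation code used for MALACH, and the function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "we moved to the village near the river"
hypothesis = "we moved to a village near river"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2%}")  # one substitution + one deletion over 8 words: 25.00%
```

Real evaluations usually normalize text (case, punctuation, number formatting) before scoring; that step is omitted here.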
Lerrel Pinto, Courant Computer Science: Robot Adaptation from Pixels
Getting robots to adapt to new environments and effectively interact with new objects is a long-standing challenge in robotics. Works in large-scale data collection and simulation-to-real learning have offered promise in robot generalization by fitting large parametric models on diverse robotic data. However, such methods are only able to solve simple manipulation skills such as pushing and grasping, while for more complex skills, long training times on the order of a few months are often required. In this project we will build on recent research from our lab and use deep learning techniques to train robots to adapt in varying environments.
Rajesh Ranganath, Joint CDS-CS: Shortcut Learning and Causality in Healthcare
In some domains, machine learning models show high performance that matches the performance of humans. However, matching the performance of humans does not mean that a model makes use of the same information a human would. The project would be to explore and understand the prevalence and influence of shortcuts in healthcare with a side goal of understanding how the use of shortcuts relates to the biases that may be encoded in models.
Anirudh Sivaraman, Courant Computer Science #1: A Dashboard for Observing and Controlling Programmable Networks
In this project, we will work on building a dashboard interface to observe the internal operations of and control a network of programmable nodes. Each node in the network can be interactively programmed through a small program that controls how that node processes packets. We plan to use this dashboard both as a research and a teaching tool.
Anirudh Sivaraman, Courant Computer Science #2: Automatic Problem Creation for Networking Courses
This project will involve creating tools to automatically generate new problems for undergraduate networking courses. The key idea in this problem generator will be to use program synthesis to create these problems.
Elena Sizikova, Center for Data Science: Using Weakly Supervised Learning for Medical Image Analysis
The goal of the project would be to explore how weakly supervised learning methods can benefit classification and/or segmentation tasks in medical image analysis.
Lakshminarayanan Subramanian, Joint CDS-CS #1: Domain Specific Concordance AI Systems for Healthcare
With the advent of AI driven solutions for healthcare, this project aims to design new programming constructs to build domain specific concordant systems where the machine learning predictions are concordant with the rules and regulations followed by doctors and medical practitioners. Refer to our WSDM 2021 paper on this topic for more details.
Lakshminarayanan Subramanian, Joint CDS-CS #2: High Throughput Single Cell Genomics Analytics
Single Cell Genomics projects generate large volumes of genomic data from individual biological samples. This project aims to develop new techniques to analyze these large volumes of data for a variety of statistical properties at high throughput. This project will require the student to be conversant with the basics of bioinformatics and proficient in programming and machine learning techniques.
Lakshminarayanan Subramanian, Joint CDS-CS #3: Privacy Aware Systems Using Contextual Integrity
Lakshminarayanan Subramanian, Joint CDS-CS #4: Bridging Causal Specifications in AI-driven Systems
Many AI-driven systems are unaware of the underlying specifications of the context that these systems are designed for. This project aims to design a new programming paradigm where administrators can specify causal hypotheses and constraints in their environment, and the underlying AI-driven solutions for the system must obey the specified causal constraints.
Matteo Tanzi, Courant Mathematics: Signatures of Chaos in Models of Interacting Neurons
This project studies the presence of chaos in models of networks of interacting neurons. A system is said to be chaotic when different, but very similar, initial configurations of the system quickly evolve into very different states. A characteristic of chaotic systems is their erratic behavior. In the first part of the project we will go through some fundamental concepts from the theory of dynamical systems and chaos, in particular Lyapunov exponents. In the second part, we will look at some models of networks of interacting neurons trying to identify mechanisms that can lead to positive Lyapunov exponents and chaos.
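To make the notion of a Lyapunov exponent concrete, the Python sketch below estimates it for the logistic map x → r·x·(1 − x), a standard textbook system that is not one of the neuron models studied in the project. The exponent is the long-run average of log|f′(x)| along a trajectory: a positive value means nearby initial conditions separate exponentially fast (chaos), and a negative value means they converge. The function name, parameter values, and iteration counts are illustrative assumptions.

```python
import math


def logistic_lyapunov(r, x0=0.2, burn_in=1000, steps=100_000):
    """Estimate the Lyapunov exponent of x -> r * x * (1 - x)."""
    x = x0
    # Discard the transient so the average reflects long-run behavior.
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # The derivative of the map is r * (1 - 2x); the floor avoids log(0).
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / steps


chaotic = logistic_lyapunov(3.9)   # parameter in the chaotic regime
periodic = logistic_lyapunov(3.2)  # parameter with a stable period-2 orbit
print(f"r = 3.9: {chaotic:+.3f}   r = 3.2: {periodic:+.3f}")
```

The estimate comes out positive for r = 3.9 and negative for r = 3.2, which matches the definition of chaos given above.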
Andrew Wilson, Joint CDS-CS and Laure Zanna, Courant Mathematics: Bayesian Machine Learning Applied to Climate Modelling
This project, in collaboration with Prof. Laure Zanna, will develop and pursue Bayesian machine learning approaches, such as Bayesian neural networks and Gaussian processes, to extrapolating ocean levels across time. The project will involve many foundations — linear algebra, probabilistic methods, multivariable calculus, programming proficiency. It will also involve several areas from the data science side, including spatiotemporal modelling, representation learning, and transfer learning. We will provide resources to get up to speed, but a strong background and interest in math and programming will be helpful.
- Carlos Fernandez-Granda: Deep learning for upper body movement in stroke patients and deep learning for microscopy
- Julia Kempe: Catastrophic forgetting and/or machine learning
- Jonathan Niles-Weed: Learning from randomly shuffled data
- Cristina Savin: Online learning in recurrent neural networks
- Sarah Shugars: Social media, political discourse, or computational social science
- Elena Sizikova: Understanding where neural networks look to make decisions with applications to medical imaging
- Andrew Gordon Wilson: Bayesian machine learning
- Laure Zanna: Machine Learning for Climate
- Wenda Zhou: Machine Learning for chemistry and force-field learning
What CURP Scholars Have to Say…
Zoga Duka, 2022 CURP Scholar
As a CURP fellow, I am learning in-demand tech skills that will help me in my career. I also learned the importance of asking questions to further my knowledge in my research topics, which has given me an advantage compared to my peers. NYU has given me access to their resources.
Vishweshwar Ramanakumar, 2022 CURP Scholar
I believe part time, in-semester research programs like CURP should be more common… While I also enjoyed my time participating in an REU last summer, I feel like providing an entire semester for a project gives me the time to take a deep breath, soak in the opportunity, and really enjoy the process of research.
John Como, 2022 CURP Scholar
CURP has provided me with the unique opportunity to conduct research under the guidance of outstanding faculty. Not only have I learned about the research topic, but have also learned how to become a stronger researcher for the future.
Kennedy Sleet, 2021 CURP Scholar
I had the opportunity to work alongside NYU’s Center for Data Science through their amazing program CURP in 2021. Throughout this experience, I learned so much in regard to my career and future goals. My project focused on developing code used to identify and visualize different molecules, atoms, and elements. It was extremely interesting learning about datasets and statistics. This specific project and opportunity gave me a better understanding of how coding and machine learning can have such a huge impact on expanding subjects such as chemistry. Overall, I am extremely grateful for the experience and it helped me further my ideas for my career. I met some amazing hard-working individuals including my mentor. Furthermore, I highly recommend this opportunity to those wanting to expand their knowledge and futuristic ideas regarding computer science, physics, astronomy, data science, and more!
Isaac Robinson, 2021 CURP Scholar
CURP taught me how to design a research process and work with a team. It taught me to grapple with questions without clear answers, to work smarter and harder towards a goal I defined myself, and most of all, how to be a part of a scholarly community. A life-changing experience I would highly recommend!
Disclaimer: This webpage is subject to some change, but it serves as a clear representation of the process as it stands now.