CDS Seminar Series


The CDS Seminar Series showcases the latest developments from all areas of data science, featuring local speakers, special interdisciplinary events, and the flagship CDS Colloquium with world-leading researchers in the field. It takes place every Friday from 2:00pm–4:00pm, with talks scheduled from 2:00pm–3:30pm, followed by a reception for attendees from 3:30pm–4:00pm.

Series Overview

Launched to highlight cutting-edge research, the CDS Seminar Series brings together experts from various disciplines within data science. The series aims to be accessible to the entire CDS community, fostering collaboration and sparking new ideas across different areas of the center. By featuring a diverse range of speakers and topics, the series provides a platform for knowledge exchange and interdisciplinary dialogue.

Key Focus Areas

The seminar covers a wide range of topics in data science, including but not limited to:

  1. Machine Learning
  2. Artificial Intelligence
  3. Natural Language Processing
  4. Computer Vision
  5. Statistical Modeling
  6. Data Visualization
  7. Big Data Analytics
  8. Neural Networks
  9. Reinforcement Learning
  10. Causal Inference
  11. Computational Biology
  12. Data Ethics and Fairness
  13. Deep Learning
  14. Bayesian Methods

This seminar series serves as a bridge between different parts of CDS, encouraging cross-pollination of ideas and fostering potential collaborations within the center.

Spring 2025 Events

  • Date: March 21, 2025
    • Speaker(s): Ilia Sucholutsky & Shauli Ravfogel
    • Title: Studying representational (mis)alignment: how and why? (Ilia Sucholutsky) / Representation-space interventions and their causal implications (Shauli Ravfogel)
    • Overview:
      • Ilia Sucholutsky: Do humans and machines represent the world in similar ways? Does it really matter if we do? I’ll share a simple method that anyone in the audience can immediately start using to study the representational alignment of their AI systems, even black-box ones, with humans or other models. We’ll then explore why we would want to do this by highlighting some examples of how representational alignment helps us learn more about both humans and machines and how it relates to key downstream properties of both machine learning and machine teaching.
      • Shauli Ravfogel: I will introduce a line of work that focuses on identifying and manipulating the linear representation of specific concepts in language models. The family of techniques I will cover provides effective tools for various applications, including fairness and bias mitigation as well as causal interventions in the representation space. I will formulate the problem of linearly erasing information, or steering the model towards one value of the concept, and derive closed-form solutions. In the second half of the talk, I will address a key limitation of representation-space interventions, their opaqueness, and present two techniques for making these interventions interpretable by mapping them back to natural language: (1) learning a mapping from the latent space to text, or (2) deriving causally correct counterfactual strings that correspond to a given intervention in the model. (A minimal sketch of the closed-form linear-erasure idea appears after the Spring 2025 listings below.)
  • Date: March 14, 2025
    • Speaker(s): Tom Griffiths
    • Title: Using cognitive science to explore the symbolic limits of large language models
    • Overview: Large language models are an impressive demonstration of how training a neural network on symbolic data can result in what looks like symbolic behavior. However, the limits of this symbolic behavior reveal some of the ways in which the underlying model and training regime can have unexpected consequences. I will highlight four such cases: ambiguous representations of numbers, influence of prior distributions on solutions to deterministic problems (“embers of autoregression”), paradoxical effects of chain-of-thought prompting, and implicit associations revealed through behavioral prompts. Each case makes use of specific tools from cognitive science — rational analysis, similarity judgments, and experimental methods designed to evaluate the impact of verbal thinking on behavior and reveal implicit biases.
    • Bio: Tom Griffiths is the Henry R. Luce Professor of Information Technology, Consciousness and Culture in the Departments of Psychology and Computer Science at Princeton University, where he is also the Director of the new AI Lab. His research explores connections between human and machine learning, using ideas from statistics and artificial intelligence to understand how people solve the challenging computational problems they encounter in everyday life. He has made contributions to the development of Bayesian models of cognition, probabilistic machine learning, nonparametric Bayesian statistics, and models of cultural evolution, and his recent work has demonstrated how methods from cognitive science can shed light on modern AI systems. Tom completed his PhD in Psychology at Stanford University in 2005, and taught at Brown University and the University of California, Berkeley before moving to Princeton. He has received awards for his research from organizations ranging from the American Psychological Association to the National Academy of Sciences and is a co-author of the book Algorithms to Live By, introducing ideas from computer science and cognitive science to a general audience.
  • Date: March 7, 2025
    • Speaker(s): Brenden Lake (CDS)
    • Title: Towards more human-like learning in machines: Bridging the data and generalization gaps
    • Overview: There is an enormous data gap between how AI systems and children learn: The best LLMs now learn language from text with a word count in the trillions, whereas it would take a child roughly 100K years to reach those numbers through speech (Frank, 2023, “Bridging the data gap”). There is also a clear generalization gap: whereas machines struggle with systematic generalization, children can excel. For instance, once a child learns how to “skip,” they immediately know how to “skip twice” or “skip around the room with their hands up” due to their compositional skills. In this talk, I’ll describe two case studies in addressing these gaps.
      • 1) The data gap: We train deep neural networks from scratch, not on large-scale data from the web, but through the eyes and ears of a single child. Using head-mounted video recordings from a child as training data (<200 hours of video slices over 26 months), we show how deep neural networks can perform challenging visual tasks, acquire many word-referent mappings, generalize to novel visual referents, and achieve multi-modal alignment. Our results demonstrate how today’s AI models are capable of learning key aspects of children’s early knowledge from realistic input.
      • 2) The generalization gap: Can neural networks capture human-like systematic generalization? We address a 35-year-old debate catalyzed by Fodor and Pylyshyn’s classic article, which argued that standard neural networks are not viable models of the mind because they lack systematic compositionality — the algebraic ability to understand and produce novel combinations from known components. We’ll show how neural networks can achieve human-like systematic generalization when trained through meta-learning for compositionality (MLC), a new method for optimizing the compositional skills of neural networks through practice. With MLC, neural networks can match human performance and solve several machine learning benchmarks.
      • Given these findings, we’ll discuss the paths forward for building machines that learn, generalize, and interact in more human-like ways based on more natural input, and for addressing classic debates in cognitive science through advances in AI.
  • Date: February 21, 2025
    • Speaker(s): Emily Black (CDS)
    • Title: Opportunities and Tensions for Fairness in AI & GenAI Systems
    • Overview: AI and GenAI models are now ubiquitous in decision-making in high-stakes domains from healthcare to employment. Unfortunately, both AI and GenAI systems have displayed bias on the basis of race, gender, income, and other attributes. In certain domains, particularly credit, housing, and employment, this bias is often illegal. While AI governance frameworks are rapidly evolving, some of the strongest tools we have to combat this kind of discrimination in the United States are civil rights laws dating back to the 1960s. Crucially, however, much of the academic work in the AI fairness space comes into tension with these laws, potentially making several state-of-the-art techniques to mitigate and even test for bias unusable in high-stakes domains such as credit, housing, and employment. In this talk, I'll (1) illuminate these tensions between civil rights laws and AI debiasing methods and point out some technical insights into how to sidestep them, (2) demonstrate how these tensions have played out in practice in real-world high-stakes AI decision-making contexts such as fair lending, and (3) discuss some further, unique tensions in developing effective regulation to prevent AI harms from Generative AI systems.

  • Date: February 14, 2025
    • Speaker(s): Xi Chen (CDS)
    • Title: Digital Privacy in Personalized Pricing and Trustworthy Machine Learning via Blockchain
    • Overview: This talk has two parts. The first part is on digital privacy in personalized pricing. When personalized information is involved, how to protect the privacy of that information becomes a critical issue in practice. In this talk, we consider a dynamic pricing problem with an unknown demand function of posted prices and personalized information. By leveraging the fundamental framework of differential privacy, we develop a privacy-preserving dynamic pricing policy that aims to maximize the retailer's revenue while avoiding leakage of individual customers' information and purchasing decisions. This is joint work with Prof. Yining Wang and Prof. David Simchi-Levi. The second part introduces the concept of using blockchain to create a decentralized computing market for AI training and fine-tuning. We introduce the concept of incentive-security, which incentivizes rational trainers to behave honestly in their own best interest. We design a Proof-of-Learning mechanism with computational efficiency, a provable incentive-security guarantee, and controllable difficulty. Our research also proposes an environmentally friendly verification mechanism for blockchain systems, allowing existing proof-of-work computations to be used for AI services, thus achieving useful proof-of-work.

  • Date: January 24, 2025
    • Speaker(s): Eunsol Choi (CDS)
    • Title: Equipping LLMs for Complex Knowledge Scenarios: Interaction and Retrieval
    • Overview: Language models are increasingly used as an interface to gather information. Yet trusting the answers generated by LMs is risky, as they often contain incorrect or misleading information. Why is this happening? We identify two key issues: (1) ambiguous and underspecified user questions and (2) imperfect knowledge in LMs, especially for long-tail or recent events. To address the first issue, we propose a system that can interact with users to clarify their intent before answering. By simulating the expected outcomes in future turns, we reward LMs for generating clarifying questions rather than answering immediately. In the second part of the talk, I will discuss the state of retrieval augmentation, which is often lauded as the path to providing up-to-date, relevant knowledge to LMs. While its success is evident in scenarios where a single gold document exists, incorporating information from a diverse set of documents remains challenging for both retrieval systems and LMs. Together, the talk highlights key research directions for building reliable LMs that answer information-seeking questions.
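
As mentioned in Shauli Ravfogel's March 21 abstract above, a recurring building block in representation-space interventions is a closed-form linear erasure of a concept direction. The sketch below is only a minimal illustration of that general idea, not the specific methods presented in the talk: it removes one concept direction from a set of embedding vectors with an orthogonal projection, and the vectors and direction are synthetic placeholders invented for the example.

```python
import numpy as np

def erase_direction(X, v):
    """Remove all variation along direction v from the rows of X.

    X: (n, d) array of representation vectors.
    v: (d,) concept direction to erase (e.g. a linear probe's weight vector).
    Returns X projected onto the hyperplane orthogonal to v.
    """
    v = v / np.linalg.norm(v)                  # unit-normalize the concept direction
    P = np.eye(v.shape[0]) - np.outer(v, v)    # closed-form orthogonal projection matrix
    return X @ P

# Synthetic example: five random "embeddings" in 8 dimensions, one random concept direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
v = rng.normal(size=8)

X_erased = erase_direction(X, v)
# After erasure, every row has a (numerically) zero component along v.
print(np.allclose(X_erased @ v, 0.0))   # True
```

In practice the direction would typically come from a probe or classifier trained to predict the concept, and published erasure methods refine this basic projection in various ways; the snippet only illustrates why such interventions admit closed-form solutions.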

Fall 2024 Events

  • Date: November 8, 2024
    • Speaker(s): Tim O’Donnell (McGill University)
    • Title: Syntactic And Semantic Control Of Large Language Models Via Sequential Monte Carlo
    • Overview: A wide range of LLM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints nontrivially alters the distribution over sequences, usually making exact sampling intractable. In this work, building on the Language Model Probabilistic Programming framework of Lew et al. (2023), we develop an approach to approximate inference for controlled LLM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computation in light of new information during the course of generation. We demonstrate that our approach improves downstream performance on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis. We compare to a number of alternative and ablated approaches, showing that our accuracy improvements are driven by better approximation to the full Bayesian posterior. (A toy illustration of the SMC idea appears after the Fall 2024 listings below.)
  • Date: November 1, 2024
    • Speaker(s): Grace Lindsay (CDS) and David Bau (Northeastern University)
    • Title: Finding Facts and Functions, and a Fabric (joint seminar with Minds, Brains, and Machines)
    • Overview: In this talk we discuss recent work in interpreting and understanding the explicit structure of learned computations within large deep network models. We examine the localization of factual knowledge within transformer LMs, and discuss how these insights can be used to edit behavior of LLMs and multimodal diffusion models. Then we discuss recent findings on the structure of computations underlying in-context learning, and how these lead to insights about the representation and composition of functions within LLMs. Finally, time permitting, we discuss the technical challenges of doing interpretability research in a world where the most powerful models are only available via API, and we describe a National Deep Inference Fabric that will offer a transparent API standard that enables transparent scientific research on large-scale AI.
  • Date: October 25, 2024
    • Speaker(s): Eric Oermann, Krzysztof Geras, Narges Razavian
    • Title: AI + Medicine
    • Overview: A series of brief talks will be given by our Langone affiliates Dr. Eric Oermann, Dr. Krzysztof Geras, and Dr. Narges Razavian, followed by a lively panel discussion and Q&A moderated by Sumit Chopra. 

  • Date: October 4, 2024
    • Speaker(s): Cédric Gerbelot-Barrillon and Jonathan Colner
    • Title: High-dimensional optimization for multi-spiked tensor PCA (Cédric Gerbelot-Barrillon) and Leveraging foundation models to extract local administrative data (Jonathan Colner)
  • Date: September 27, 2024
    • Speaker(s): Byung-Doh Oh, Aahlad Puli, and Yuzhou Gu
    • Titles: What can linguistic data tell us about the predictions of (large) language models? (Byung-Doh Oh), Explanations that reveal all through the definition of encoding (Aahlad Puli), Community detection in the hypergraph stochastic block model (Yuzhou Gu)
  • Date: September 20, 2024
    • Speaker(s): Nick Seaver, Kyunghyun Cho, Grace Lindsay (moderator: Leif Weatherby)
    • Title: Oral History of Machine Learning – “Why We Call It ‘Attention’”
    • Overview: Attention has rapidly become an essential technique in AI. This conversation between Nick Seaver (Tufts, Anthropology), Kyunghyun Cho (NYU Center for Data Science), and Grace Lindsay (NYU Center for Data Science), looks back to a moment before attention became “all you need,” when attention mechanisms were imagined and introduced.
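
Tim O'Donnell's November 8 abstract above describes constrained generation via sequential Monte Carlo. The toy sketch below is not the framework from that talk; it is a minimal, hypothetical illustration of the underlying SMC loop: extend a population of partial sequences (particles), reweight them by how well they satisfy a constraint, and resample so that computation is reallocated to sequences that can still be completed legally. The uniform "language model" over a four-token vocabulary and the hard no-"x" constraint are stand-ins invented for the example.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "x", "<eos>"]

def propose(prefix):
    """Toy stand-in for a language model's next-token sampler."""
    return random.choice(VOCAB)

def satisfies_constraint(seq):
    """Hard constraint for the example: the sequence must never contain 'x'."""
    return "x" not in seq

def smc_generate(num_particles=20, max_len=6):
    particles = [[] for _ in range(num_particles)]
    for _ in range(max_len):
        weights = []
        for seq in particles:
            if not seq or seq[-1] != "<eos>":   # extend unfinished sequences by one token
                seq.append(propose(seq))
            # Reweight: particles violating the constraint get weight zero.
            weights.append(1.0 if satisfies_constraint(seq) else 0.0)
        if sum(weights) == 0:
            raise RuntimeError("every particle violated the constraint")
        # Resample in proportion to the weights, concentrating computation
        # on prefixes that still satisfy the constraint.
        particles = [list(p) for p in
                     random.choices(particles, weights=weights, k=num_particles)]
    return particles

for seq in smc_generate()[:5]:
    print(" ".join(seq))
```

A real system would propose tokens from an actual model's next-token distribution and use soft weights (likelihoods or constraint potentials) rather than a hard zero-one check, but the extend/reweight/resample structure is the same.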

Organizers

The CDS Seminar Series is organized by Jonathan Niles-Weed (Associate Professor of Mathematics and Data Science), Carlos Fernandez-Granda (Associate Professor of Mathematics and Data Science), Tal Linzen (Associate Professor of Linguistics and Data Science), and Yanjun Han (Assistant Professor of Mathematics and Data Science).

Attendance Information

The CDS Seminar Series takes place every Friday from 2:00pm–4:00pm. Talks are scheduled from 2:00pm–3:30pm, followed by a reception for attendees from 3:30pm–4:00pm. All members of the CDS community are encouraged to attend these sessions featuring cutting-edge research and insights from leading experts in the field of data science.
