Data Science Professor Profiles: David Hogg, Physics and Astronomy

This is the fourth article in a series profiling NYU Center for Data Science professors, exploring the origins of their interest in data science and their thoughts on the Moore/Sloan Data Science Environment Initiative.

David W. Hogg, NYU Associate Professor with tenure, Department of Physics, Director of Undergraduate Studies—Physics, Center for Cosmology and Particle Physics, and Adjunct Senior Staff Scientist at the Max Planck Institute for Astronomy in Heidelberg, Germany, is a physicist, astrophysicist and cosmologist. Hogg is also Executive Director of the Moore/Sloan Data Science Environment Initiative. His work at NYU focuses on fundamental cosmological measurements, stellar dynamics and exoplanet characterization.

Originally from Toronto, Canada, Hogg obtained his Ph.D. in Physics from the California Institute of Technology and his B.S., also in Physics, from the Massachusetts Institute of Technology. In 1988, he received an Award of Merit at the International Physics Olympiad (an annual global physics competition for high school students), held that year in Bad Ischl, Austria. Widely recognized for his teaching excellence, Hogg has received New York University’s “Golden Dozen” Teaching Award, Princeton University’s Engineering Council Teaching Award and Caltech’s Undergraduate Teaching Award.

The Moore/Sloan Data Science Environment Initiative—a bold new partnership between New York University, the University of California, Berkeley and the University of Washington, supported by a $37.8 million grant from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation—seeks to harness the potential of data scientists and big data for basic research and scientific discovery.

What role did you play in NYU’s pursuit of the Moore/Sloan award, and how did you become Executive Director?

I became the Executive Director by making the mistake of herding the cats. Writing the proposal was a challenging project because this is a group of people who, although we all know each other, are in different departments in different disciplines, with very different ideas about what constitutes a proposal. One of the things I love about being at NYU is that I’ve gotten to know people from all different areas because of our overlapping interests in statistical methods and data analysis.

I ended up taking a leadership role in making the proposal come together. The nice thing about the project is that it’s broken up into working groups with specific objectives, with a very capable leader for each of those working groups. So all I have to do is make sure the working groups come together and achieve their goals. People are so committed to this that I am actually very optimistic, and we have only just started.

Why do you think NYU was one of the three institutions chosen?

They haven’t told us in great detail why they chose us. But one thing that was definitely a factor was that they saw data science opportunities all over the University. Every component of the University was thinking about this.

Another reason, which they did mention to us, is that the administration of the University was really behind it, and involved, and understanding it. Dave McLaughlin, our Provost, has made data science one of his principal emphases here; it has been an area where he’s had a huge impact on the University.

An interesting thing about the Initiative is that they chose three institutions which are at very different stages in terms of data science. UW in Seattle is the most advanced, with an already-established eScience Institute. Berkeley’s effort, led by Nobel Prize winner Saul Perlmutter, is just starting; they have lots of very interesting data science ventures going on, but haven’t yet figured out how to connect them together. NYU is in the middle; we’re in the process of connecting our data science efforts. I think it was intentional to choose three institutions that are at three different stages of development because they want us to learn from each other; we can repeat each other’s successes and not repeat each other’s mistakes.

Will the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation be involved in the Initiative or are they merely the funding agencies?

The Moore and Sloan foundations have been extremely interactive with us in the process. It’s not just like they wrote us a check. In the initial proposal-writing stage, we talked to them twice a month by phone. They were very involved with what they wanted, and read and commented on our drafts before we submitted the final proposal. They are very encouraging, making sure the three institutions interact as a threesome.

It’s very exciting, because they want this to succeed in non-traditional ways. They want this to create big ideas. And they want to promote those big ideas. They’re particularly interested in building best practices―long-term, effective, sustainable models that will exist in all the universities in the world, eventually. They want to create a paradigm for how academic institutions can benefit from interdisciplinary interactions, as well as promote young people who work in data science, thereby making all of science more productive. They’re not just trying to get some science done. They are trying to change science.

What are some of the things the two foundations are hoping to accomplish?

Everybody in science around the world is realizing there is a set of problems that aren’t being addressed. One is that they have more data than they can comfortably deal with, and those data are complex and heterogeneous in innumerable ways.

Another is that there are young people who do brilliant things with databases and data analysis and machine learning methods, but who often get overlooked by the academic job market. So they leave and go to work for banks or dot-coms or whatever. The Moore and Sloan people feel that with a little push in the right direction, they might have a big impact on both of these problems right now. The goal is to create ways to recognize someone who, for instance, is in the biology department but who works very closely with cutting-edge applied math, and is neither a biologist nor a mathematician. Those are our target people, the ones we are really concerned about, because they are the people who make science work, and right now they often aren’t the ones most rewarded academically.

Are you presently teaching at the Center for Data Science?

I will teach there; that’s the plan. Right now, we are figuring out how the different academic units work together: how teaching is shared (who teaches my physics courses while I’m teaching data science), whether I need two offices, and how research grant funding would be distributed. That’s another area in which the Moore and Sloan people really want to have an influence. They want to help universities figure out how to structure these agreements and interactions so that everybody feels mutually benefited.

What are the inaugural events that will take place as part of the Initiative?

We plan to do seminars, software demos and data visualization workshops. We should have our first event in March, and hope to start our seminars then also, if we can. The events will be here at NYU, but we are hoping to have the seminars mirrored at the other locations, either by simulcasting or by holding a parallel event. There is a lot of commonality among the three universities, so we can learn from each other and share resources and ideas.

How did your interest in data science come about?

I knew from a very early age that I wanted to do science or engineering for a career. I loved Lego and computers. When the Commodore PET appeared, I immediately learned how to program it, probably when I was 10 or 11.

I’m in the sweet spot for computers [born in 1970] because if you’re a little older than me, there weren’t computers until you were in your teens, which is a little bit late to really connect with them. And if you’re a little younger than me, computers became these slick black boxes that are already programmed to do things for you. So I was exactly the right age to fall in love with the computer as a programmable, reconfigurable, experimental thing―when it was at its intellectually most powerful.

My interest in data science partly came from NASA’s policy of making all of the data from their missions public. When I was a graduate student, the Hubble Space Telescope started releasing data on the Hubble Deep Field. I thought, “Well, the data are free,” and I started doing science on that data set. And that started me thinking about what are the best ways to use data, and how do we learn things from data.

Astronomy, my field, is ahead of the curve a little bit in some of these respects because we share our data, and we have enormous data sets. Everybody in the astronomy community has had to learn how to visualize, analyze and manage large amounts of data. A lot of the data analysis tools that are out there in the world come from astronomy, in some part, because we have shared our methods, our data and our code. In some sense, data science is all about sharing these things, especially across traditional disciplinary boundaries.

By ML Ball