Challenges and Achievements of a Data Scientist – An Interview with Suzanne McIntosh, Adjunct Professor at New York University and Technical Consultant and Curriculum Developer at Cloudera Inc.
Prof. Suzanne McIntosh, an Adjunct Professor at New York University and a technical consultant and curriculum developer at Cloudera Inc., recently presented at the Supercomputing 2014 (SC14) conference in New Orleans. We interviewed her about her work in the field of data science and about the conference.
Q. Tell us about your experience and challenges while working at IBM as a data scientist.
S: About five years ago, I was part of a research team drawn from IBM Watson Research USA, IBM Research – Zurich, and the IBM China Development Lab that studied data center energy inefficiency and ways to optimize energy utilization.
A few things made our job challenging:
The data sources we needed were siloed in data warehouses and data marts.
Data warehouses have finite storage, so to conserve space, data was aggregated and eventually pruned.
There was a sliding window of available information and not enough historical information at fine granularity.
Each data source was governed by a different organization.
The first obstacle was gaining permission to access the data; the second was identifying ways to join data from these disparate systems into a complete picture on which we could run analytics. We were able to employ large-scale data analytics to gain actionable insights, which we then leveraged to make actuation decisions. Our goal was to create a blueprint for autonomic data center energy management.
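To illustrate the kind of cross-silo join described above, here is a minimal sketch in Python using pandas. The file names, column names, and thresholds are hypothetical and are not drawn from the actual IBM project, which used Hadoop-scale tooling rather than a single-machine library.

```python
import pandas as pd

# Hypothetical exports from two siloed systems: a facilities data mart
# (power readings) and an IT data warehouse (server utilization).
power = pd.read_csv("facilities_power_readings.csv",
                    parse_dates=["timestamp"])        # server_id, timestamp, watts
utilization = pd.read_csv("it_server_utilization.csv",
                          parse_dates=["timestamp"])  # server_id, timestamp, cpu_pct

# Align both feeds to a common hourly granularity before joining,
# since each source was aggregated on its own schedule.
power_hourly = (power
                .groupby(["server_id", pd.Grouper(key="timestamp", freq="1h")])["watts"]
                .mean()
                .reset_index())
util_hourly = (utilization
               .groupby(["server_id", pd.Grouper(key="timestamp", freq="1h")])["cpu_pct"]
               .mean()
               .reset_index())

# Join on server and hour to compose a single table suitable for analytics,
# e.g. flagging servers that draw high power at low utilization.
combined = power_hourly.merge(util_hourly, on=["server_id", "timestamp"], how="inner")
idle_but_hot = combined[(combined["cpu_pct"] < 10) & (combined["watts"] > 200)]
print(idle_but_hot.head())
```

The point of the sketch is the workflow, not the library: once previously siloed sources can be brought onto a common key and granularity, the combined table becomes the substrate for the kind of energy analytics the team pursued.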
Q. What is your field of interest in data science?
S: I am most interested in distributed compute architectures, especially combined hardware/software approaches, and in leveraging virtualization and cloud computing for optimized performance.
Q. Are you mentoring any research students at NYU? If not, are you mentoring any students outside of NYU?
S: This past summer (2014), I mentored a former student who was awarded the NYU Computer Science Innovation Fellowship to continue the analytics project he developed for the course I teach.
Q. Tell us more about the Supercomputing 2014 conference you recently attended in New Orleans.
S: I recently presented on High Performance Computing (HPC) education at the Supercomputing 2014 (SC14) conference in New Orleans. I also served on the Technical Program Committee for the SC14 Analytics, Storage, and Visualization Track. The big takeaway from the conference is that many HPC practitioners in academia and government labs, internationally, are beginning to use Hadoop, and even to extend it to satisfy their unique Big Data needs.
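As one small illustration of how practitioners can plug their own code into Hadoop, below is a minimal Hadoop Streaming job written in Python. It is a generic sketch, not one of the site-specific extensions discussed at SC14; the input paths, jar location, and field layout are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming job: count log records per host.

Example invocation (paths and jar location are illustrative):
  hadoop jar hadoop-streaming.jar \
      -input /logs/raw -output /logs/counts \
      -mapper "python3 hostcount.py map" -reducer "python3 hostcount.py reduce"
"""
import sys

def mapper():
    # Emit "host<TAB>1" for each input line; assumes the host is the first field.
    for line in sys.stdin:
        fields = line.split()
        if fields:
            print(f"{fields[0]}\t1")

def reducer():
    # Streaming sorts mapper output by key, so counts for a host arrive contiguously.
    current_host, count = None, 0
    for line in sys.stdin:
        host, _, value = line.rstrip("\n").partition("\t")
        if host != current_host:
            if current_host is not None:
                print(f"{current_host}\t{count}")
            current_host, count = host, 0
        count += int(value)
    if current_host is not None:
        print(f"{current_host}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()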
Q. How do you motivate NYU students for research? What are the key features you consider while designing your course?
S: Students who enroll in my course will complete an analytics project of their own choosing and design. The students often hear me say ‘put on your research hat and take a risk’. This is important because software engineers and computer scientists are risk-averse – they generally focus on building robust and deterministic products for consumers, whether the consumers are external (paying customers) or internal (e.g. the marketing department).
Researchers, on the other hand, have the freedom to pursue the what-if questions without the types of constraints that are placed on product-focused engineers (high stability, high reliability, code maintainability, all corner cases addressed, strict project schedules, etc.).
I use my course to offer a sample of what it’s like to work as a researcher. Students survey the state-of-the-art literature, formulate their own projects, write a research paper, and present their findings. Most importantly, project formulation is an iterative process of discovery rather than a rigid one-shot opportunity. The iterative process creates an environment that fosters risk taking.
Q. In your view, what will the future of data science look like in the coming years?
S: The ability to co-locate large-scale, previously siloed data has facilitated advances in the development and application of tools for knowledge discovery. Open problems remain. Collaboration between data scientists and domain experts in other disciplines is essential for solving the immediate set of challenges, but anticipated future challenges will require researchers with combined data science and domain-specific expertise.
Ms. McIntosh previously worked at IBM Watson Research as a technical team lead and research staff member. She developed extensive analytics for business and energy applications using Hadoop and data warehousing tools. She holds several patents covering cloud technologies, virtualization, data processing, and security. If you would like to discuss any of the topics above, or research on Big Data analytics and cloud-based technologies, with Prof. McIntosh, please contact her at mcintosh@cs.nyu.edu.
Story by Ketan Barve, master’s degree candidate and writer for the Center for Data Science.