by: Ketan Barve, NYU Master’s Candidate
Are you a Big Data developer often unsure which techniques to use for ETL? Traditional tools like data warehousing and data mining are becoming obsolete, and it can be hard to know which technology suits which kind of data. Welcome to ETL (Extract, Transform and Load).
Data scientists are turning to Big ETL techniques to clean up raw data, several of which were presented at an event hosted by Simulmedia and Caserta Concepts last month. The keynote speakers presented these techniques as a decision path for data scientists working with technologies like Hadoop, MapReduce, Hive, Pig and Python.
Use of Pig, Hive, Impala and Spark to Perform ETL
Elliott Cordo, Chief Architect of Caserta Concepts, explained how Big ETL is particularly useful for parallel processing of large semi-structured or unstructured datasets. Many GUI-based ETL tools are available on the market, but their limitations keep them from being the tool of choice: they offer little flexibility for custom engineering, and complex operations are often hard to represent. That is why a data scientist should invest time in customized techniques such as:
- Pig: Pig is SQL-friendly, and it is especially useful for pulling apart unstructured and nested data such as JSON or text.
- Hive: Hive is closer to SQL, so it is better suited to tabular data, or data that can easily be projected as tabular.
- Impala: Impala is extremely fast but not suited to heavy ETL workloads; it is best used as a query tool over data already prepared by upstream ETL.
- Spark: Spark acts as a Swiss Army knife for almost every kind of workload: streaming, batch, interactive, SQL and graph. Its API is available in Java, Python and Scala, making it convenient for custom development.
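To make the extract-transform-load pattern these tools implement concrete, here is a minimal sketch in plain Python. The record layout (a nested "user" field and an "event" field) is hypothetical, chosen only to illustrate the kind of nested JSON parsing the speakers attributed to Pig and Spark:

```python
import json

def extract(raw_lines):
    """Extract: parse raw JSON lines, skipping malformed records."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except ValueError:
            continue  # bad record; a real pipeline would log or quarantine it

def transform(records):
    """Transform: flatten a nested field and normalize casing."""
    for rec in records:
        yield {
            "user": rec.get("user", {}).get("name", "").lower(),
            "event": rec.get("event", "").upper(),
        }

def load(rows):
    """Load: collect locally here; a real job would write to a warehouse."""
    return list(rows)

raw = [
    '{"user": {"name": "Alice"}, "event": "click"}',
    'not json',  # malformed line, dropped during extract
    '{"user": {"name": "Bob"}, "event": "view"}',
]
result = load(transform(extract(raw)))
print(result)
# → [{'user': 'alice', 'event': 'CLICK'}, {'user': 'bob', 'event': 'VIEW'}]
```

In a Spark job the same three stages would map onto parallel transformations over a distributed dataset rather than Python generators, which is what makes the approach scale to large inputs.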
Python: A Strong Option for Customizing ETL
Python was also discussed at the event as a way to augment customized ETL techniques, particularly for datasets that are hard to process with off-the-shelf tools. Kyle Hubert, Principal Data Architect at Simulmedia, demonstrated this concept using Hadoop Streaming. The approach is a solid option for customization, although any data scientist using it needs to be very comfortable with Python.
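Hadoop Streaming works by letting any executable that reads stdin and writes stdout serve as a mapper or reducer, with Hadoop handling the shuffle and sort in between. Below is a minimal word-count-style sketch in Python; the counting task is illustrative only, not the actual pipeline demonstrated at the event:

```python
from itertools import groupby

def mapper(stream):
    """Map step: emit one tab-separated (word, 1) pair per token."""
    for line in stream:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(stream):
    """Reduce step: sum counts per key. Hadoop Streaming delivers mapper
    output to the reducer sorted by key, so consecutive lines with the
    same key can be grouped directly."""
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorted() stands in for Hadoop's shuffle/sort phase.
# In production each function would read sys.stdin and print its output.
shuffled = sorted(mapper(["big data", "big etl"]))
counts = list(reducer(shuffled))
print(counts)
# → ['big\t2', 'data\t1', 'etl\t1']
```

The appeal of this model is that the mapper and reducer are ordinary scripts, so any Python logic (parsing, cleansing, enrichment) can be dropped into a Hadoop job without writing Java.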
Debate on the Existence of ETL
This event provided a great overview of ETL tools, which are integral to a data scientist's decision making. Some, however, believe ETL is being undermined by the obsolescence of traditional warehousing. For example, Phil Shelley, CTO of Sears Holdings and CEO of Metascale, argued in an Information Week debate (December 3, 2012) that ETL is expensive in terms of people, software licensing and hardware, and therefore not sustainable as a cost-effective tool. He also called it a non-value-added activity that will not exist in a Hadoop data architecture.
One individual who disagrees with this view is James Markarian, CTO of Informatica. He believes the question is not "Are we eliminating ETL?" but "Where does ETL take place in a Hadoop environment, and how are we extending or changing its definition?"
This debate will continue among CTOs as ETL evolves. Undoubtedly, ETL needs to remain flexible and adapt to the performance, scale and latency demands of modern applications.
For detailed information, refer to the presentation slides:
Presentation Slides by Elliott Cordo, Chief Architect, Caserta Concepts
To read the detailed debate on the existence of ETL, refer to the following link: