In this Dev Delve, we’ll explore a topic easily overlooked when viewing data science as a whole: What are the best practices for managing and transforming data to create the results that provide insights or predictions?
Data Transformation Explained
People who explore data science often work with data that’s already persisted in a tabular format, where columns represent dimensions of the data and rows represent individual data points. This tabular format is a critical part of data processing and machine learning: many tools are optimized for working with arrays of data, and it’s the typical input for many machine learning algorithms. But our story doesn’t end with tabular data. Or, rather, it doesn’t always begin with tabular data.
Data is created in many formats and may have to go through a number of transformations before it can be analyzed. This series of transformations and processing steps is referred to as a data pipeline.
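To make the idea of a pipeline concrete, here is a minimal sketch in Python. The sample data, the step names (parse, clean, convert), and the run_pipeline helper are all assumptions made up for illustration; real pipelines usually have more steps and run on dedicated tooling.

```python
import csv
import io
from functools import reduce

# Hypothetical raw input: a small CSV export with messy values.
RAW_CSV = """name,score
 Ada ,91
Grace,
Linus,78
"""


def parse(text):
    """Step 1: turn raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: trim whitespace and drop rows with a missing score."""
    return [
        {"name": row["name"].strip(), "score": row["score"].strip()}
        for row in rows
        if row["score"].strip()
    ]


def convert(rows):
    """Step 3: convert scores from strings to integers."""
    return [{**row, "score": int(row["score"])} for row in rows]


def run_pipeline(data, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, data)


# The pipeline itself is just an ordered list of transformations.
PIPELINE = [parse, clean, convert]
result = run_pipeline(RAW_CSV, PIPELINE)
print(result)
```

Because each step is a small function that takes data in and hands data back, steps can be reordered, replaced, or tested one at a time, which is where much of the reduced friction comes from.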
Some say 80 percent of the work in data science involves acquiring, cleaning, and transforming data. This principle suggests that getting the data ready to analyze is often more work than the actual analysis. That’s why it’s important to make shaping and moving data as easy as possible, as this reduces friction in making useful data pipelines.
Transformation of Data at Its Best
The key elements of good transformation are flexibility, interactivity, and declarative models. Looking at the recent history of large-scale data processing technology gives us insight into these elements.
This includes the introduction of MapReduce as a model for using networks of commodity hardware to achieve very high levels of parallelization in analysis. This has had a huge impact on many industries by putting serious data-processing capabilities in a lot of new hands.
But data practitioners have run into a number of challenges in getting value out of MapReduce clusters. Running MapReduce jobs on a platform such as Hadoop can lead to considerable lag time between asking a question and getting an answer. Programming data transformations in Hadoop is very flexible, but it lacks interactivity because of the way the work is carried out on the cluster.
As the core MapReduce platform was adopted, the first projects to spring up around it were things such as Pig and Hive, which promoted declarative transformations. This means instead of programming the instructions for transforming the data line by line, we declare the results we want to the processing engine, which optimizes the steps for reaching that result. It’s the difference between writing a Python script to read each row of a table and coming up with a result, versus writing a SQL query that simply declares what you want and letting the query optimizer do the rest.
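The contrast between the two styles can be sketched with Python’s built-in sqlite3 module. The scores table and its contents are hypothetical sample data; both approaches compute the same average score per learner, but only the first spells out how.

```python
import sqlite3

# Hypothetical sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (learner TEXT, course TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("ada", "sql", 90),
        ("ada", "python", 80),
        ("grace", "sql", 70),
        ("grace", "python", 90),
    ],
)

# Imperative: read every row and spell out each step of the computation.
totals = {}
counts = {}
for learner, _course, score in conn.execute(
    "SELECT learner, course, score FROM scores"
):
    totals[learner] = totals.get(learner, 0) + score
    counts[learner] = counts.get(learner, 0) + 1
imperative = {learner: totals[learner] / counts[learner] for learner in totals}

# Declarative: state the result we want and let the engine plan the work.
declarative = dict(
    conn.execute("SELECT learner, AVG(score) FROM scores GROUP BY learner")
)

print(imperative)
print(declarative)
```

The declarative version is shorter, and it also leaves the database free to choose how to scan, group, and aggregate the rows.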
Today, we’re seeing stronger interest in interactivity, which has led to growing demand for faster platforms (e.g., Spark), data science “workbenches” [e.g., RStudio (see below) or Jupyter Notebook], and even the resurgence of SQL. These all present a “live” environment for asking questions and promote better exploration, as analysts ask a series of questions that build on previous answers to understand the data.
Data Transformation Practice at Watershed
At Watershed, we always keep these elements of good transformation of data on our radar and apply them to the way we both ingest data and create valuable insights from it.
In our world, we use the xAPI data format as our core schema for analysis and transfer. Because of its growing adoption, we continue to benefit from this choice as additional systems start issuing xAPI natively. In cases where the data isn’t natively presented in xAPI, we’ve built a data transformation tool into our product that translates and imports tabular data (in the form of CSVs) into the learning record store (LRS) using the xAPI format.
We’ve made it our goal to create a very flexible import tool by allowing users to create transformations using the popular “handlebars” template language. This also presents a declarative interface: the user tells Watershed how the data should be transformed, while the internals of the import process remain free to evolve and improve. And, by including a way to preview the transformations when writing these templates, we’ve added the critical interactivity needed to quickly iterate and ensure you’ll get the right data imported into the learning record store (LRS).
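As a rough illustration of template-driven import, the sketch below maps one CSV row onto an xAPI-style statement (actor, verb, object). It is not Watershed’s implementation: Python’s string.Template stands in for handlebars, and the column names, the example.com URLs, and the render and preview helpers are all hypothetical.

```python
import csv
import io
import json
from string import Template

# Hypothetical CSV export from a system that does not speak xAPI.
CSV_TEXT = """email,name,course_id,course_name
ada@example.com,Ada,course-101,Intro to SQL
grace@example.com,Grace,course-102,Intro to Python
"""

# Declarative mapping: the shape of the statement, with $placeholders
# naming the CSV columns that fill it in. string.Template is a stand-in
# for a handlebars template here.
STATEMENT_TEMPLATE = {
    "actor": {"mbox": "mailto:$email", "name": "$name"},
    "verb": {
        "id": "http://adlnet.gov/expapi/verbs/completed",
        "display": {"en-US": "completed"},
    },
    "object": {
        "id": "https://example.com/courses/$course_id",
        "definition": {"name": {"en-US": "$course_name"}},
    },
}


def render(template, row):
    """Fill every string in a nested template from one CSV row."""
    if isinstance(template, dict):
        return {key: render(value, row) for key, value in template.items()}
    if isinstance(template, str):
        return Template(template).substitute(row)
    return template


def preview(csv_text, template, limit=1):
    """Render only the first few rows so a template can be checked quickly."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [render(template, row) for _, row in zip(range(limit), rows)]


statements = preview(CSV_TEXT, STATEMENT_TEMPLATE)
print(json.dumps(statements[0], indent=2))
```

The preview helper mirrors the interactivity point: rendering a row or two before a full import makes template mistakes cheap to catch.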
These principles are even more important as we analyze and aggregate the xAPI data stored in the LRS. In this case, we implement flexibility with a report configuration language that allows users to tell us, in a declarative way, how data should be filtered, organized, ordered, and presented. While the framework for how data is addressed in Watershed is very specific, the ways to use it are very open-ended—which allows us to answer many types of questions with the same tool. The speed at which we can deliver results is particularly important as users preview and refine these report configurations in our Explore feature, because it enables the crucial interactivity element, allowing users to ask a series of questions in rapid succession.
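A declarative report configuration might look something like the sketch below. This is not Watershed’s report configuration language: the filter, group_by, and order keys, the run_report helper, and the flattened sample records (simplified stand-ins for full xAPI statements) are all assumptions for illustration.

```python
from collections import Counter

# Simplified, flattened records standing in for stored xAPI statements.
STATEMENTS = [
    {"actor": "ada", "verb": "completed", "activity": "sql"},
    {"actor": "ada", "verb": "completed", "activity": "python"},
    {"actor": "grace", "verb": "completed", "activity": "sql"},
    {"actor": "grace", "verb": "attempted", "activity": "python"},
    {"actor": "linus", "verb": "attempted", "activity": "sql"},
]

# Hypothetical configuration: it says what the report should contain,
# not how to compute it.
REPORT_CONFIG = {
    "filter": {"verb": "completed"},
    "group_by": "actor",
    "order": "descending",
}


def run_report(statements, config):
    """Filter, group, count, and order records as the configuration asks."""
    matching = [
        record
        for record in statements
        if all(record.get(key) == value for key, value in config["filter"].items())
    ]
    counts = Counter(record[config["group_by"]] for record in matching)
    descending = config["order"] == "descending"
    return sorted(counts.items(), key=lambda item: item[1], reverse=descending)


report = run_report(STATEMENTS, REPORT_CONFIG)
print(report)
```

Changing one value in the configuration, for example grouping by "activity" instead of "actor", asks a different question without touching run_report, which is what makes quick rounds of preview and refinement possible.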
Iteration Leads to Innovation
When you’re thinking about the processes that govern your own data, remember to focus on the flexibility, interactivity, and declarative aspects of the transformations the data undergoes. Each of these elements is important in its own way, and bringing them together allows for the sort of open-ended, iterative development that’s key to driving innovation through data.