Whether it's tracking the performance of a Roth IRA or gathering demographics for product testing, data plays an important role in our personal and professional lives every day. And while data can provide helpful insights and ease everyday tasks, leveraging it for deeper insights can pose real challenges. In this Dev Delve, we’ll focus on two common issues when it comes to working with data—the prevalence of “data silos” and the work involved in solving the “data integration problem.” We’ll unpack these terms and uncover the underlying details that contribute to each of these challenges.
Data is everywhere. Software is eating the world, and data is left in its wake. However, there’s gold in those mountains of data.
Siloed data prevents you from getting a complete picture. You’re at the mercy of the narrow analytics within each application, and it makes data analysis that incorporates multiple sources impossible.
The data integration problem awaits, but there are solutions. Even when you have access to your data, you need to know how to connect all of that information.
Data is everywhere.
In 2011, entrepreneur and software engineer Marc Andreessen proclaimed in a now-famous article that “software is eating the world.” And, as software has worked its way into every corner of industry, a huge variety of data has accumulated—whether it’s being passively captured or actively entered. It may not be long before we see an article titled “Data is Eating the World.”
In any case, as data continues to accumulate in the wake of the great software invasion, it’s moved from playing an ancillary role to a primary role in generating value. Organizations and individuals are realizing untapped value sitting in these piles of data, while new techniques and technologies for analyzing it have begun to emerge.
What are data silos?
But as the old saying goes, there’s no such thing as a free lunch. The opportunity present in leveraging data for greater value also comes with challenges. Chief among these challenges is that data accumulates in “hard-to-reach” places, otherwise known as siloed data.
Akin to how a silo (like the one pictured above) stores, separates, and protects grains, a data silo refers to an application, database, or other data store that keeps data separated from the outside world.
There are several reasons data silos exist, from technical to political to purely incidental. (Note: For more context regarding the "whys" of these silos, see this HBR article on the subject.) Here, our focus is on what these silos mean in practice, and how to overcome the hurdles they present.
In practical and technical terms, the “wall” of a data silo is made of two parts:
- difficulty in extracting data, and
- difficulty in making use of extracted data.
If you’re using an application that doesn’t have export functionality, you’re forced to go “around the application” itself and access the underlying data directly (e.g., through an SQL database used by that application). In some cases, like with many cloud-based applications, this method just isn’t possible.
If you do get past this hurdle and have direct access to an underlying data store, you may find yourself with an obfuscated or encrypted dataset, or formats that change without notice when the application is upgraded, or perhaps tables of data with seemingly random column headers which don’t provide any useful documentation. In this case, I’m afraid you’re navigating through the “Data Temple of Doom” with Dr. Jones.
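As a rough illustration of going “around the application,” here is a minimal sketch of inspecting an unfamiliar data store to see what tables and columns it holds. It assumes the underlying store is a SQLite database; the table and column names (t_usr, c01, x17, and so on) are invented to mimic the cryptic headers described above, and an in-memory database stands in for a real application file.

```python
import sqlite3

def describe_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    tables = [row[0] for row in cur.fetchall()]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in cols]
    return schema

# Hypothetical stand-in for an application's private database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_usr (c01 INTEGER, c02 TEXT, c03 TEXT)")
conn.execute("CREATE TABLE t_res (c01 INTEGER, x17 REAL, x18 TEXT)")

schema = describe_tables(conn)
print(schema)
```

A listing like this tells you what exists, but not what x17 means. That missing documentation is the second half of the silo wall.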
Wait, what’s the problem?
As with any problem, a good first question is asking if solving this challenge is worth the effort. If you have at least some visibility into the siloed data through their respective applications, what’s the problem?
Issue 1: You’re at the mercy of the application and the insights it will allow you to actually capture from its stored data.
Though we have powerful, modern analysis techniques at our fingertips, we can’t expect every application to implement them. Instead, in a classic division of labor, an application should be focused on capturing and using quality data, while specialized tools and environments focus on analysis—allowing us to use sophisticated techniques such as regression analysis and machine learning. We don’t want to duplicate analysis efforts in every piece of software we use, which is why we need data to be portable.
Issue 2: In a world of data silos, it’s impossible to do data analysis that incorporates multiple sources.
Imagine an investigative reporter who was stuck with a single source for a story, or a prosecutor who was stuck with a single witness for trial. Creating meaningful stories and actionable insights using data requires looking at the organization and its results through many lenses. This is the only way we’re able to connect the dots between business drivers and business results. Otherwise, we’re left seeing symptoms of our successes or failures, and only guessing at the underlying causes.
Varying lenses make the same subject appear different. Similarly, looking at the organization from different viewpoints (or rather, data sets) may change your understanding of how the organization operates.
(image source: http://www.danvojtech.cz/)
The challenge awaiting your data
Because of the increasing demand for data analysis and its growing pervasiveness, many applications today provide a way to access data, whether that’s through raw exports or APIs. This has been a critical step in the right direction. But, even with the dots in front of us, we’re still left with the problem of connecting them. That’s why the “data integration problem” is one of the most popular and pressing topics in today’s world of data. Given all the data in these sources we can access, how can we bring all that information together to create sensible, useful results?
Connect the data dots.
Currently, the solutions we have for this problem involve transforming data into structures with common fields. Whether we’re talking about a Hadoop cluster, an SQL-based data warehouse, or an in-memory analysis tool, the process of connecting dots is a matter of finding equivalences in the data for things such as people, time, places, accounts, loans, invoices, or any other entities across the organization.
In practice, we’re having to define many rules, such as “the EID column in the user table from our accounting database should match the User ID field in the JSON we’ve pulled from the API for our web analytics software.” When it comes to the future of data science, we should anticipate some key advances around making the process of connecting dots easier, or even automated.
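A rule like the one above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the extra fields (name, department, page_views), and the helper names normalize_id and join_on_equivalence are assumptions made up for the example, not part of any particular accounting or analytics product.

```python
import json

# Hypothetical rows exported from the accounting database's user table.
accounting_users = [
    {"EID": "1001", "name": "Ada", "department": "Support"},
    {"EID": "1002", "name": "Grace", "department": "Sales"},
]

# Hypothetical JSON payload pulled from a web analytics API.
analytics_json = """
[{"User ID": 1001, "page_views": 42},
 {"User ID": 1003, "page_views": 7}]
"""

def normalize_id(value):
    """Bring IDs into one comparable form (string, no surrounding whitespace)."""
    return str(value).strip()

def join_on_equivalence(left, left_key, right, right_key):
    """Inner-join two lists of dicts where left[left_key] matches right[right_key]."""
    index = {normalize_id(row[right_key]): row for row in right}
    joined = []
    for row in left:
        match = index.get(normalize_id(row[left_key]))
        if match is not None:
            # Keep the left row's fields and add the right row's non-key fields.
            extra = {k: v for k, v in match.items() if k != right_key}
            joined.append({**row, **extra})
    return joined

analytics_rows = json.loads(analytics_json)
combined = join_on_equivalence(accounting_users, "EID", analytics_rows, "User ID")
print(combined)
```

Note that the rule is more than "these two columns match." One source stores the ID as text and the other as a number, so the equivalence also has to say how values are compared. Only users present in both sources survive the join.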
Once we’ve defined these equivalences and connected the dots between data sets, we can start to tell end-to-end stories across systems and answer questions, such as:
- How is our customer lifetime value related to customers’ access to our online resources?
- How are the demographics of our borrowers and depositors changing year over year?
Or, in the world of learning:
- How did participation in our new smartphone training program affect the stores lagging most in new product sales?
- Do the customer service employees who spend more time training perform better than others in their department?
Eliminating Data Silos in Practice
Use Case: SCORM & xAPI
There are many interoperability standards and processes to integrate data. Since learning interoperability standards are the ones in which we work most, I’ll use that as the example use case for how we approach breaking down siloed data.
In the world of learning technology, we’re witnessing and helping drive this progression from a world of siloed data to an integrated data ecosystem. We’re helping organizations bring together information about:
- their workforces or customer bases,
- the training programs and platforms that have been created for those audiences, and
- the end results for the business itself.
The consumers of this information find immediate value in even just the visibility they’ve been granted, to understand what their learners are doing, as well as where and when that learning is taking place.
This progression is in large part due to the shift from SCORM- to xAPI-based learning technologies, which offer direct parallels to the effects of data silos and the benefits of data portability.
Just because it’s a standard doesn’t mean it provides comprehensive interoperability.
Though SCORM provided a standard that was crucial for communication interoperability, it gave us little in the way of data interoperability.
Through the SCORM API, an LMS could launch and communicate with a piece of content in a standard way. But once that course was taken, all the information which had been captured about course completions, score results, time spent, attempts, or anything else, was persisted into wildly different underlying data stores.
For example, LMS A stores learner information in separate tables from module completions, while LMS B stores all results in a flat format. The two LMSs share no common set of column names for CMI interaction IDs or course metadata. Getting useful results data out of traditional LMSs has been difficult at best. And when it can be done, we’re left dealing with the data integration problem and navigating the Temple of Doom to find equivalences in the data that let us connect the dots.
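To illustrate the kind of mismatch described here, the sketch below models two made-up LMS exports and the per-system mapping code needed to bring them into one shape. Every table layout, column name, and score scale in it is hypothetical; real LMS schemas will differ, which is exactly the problem.

```python
# LMS A (hypothetical): learners and completions live in separate tables,
# and scores are stored as percentages.
lms_a_learners = {7: {"email": "ada@example.com"}}
lms_a_completions = [{"learner_id": 7, "module": "safety-101", "score_pct": 85}]

# LMS B (hypothetical): one flat row per result, different column names,
# and scores are stored on a 0-1 scale.
lms_b_results = [{"usr_mail": "grace@example.com", "crs": "safety-101", "scr": 0.9}]

def from_lms_a(learners, completions):
    """Resolve the learner lookup and rescale scores to 0-1."""
    for c in completions:
        yield {
            "learner": learners[c["learner_id"]]["email"],
            "course": c["module"],
            "score": c["score_pct"] / 100,
        }

def from_lms_b(results):
    """Rename LMS B's flat columns to the shared field names."""
    for r in results:
        yield {"learner": r["usr_mail"], "course": r["crs"], "score": r["scr"]}

unified = list(from_lms_a(lms_a_learners, lms_a_completions)) + list(
    from_lms_b(lms_b_results)
)
for record in unified:
    print(record)
```

Each additional LMS needs its own hand-written mapping function, because nothing in SCORM says how results should be stored or named once the course is finished.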
(Image source: Scorm.com)
In contrast, xAPI is a standard built from the ground up to support data interoperability and portability. While doing the research that led to the xAPI standard, one of the most resounding complaints we heard from the industry at large was regarding the impenetrable data silo that the LMS represented.
Beyond standardized communication, to standardized data
xAPI goes beyond the concept of a communication standard to also incorporate a flexible, standard format for the data itself. Though an LRS still has the freedom to persist the data in whatever way makes sense—from SQL tables to JSON document stores—the xAPI specification details how that data can be extracted from it.
This interface addresses both of the problems we’ve gone over in this article. Not only does the xAPI spec tell LRSs how to open up the wall of what would be yet another data silo, it also addresses the data integration problem by defining the common fields that all xAPI data can target. This means that targeting the xAPI format with a given dataset leads directly to defining the equivalences we need to tell end-to-end stories with data, across people, groups, and systems. Though defining data in xAPI terms might be work in the short term, having the data integrated pays huge dividends when it comes to analyzing it and asking questions.
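To make “targeting the xAPI format” more concrete, here is a minimal sketch of mapping one result row onto an xAPI statement. The actor, verb, object, result, and timestamp properties, and the ADL “completed” verb IRI, come from the xAPI specification. The input row, its column names, the example.com course IRI, and the helper name to_xapi_statement are assumptions for illustration. The sketch does not validate the statement or send it to an LRS.

```python
import json

# Verb IRI from the ADL vocabulary.
COMPLETED_VERB = "http://adlnet.gov/expapi/verbs/completed"

def to_xapi_statement(row):
    """Map a hypothetical result row onto the common xAPI statement fields."""
    return {
        "actor": {"objectType": "Agent", "mbox": "mailto:" + row["email"]},
        "verb": {"id": COMPLETED_VERB, "display": {"en-US": "completed"}},
        "object": {
            "objectType": "Activity",
            # Activity IDs must be IRIs; this base URL is a placeholder.
            "id": "https://example.com/courses/" + row["course_code"],
        },
        "result": {"completion": True, "score": {"scaled": row["score"]}},
        "timestamp": row["finished_at"],
    }

# Hypothetical row, as it might come out of an LMS export.
row = {
    "email": "ada@example.com",
    "course_code": "safety-101",
    "score": 0.85,
    "finished_at": "2024-01-15T10:30:00Z",
}

statement = to_xapi_statement(row)
print(json.dumps(statement, indent=2))
```

Once every source is expressed this way, the "who," "what," and "when" of each record sit in the same fields regardless of where the data originated, so the equivalences are defined once, at mapping time.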
Integrating a world without silos
While we’re gaining access to important data from multiple sources every day, we need to connect that data to see a complete picture and fully comprehend what’s happening. As we observe ourselves through an increasing number of lenses, we’re faced with the challenge of unifying those views at the data level. Though many avenues are being explored to ease the integration of data, targeting a common format, such as xAPI, has proven to be a viable strategy that has allowed many of us to gain novel insights across the organization.