We navigate our world through mental “maps,” which give us a way to think about concepts and the relationships among them. The better our mental maps, the better we are able to understand cause-and-effect relationships, predict outcomes, and apply our knowledge to reality. With that in mind, this article is about one such mental map that can guide how we think about accessing data and unlocking its value. Specifically, we’ll start by looking at the shape of data.
To begin, pick out some specific data from your world and keep it in mind as we go along. This could be your inbox, a spreadsheet you’ve been working on, or maybe even a pile of xAPI statements.
What's the shape of your data?
Data doesn’t exactly live in the physical world. Sure, it’s stored on some electrical or magnetic medium, or maybe it’s even written down on paper, but for the sake of our discussion, the shape of data I’m talking about is really its digital representation. Here are some well-known, ubiquitous shapes of data that you are likely to encounter.
Tabular data is one of the first and still most fundamental representations of data today. Tables are often used to represent many instances of similar entities. For example, imagine a table of website access logs. In this case, the entity is an access event, and every event shares some common attributes—such as access time, IP address, and referring page. These entity attributes are represented as the columns of a table, while the rows represent each individual entity, and the cells of that row give us the values for that entity’s attributes.
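As a rough sketch of the access-log example, the snippet below reads a small table with Python's standard `csv` module. The column names and the log values are made up for illustration.

```python
import csv
import io

# Hypothetical access-log table; every value here is invented.
RAW_LOG_TABLE = """access_time,ip_address,referring_page
2024-01-05T09:15:00,203.0.113.7,https://example.com/home
2024-01-05T09:16:30,198.51.100.22,https://example.com/blog
2024-01-05T09:18:02,203.0.113.7,https://example.com/pricing
"""

reader = csv.DictReader(io.StringIO(RAW_LOG_TABLE))
columns = reader.fieldnames  # the attributes every access event shares
rows = list(reader)          # one dict per access event (one per table row)

# A cell is the value of one attribute for one entity.
first_referrer = rows[0]["referring_page"]
```

Each row comes back as a dictionary keyed by column name, which mirrors the entity, attribute, and value roles described above.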
Sometimes, though, tables might be used in other ways. Imagine a table of correlations between a set of variables. In this case, the columns and rows might both represent entities (the variables being tested), while the cells give us the value of some relationship between those entities (the correlation between those two variables). There are many such two-dimensional tables, where cells represent a value of a single attribute, and both columns and rows might represent entities or other attributes.
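To make the correlation-table idea concrete, here is a minimal sketch that builds one as a nested dictionary. The variable names and measurements are invented, and the `pearson` helper is a plain textbook Pearson correlation written for this example rather than a library call.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements for three variables.
variables = {
    "hours_studied": [1, 2, 3, 4, 5],
    "quiz_score": [52, 60, 71, 80, 88],
    "errors_made": [9, 7, 6, 4, 2],
}

names = list(variables)
# Rows and columns are both entities (variables); each cell holds
# the relationship between the row variable and the column variable.
correlation_table = {
    row: {col: pearson(variables[row], variables[col]) for col in names}
    for row in names
}
```

Here `correlation_table["hours_studied"]["quiz_score"]` is a cell whose meaning depends on both its row and its column, unlike the access-log table where the row alone identifies the entity.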
JSON has become the lingua franca of programming on the web, and is fundamental to how most modern REST APIs are built. Because of its ubiquitous use as a communication format for services on the web, it’s also taken hold as a very popular native format for data. Tools such as MongoDB, CouchDB, Elasticsearch, and even Postgres provide ways to store and retrieve data as JSON.
JSON provides a way to express entities, attributes, and values, just like tables. However, JSON goes further by letting us also express structure, some level of type information, and multiple values for a single attribute. By using JSON syntax, we can express nested entities and how they relate to parent entities. For example, imagine data about people that includes their mailing and billing addresses:
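A minimal sketch of such a record follows, parsed with Python's standard `json` module. The person, streets, cities, and ZIP codes are invented for illustration, and the key names are just one plausible choice.

```python
import json

# Hypothetical person record with two nested address entities.
person_json = """
{
  "Name": "Bob",
  "MailingAddress": {"Street": "12 Elm Street", "City": "Nashville", "Zip": "37201"},
  "BillingAddress": {"Street": "PO Box 450", "City": "Franklin", "Zip": "37064"}
}
"""

person = json.loads(person_json)

# The nesting tells us which ZIP code belongs to which address.
mailing_zip = person["MailingAddress"]["Zip"]
billing_zip = person["BillingAddress"]["Zip"]
```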
Notice that the structure of the mailing and billing addresses is the same, both being addresses, but it’s clear which ZIP code belongs to each address. In a single table we’d have to express these nested entities with prefixes, creating columns such as “BillingAddress.City” and “MailingAddress.Zip”. This can get unwieldy very quickly.
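The prefixing approach can be sketched with a small recursive helper. The `flatten` function and the sample record are hypothetical, written only to show how nested keys turn into dotted column names.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining key paths with dots."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse, carrying the parent key along as a prefix.
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat

# Hypothetical record; the values are invented.
person = {
    "Name": "Bob",
    "MailingAddress": {"City": "Nashville", "Zip": "37201"},
    "BillingAddress": {"City": "Franklin", "Zip": "37064"},
}

row = flatten(person)
columns = list(row)
```

With only two short addresses the row already needs five columns, and every extra level of nesting makes the column names longer.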
Multiple values for a single attribute are much easier to express in JSON as well. For example, let’s say Bob has multiple nicknames:
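One way this might look is a JSON array under a single key. The nicknames and the `Nicknames` key are made up for this sketch.

```python
import json

# Hypothetical record: one attribute holding several values as a JSON array.
bob = json.loads('{"Name": "Bob", "Nicknames": ["Bobby", "Rob", "Bobert"]}')

nicknames = bob["Nicknames"]   # parsed directly into a Python list
nickname_count = len(nicknames)
```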
Expressing this in a single table is generally not easy. Many relational databases, Excel sheets, and CSV formats don’t allow it. Instead you’ll have to “pack” the values into a cell (e.g., stringing them all together and separating them with commas), and have some way to unpack the values later. JSON, however, provides a native way to handle this case.
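The pack-and-unpack workaround can be sketched as follows. The `pack` and `unpack` helpers are hypothetical, and the second half shows one way the approach can go wrong when a value contains the separator itself.

```python
def pack(values, sep=","):
    """Join several values into a single cell string."""
    return sep.join(values)

def unpack(cell, sep=","):
    """Split a packed cell back into its values."""
    return cell.split(sep) if cell else []

nicknames = ["Bobby", "Rob", "Bobert"]
cell = pack(nicknames)      # one string that fits in a single table cell
round_trip = unpack(cell)   # the original list comes back

# A value that contains the separator breaks the round trip:
tricky = ["Bob, Jr.", "Rob"]
broken = unpack(pack(tricky))  # splits into three pieces instead of two
```

Avoiding this requires an escaping rule or a separator that never appears in the data, which is exactly the bookkeeping a JSON array spares you.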
Arguably, the most critical expression of information is text. Without a structured, written language, we wouldn’t have a platform for expressing and passing along knowledge to peers and new generations. Google wouldn’t exist, not to mention there wouldn’t be anything to Google even if it did. Even worse, we wouldn’t be able to provide the many great series of Watershed blogs without it! :)
Text is often referred to as “unstructured data” because it doesn’t convey information in a format that clearly draws a line around the different entities, attributes, and values. There is a spectrum in terms of that delineation, wherein some “text” data—such as application logs—might clearly have a defined format that separates attributes and entities, while other text—such as this blog post—embeds concepts within various grammatical expressions.
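At the more structured end of that spectrum, a log line can often be split into attributes with a regular expression. The log format, field names, and sample line below are assumptions made for this sketch, not a standard.

```python
import re

# Assumed line format: "<date> <time> <LEVEL> <service> <free-text message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>\S+) "
    r"(?P<message>.*)"
)

line = "2024-01-05 09:15:00 ERROR payment-service Timeout contacting gateway"

match = LOG_PATTERN.fullmatch(line)
# Named groups become attribute names; the message stays free text.
entry = match.groupdict() if match else None
```

The timestamp, level, and service separate cleanly, while the message remains ordinary prose that would need further analysis.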
But in either case, entities do exist within the text data. And attributes, values, and entity relationships often exist as well. Modern techniques exist for natural language processing and entity analysis of text. These techniques let us bring out entities from the text into another structure that can be processed in more straightforward ways. Imagine a process that could analyze simple noun-verb-object sentences and create a table of data about those statements:
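A deliberately naive sketch of such a process is shown below. It assumes every usable sentence is exactly three words in noun, verb, object order and skips anything else. The `parse_statement` helper and the sample text are invented for illustration.

```python
def parse_statement(sentence):
    """Turn a three-word sentence into a table row, or None if it doesn't fit."""
    words = sentence.strip().rstrip(".").split()
    if len(words) != 3:
        return None  # too simple a parser to handle anything else
    noun, verb, obj = words
    return {"noun": noun, "verb": verb, "object": obj}

text = "Alice wrote code. Bob read documentation. It rained."

# Split on periods and drop the empty piece after the final one.
sentences = [s for s in text.split(".") if s.strip()]

# Each successfully parsed sentence becomes one row of the table.
table = [row for row in map(parse_statement, sentences) if row is not None]
```

The first two sentences become rows with `noun`, `verb`, and `object` columns, while "It rained" is dropped because it does not match the three-word pattern.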
Modern natural language processing is much more sophisticated than this, but the concept is the same.
There are plenty of other formats we haven’t talked about, like the binary data that powers images, video, and audio. The rise of social networks has also spurred growing interest in databases that can natively store and retrieve graphs.
Voice assistants, such as Amazon’s Alexa, are driving more work on translating audio to text and then to structured queries. Smartphones and virtual reality make extensive use of gyroscopic and spatial data, which may often be stored and retrieved in a native format. We don’t need to do a deep dive on each of these, but they are more examples of what it means to talk about the shape of data.