One of my heroes is the avant garde artist Laurie Anderson. She is probably best known for the unlikely hit “Oh Superman” in the eighties and being married to Lou Reed, but I think she is an artist of comparable or even greater magnitude. On one of her later albums is a typical Laurie Anderson song called: “The Beginning of Memory”. Being a data guy this naturally piqued my interest. It was sort of a win-win scenario. The song is an account of a myth from an Ancient Greek play by Aristophanes: “The Birds”. Here are the lyrics to the song :
There’s a story in an ancient play about birds called The Birds
And it’s a short story from before the world began
From a time when there was no earth, no land
Only air and birds everywhere
But the thing was there was no place to land
Because there was no land
So they just circled around and around
Because this was before the world began
And the sound was deafening. Songbirds were everywhere
Billions and billions and billions of birds
And one of these birds was a lark and one day her father died
And this was a really big problem because what should they do with the body?There was no place to put the body because there was no earth
And finally the lark had a solution
She decided to bury her father in the back of her own head
And this was the beginning of memory
Because before this no one could remember a thing
They were just constantly flying in circles
Constantly flying in huge circles
While myths are believed to be literal truth by very few people they usually point to some more abstract and deeper truth. It is rarely clear exactly how and what it means. But I think I see the deeper point here that may actually teach us something valuable. Bear with me for a second.
The Data Deluge and The Beginning of Memory
The feeling I got from the song was eerily familiar with the feeling I get from working with Internet of Things. Our phones constantly track our movements; our cars record data on the engine and performance. Sensors that monitor us every minute of our lives are silently invading our world. When we go through the streets of Manhattan we are monitored by the NYPDs system of surveillance cameras, Alexa is listening in on our conversations and Nest thermostats sense when we are home.
This is what is frequently referred to as the Internet of things. The analogy to the story about the birds is that until now we have just been flying about in circles with no real sense of direction or persistence to our movement. What is often overlooked is that the fact that we can now measure the movement and status of things only amplifies the cacophony of the deafening sound of billions of billions of birds, sorry, devices.
This is where the birth of memory comes in. Because not until the beginning of memory do we gain firm ground under our feet. It is only with memory that we provide some persistence to our throngs of devices and their song. We capture signals and persist them in one form of memory or another.
The majority of interest in IoT is currently dedicated to exactly this process, how do we capture the data? What protocols do we use? Is MQTT better or does AMQP provide a better mechanism? What is the velocity and volume of the data? Do we capture it as a stream or as micro batches?
We also spend a great deal of time figuring out whether it is better to store in HDFS, Mongo DB, or Hbase, should we use Azure SQL Data Warehouse or Redshift or something else? We read studies about performance benchmarks and guidelines to making these choices (I do at least).
These are all worthwhile and interesting problems that also capture a large part of my time, but it also completely misses the point! If we refer back to the ancient myth, the Lark did not want to remember and persist everything, it merely wanted to persist the death of its father, it only wanted to persist something because it was something that mattered!
What Actually Matters?
And this is where we go wrong. We are just persisting the same incessant bird song frequently without pausing to think about what actually matters. We should heed the advice of the ancient myth and reflect on what is important to persist. I know this is against most received wisdom in BI and Big Data, where the mantra has been “persist as much as possible, you never know when you are going to need it”
But actually the tides are turning on that view due to a number of new limiting factors such as storage, processing and connectivity. Granted, storage is still getting cheaper and cheaper and network bandwidth more and more ample. Even processing is getting cheaper. However, if you look closely at the fine print of the cloud vendors, services that process data and move data are not all that cheap. And you do need to move the data and process it in order to do anything with it. Amazon will allow you to store anything at next to no cost in S3, but if you want to process it with Glue or query with Athena it is not so cheap.
Another emerging constraining factor is connectivity. Many devices today still connect to the Internet through the cellular network. Now, cellular networks are operated by carriers that pay good money for the frequencies used. This money is passed on to the users. On average a device is not different from a cell phone, so naturally you have to pay something close to the price of a cell phone connection, around $30 to $40. I do get the enthusiasm around billions of devices, but if the majority of these are connecting to the internet through the cellular radio spectrum, then the price is also billions of dollars.
Suddenly, the bird song is not so pleasant to most ears and our ornithological enthusiasm is significantly curbed. These trends are sufficient to warrant us starting to think about persisting only what actually matters. That can be a lot, if you really have a feasible use case for storing for example for storing all your engine data (which you might), it could also be that the 120 data points per second from your connected tooth brush may turn out to probably not matter that much.
And I haven’t even started to touch on how you would ever find sense in all the data that you persisted to memory. Most solutions do not employ adequate metadata management or data catalogs or other solutions that would tell anyone what a piece of data actually “means”. If we don’t know or have any way of knowing what a piece of data means there is absolutely no reason to store it. If you have a data feed with 20 variables but you don’t know what they are, how is it ever going to help you?
Store what matters
This can actually be turned into a rule of thumb about data storage in general: The data should be stored only to the extent that someone feels it matters enough to describe what it actually is. If no one can be bothered to pin down a description of this variable and no one can be bothered to store that description anywhere it is because it doesn’t matter.