The Data Deluge, Birds and the Beginning of Memory

One of my heroes is the avant-garde artist Laurie Anderson. She is probably best known for the unlikely eighties hit “O Superman” and for being married to Lou Reed, but I think she is an artist of comparable or even greater magnitude. On one of her later albums there is a typical Laurie Anderson song called “The Beginning of Memory”. Being a data guy, I naturally found my interest piqued; it was sort of a win-win scenario. The song is an account of a myth from an ancient Greek play by Aristophanes, The Birds. Here are the lyrics to the song:

There’s a story in an ancient play about birds called The Birds
And it’s a short story from before the world began
From a time when there was no earth, no land
Only air and birds everywhere

But the thing was there was no place to land
Because there was no land
So they just circled around and around
Because this was before the world began

And the sound was deafening. Songbirds were everywhere
Billions and billions and billions of birds

And one of these birds was a lark and one day her father died
And this was a really big problem because what should they do with the body?
There was no place to put the body because there was no earth

And finally the lark had a solution
She decided to bury her father in the back of her own head
And this was the beginning of memory
Because before this no one could remember a thing
They were just constantly flying in circles
Constantly flying in huge circles

While very few people believe myths to be literal truth, they usually point to some more abstract and deeper truth, even if it is rarely clear exactly what that truth is and how it applies. But I think I see the deeper point here, and it may actually teach us something valuable. Bear with me for a second.

The Data Deluge and The Beginning of Memory

The feeling I got from the song was eerily similar to the feeling I get from working with the Internet of Things. Our phones constantly track our movements; our cars record data on engine status and performance. Sensors that monitor us every minute of our lives are silently invading our world. When we walk through the streets of Manhattan we are watched by the NYPD's system of surveillance cameras, Alexa is listening in on our conversations, and Nest thermostats sense when we are home.

This is what is frequently referred to as the Internet of Things. The analogy to the story about the birds is that until now we have just been flying around in circles with no real sense of direction or persistence to our movement. What is often overlooked is that the ability to measure the movement and status of things only amplifies the cacophony, the deafening sound of billions and billions of birds, sorry, devices.

This is where the birth of memory comes in, because not until the beginning of memory do we gain firm ground under our feet. It is only with memory that we give some persistence to our throngs of devices and their song: we capture signals and persist them in one form of memory or another.

The majority of interest in IoT is currently dedicated to exactly this process. How do we capture the data? What protocols do we use? Is MQTT better, or does AMQP provide a better mechanism? What are the velocity and volume of the data? Do we capture it as a stream or as micro-batches?
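
To make the stream-versus-micro-batch question a little more concrete, here is a minimal Python sketch. All names are illustrative, and a real feed would arrive over MQTT, AMQP or similar rather than from a dummy generator. Processing each reading as it arrives is the streaming style; buffering readings and persisting them in small groups is the micro-batch style:

    import time
    from typing import Dict, Iterator, List

    def sensor_stream() -> Iterator[Dict]:
        # Stand-in for an incoming device feed; a real source would be
        # an MQTT or AMQP subscription.
        while True:
            yield {"ts": time.time(), "value": 42.0}

    def micro_batches(stream: Iterator[Dict],
                      batch_size: int = 100,
                      max_wait_s: float = 5.0) -> Iterator[List[Dict]]:
        # Buffer readings and emit them in small batches, flushing either
        # when the batch is full or when it has waited long enough.
        batch: List[Dict] = []
        deadline = time.time() + max_wait_s
        for reading in stream:
            batch.append(reading)
            if len(batch) >= batch_size or time.time() >= deadline:
                yield batch
                batch, deadline = [], time.time() + max_wait_s

    for batch in micro_batches(sensor_stream()):
        print(f"persisting {len(batch)} readings")  # real storage write goes here
        break  # demo only: stop after one batch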

We also spend a great deal of time figuring out whether it is better to store the data in HDFS, MongoDB or HBase, and whether we should use Azure SQL Data Warehouse, Redshift or something else. We read performance benchmarks and guidelines for making these choices (I do, at least).

These are all worthwhile and interesting problems that also take up a large part of my time, but they completely miss the point! If we refer back to the ancient myth, the lark did not want to remember and persist everything; she merely wanted to persist the death of her father. She only wanted to persist something because it was something that mattered!

What Actually Matters?

And this is where we go wrong. We are just persisting the same incessant bird song, frequently without pausing to think about what actually matters. We should heed the advice of the ancient myth and reflect on what is important to persist. I know this goes against most received wisdom in BI and big data, where the mantra has been “persist as much as possible, you never know when you are going to need it”.

But the tide is actually turning on that view due to a number of limiting factors: storage, processing and connectivity. Granted, storage is still getting cheaper and cheaper, and network bandwidth more and more ample. Even processing is getting cheaper. However, if you look closely at the cloud vendors' fine print, the services that process and move data are not all that cheap, and you do need to move and process the data in order to do anything with it. Amazon will let you store almost anything at next to no cost in S3, but if you want to process it with Glue or query it with Athena, it is not so cheap.
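
A rough back-of-the-envelope sketch illustrates the point. The prices below are assumptions based on published list prices at one point in time (around $23 per TB-month for S3 standard storage and $5 per TB scanned for Athena); they change, so treat the numbers as illustrative only:

    # Hedged cost sketch: storing data is cheap, touching it is not.
    # Prices are assumptions; check current vendor pricing.
    S3_PER_TB_MONTH = 23.0        # ~ $0.023/GB-month, S3 standard
    ATHENA_PER_TB_SCANNED = 5.0   # Athena bills per TB of data scanned

    data_tb = 10            # dataset size in TB
    scans_per_month = 30    # one full-dataset query per day

    storage_cost = data_tb * S3_PER_TB_MONTH
    query_cost = data_tb * scans_per_month * ATHENA_PER_TB_SCANNED

    print(f"storage: ${storage_cost:,.0f}/month")  # ~$230
    print(f"queries: ${query_cost:,.0f}/month")    # ~$1,500

Under these assumptions, querying the data every day costs more than six times as much as merely storing it.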

Another emerging constraint is connectivity. Many devices today still connect to the Internet through the cellular network. Cellular networks are operated by carriers that pay good money for the frequencies they use, and that money is passed on to the users. To the carrier, the average device is not that different from a cell phone, so naturally you have to pay something close to the price of a cell phone connection, around $30 to $40 a month. I do get the enthusiasm around billions of devices, but if the majority of these connect to the internet through the cellular radio spectrum, then the price is also billions of dollars.
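
The arithmetic behind that last sentence is simple enough to sketch (the device count is an illustrative assumption; the fee is the low end of the range above):

    # Why billions of cellular devices imply billions of dollars.
    devices = 1_000_000_000   # assumed: one billion cellular-connected devices
    monthly_fee = 30.0        # low end of the $30-$40 range

    annual_cost = devices * monthly_fee * 12
    print(f"${annual_cost:,.0f} per year")  # $360,000,000,000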

Suddenly the bird song is not so pleasant to most ears, and our ornithological enthusiasm is significantly curbed. These trends are sufficient to warrant us starting to think about persisting only what actually matters. That can still be a lot: you may well have a feasible use case for storing all your engine data, but the 120 data points per second from your connected toothbrush will probably turn out not to matter that much.
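
Those toothbrush numbers add up faster than one might think; a one-line calculation makes the scale explicit:

    # 120 readings per second, around the clock, for a single device:
    readings_per_year = 120 * 60 * 60 * 24 * 365
    print(f"{readings_per_year:,} readings per device per year")  # 3,784,320,000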

And I haven’t even started to touch on how you would ever make sense of all the data you persisted to memory. Most solutions do not employ adequate metadata management, data catalogs or other mechanisms that would tell anyone what a piece of data actually “means”. If we don’t know, and have no way of knowing, what a piece of data means, there is absolutely no reason to store it. If you have a data feed with 20 variables but you don’t know what they are, how is it ever going to help you?

Store What Matters

This can actually be turned into a rule of thumb about data storage in general: data should be stored only to the extent that someone feels it matters enough to describe what it actually is. If no one can be bothered to pin down a description of a variable, and no one can be bothered to store that description anywhere, it is because it doesn’t matter.
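
As a minimal sketch of what enforcing that rule of thumb could look like, assuming a simple in-memory catalog (all names here are hypothetical, not a real catalog API): any field without a human-written description is simply refused at the door.

    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class FieldSpec:
        name: str
        description: str  # the human-written meaning of the field

    # Hypothetical catalog: only described fields get an entry.
    CATALOG: Dict[str, FieldSpec] = {
        "engine_rpm": FieldSpec("engine_rpm", "Crankshaft revolutions per minute"),
        # "var_17" has no entry because nobody could say what it means.
    }

    def persist(record: Dict) -> Dict:
        # Keep only the fields someone cared enough to describe.
        kept = {k: v for k, v in record.items() if k in CATALOG}
        dropped = sorted(set(record) - set(kept))
        if dropped:
            print(f"dropping undocumented fields: {dropped}")
        return kept  # hand the described fields to real storage here

    persist({"engine_rpm": 2500, "var_17": 0.93})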

Reference: Aristophanes, The Birds, https://en.wikipedia.org/wiki/The_Birds_(play)

Big Data From a Product Perspective – Different Views

The hype surrounding big data is currently reaching a climax. While it is evident that we have more and more data, and that there are valuable insights hidden in it, the situation looks different if we consider big data in terms of the products that are actually offered.

If we look at big data from a product perspective, I think the situation is a bit more mixed. As a product category, big data is not yet mature enough to warrant these huge valuations, though that could change if a couple of things fall into place. But first, let us look at how big data is viewed by different groups.

Different Views on Big Data
From inside the big data community, the focus is on technologies like Hadoop, Hive and NoSQL databases, on the companies supporting these technologies, and on a plethora of other more or less obscure (to the uninitiated, of course) products that make up the big data ecosystem. It is not seen as closely related to business intelligence, although big data is solving very much the same problem.

If we look at how the media see it, we are looking at something similar to the invention of the wheel: something that will have a profound effect on human civilisation and the way we live for millennia to come.

Investors see big data investments as the quest for the Holy Grail (which might explain some of the silliness): Hortonworks has raised $248 million, Cloudera $1.2 billion, DataStax $189 million, Elastic $104 million, Couchbase $106 million, and so on. None of these companies has a proprietary product; they support open source products, and the business model is one of building closed source tools that let customers run the open source better.

The CEOs who invest in big data really just want a big pile of money. They are not interested in the curious patterns you can find, like the correlation between search terms containing the word coconut and the migratory patterns of the African swallow. They see in big data a new way to make more money and want to get to it immediately.

The CIO is usually completely sidelined in decisions involving big data, perhaps because he is increasingly becoming the custodian of legacy technologies; the push for big data often comes from isolated infrastructure projects or from business development.

Developers view big data as a way to model the real world in intricate detail, like the Matrix. Soon, with big data technology, we will be able to model the entire universe and predict what will happen.

What end users see is alarming complexity. You need semi-programming skills to perform even simple queries, and you need to be adept at manoeuvring applications with hundreds of functions, much like a sysadmin. This is often the case with open source development: usability suffers because the community wants to take the product in different directions. Furthermore, developers are users themselves, and since they already know the product there is no real pressure to make it easy to use for the uninitiated.

What is it really? In the end, big data may very well turn out to be just like the Segway. I am not saying that it will only be used by mall cops and tourists, but rather that it might end up serving very limited segments and industries with very specific needs.

Enter the genius – the five specialisations of the big data employee
The problem today is that in order to get any value out of big data you need to be a virtual genius: you need to master at least five areas that are usually separate specialisations.

  1. First of all, you need to be a developer. You might not need to code an actual application if you are only using big data for analytical purposes, but one way or another you need to be able to write code to extract the information you need.
  2. Second, you need to be an infrastructure architect and sysadmin, because you have to set up a great number of servers and networks and know about the multitude of different infrastructure elements.
  3. Third, you need to be a database administrator. You have to set up databases and maintain them, and establish ETL processes, sharding and the like (you do not have to worry about database schemas, though).
  4. Fourth, you need to be a data scientist, since you need to know a fair amount about machine learning algorithms in order to extract patterns from the data.
  5. Fifth, you need to be a business analyst. If big data is to make sense from a business perspective, it is necessary to understand the business model, the revenue streams, the cost structure and so on. You also need to know a fair amount about the customers, such as which parameters to segment them by and what their pain points are.

Naturally, you don’t have to have all of that in one person. In principle it can be spread across several employees, but you will quickly find yourself hiring a complete team just to get started, and it is still difficult to find specialists who know even one or two of these things. On top of this you need very tight integration, because big data is more integrated than other technologies.

If you succeed with this, your problems are unfortunately not over. Most organisations already have established procedures where work is split along the lines mentioned above: application developers, operations, DBAs, analysts and business developers. Each department has its own governance and procedures describing hand-offs to other departments. Now you are asking the organisation to circumvent all of these established procedures.

So big data products still have a long way to go before they are ready for the mass market and the really big bucks.