Data Is the New Oil – Building the Data Refinery

“Data Is the New Oil!”

Mathematician and IT architect Clive Humby seems to have coined the phrase in 2006, while he was helping Tesco grow from a fledgling UK retail chain into an intercontinental industry titan, rivaled only by the likes of Walmart and Carrefour, through the use of data from the Tesco reward program. Several people have reiterated the concept since. But the realization did not really hit prime time until The Economist claimed in May 2017 that data had surpassed oil as the world’s most valuable resource.

Data, however, is not just out there and up for grabs. Just like oil has to be pumped out of the ground, data has to be extracted from computer systems or devices first. And when you do get the oil out of the ground, it is still virtually useless. Crude oil is just a nondescript blob of black goo. Getting the oil is just a third of the job. This is why we have oil refineries. Oil refineries turn crude oil into valuable and consumable resources like gas, diesel or propane. They split the raw oil into different substances that can be used for many different products like paint, asphalt, nail polish, basketballs, fishing boots, guitar strings and aspirin. This is awesome; can you imagine a world without guitar strings, fishing boots or aspirin? That would be like Harry Potter without the magic…

Similarly, even if we can get our hands on it, raw data is completely useless. If you have ever glanced at a web server log, a binary data stream or other machine-generated output, you can relate to the analogy of crude oil as a big useless blob of black goo. All this data does not mean anything in itself. Getting the raw data is of course a challenge in some cases, but making it useful is a completely different story. That is why we need to build data refineries: systems that turn the useless raw data into components that we can build useful data products from.

Building the data refinery

For the past year or so, we have worked to design and architect such a data refinery for New York City. The “Data as a Service” program is the effort to build this refinery, turning raw data from the City of New York into valuable and consumable services to be used by City agencies, residents and the rest of the world. We have multiple data sources in systems of record, registers, logs, official filings and applications, inspections and hundreds of thousands of devices. Only a fraction of this data is even available today, and when it is available it is hard to discover and use. The purpose of Data as a Service is to make all the hidden data available and useful.

A typical refinery takes crude oil through a series of distinct processes that result in distinct products for different purposes. The purpose of the refinery is to break the crude oil down into distinct, useful products. The Data as a Service refinery has five capability domains we want to manage in order to break the raw data down into useful data assets:

  • Quality is about the character and validity of the data assets
  • Movement is how we transfer and transform data assets from one place to another
  • Storage deals with how we retain data assets for later use
  • Discovery has to do with how we locate the data assets we need
  • Access deals with how we allow users and other solutions to interact with data assets

Let us look at each of these in a bit more detail.


Data quality

The first capability domain addresses the quality of the data. Like crude oil, the raw data is initially of low quality. It may be a stream of bits or characters, telemetry data, logs or CSV files.

The first thing to think about in any data refinery is how to assess and manage the quality of the data. We want to know how many data objects there are, whether they are in the right format and whether they are corrupted. Simple descriptive reports on, for example, the number of distinct values, type mismatches and nulls can be very revealing and important when considering how the data can be used by other systems and processes.
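As an illustration, a descriptive report of this kind could be sketched in a few lines of Python. This is only a minimal sketch: the function name profile_column, the choice to treat empty strings as nulls and the sample values are assumptions made for the example and are not taken from the DaaS program.

```python
def profile_column(values, expected_type):
    """Return simple descriptive quality metrics for one column of raw values."""
    nulls = 0
    mismatches = 0
    distinct = set()
    for value in values:
        # Assumption: None and empty strings both count as missing values.
        if value is None or value == "":
            nulls += 1
            continue
        distinct.add(value)
        try:
            # A value is a type mismatch if it cannot be converted.
            expected_type(value)
        except (TypeError, ValueError):
            mismatches += 1
    return {
        "count": len(values),
        "nulls": nulls,
        "distinct": len(distinct),
        "type_mismatches": mismatches,
    }


# Hypothetical raw column, as it might arrive from a CSV file.
raw_ages = ["34", "41", "", "n/a", "41", None]
report = profile_column(raw_ages, int)
print(report)
```

Even a report this simple shows that a third of the values are missing and that one value cannot be read as a number, which is exactly the kind of knowledge a downstream system needs before it relies on the data.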

Once we know the quality of the data we may want to intervene and do something about it. Data preparation formats the data from its initial raw form. It may also validate that the data is not corrupted and can delete, insert and transform values according to preconfigured rules. This is the first diagnostic and cleansing of the data in the DaaS refinery.

Once we have the initial data objects lined up in an appropriate format, Master Data Management (MDM) is what allows us to work proactively and reactively on improving the data. With MDM we will be able to uniquely identify data objects across multiple different solutions and format them into a common semantic model. MDM enables an organization to manage data assets and produce golden records, identify and eliminate duplicates, and control which data entities are valid and invalid.
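To make the idea of a golden record a little more concrete, here is a small sketch of how duplicates from two source systems could be matched and merged. The matching rule (normalized name and address), the survivorship rule (first non-empty value wins) and all record contents are hypothetical simplifications; commercial MDM tools use far more sophisticated matching.

```python
def normalize_key(record):
    """Build a match key from name and address (a deliberately naive rule)."""
    name = " ".join(record["name"].lower().split())
    address = " ".join(record["address"].lower().replace(".", "").split())
    return (name, address)


def build_golden_records(records):
    """Group records that share a match key and merge each group into one record."""
    golden = {}
    for record in records:
        key = normalize_key(record)
        merged = golden.setdefault(key, {"sources": []})
        merged["sources"].append(record["source"])
        for field, value in record.items():
            # Survivorship rule (assumption): keep the first non-empty value per field.
            if field != "source" and value and not merged.get(field):
                merged[field] = value
    return list(golden.values())


# Hypothetical records from two source systems.
records = [
    {"source": "permits", "name": "Jane  Doe", "address": "1 Main St.", "phone": ""},
    {"source": "inspections", "name": "jane doe", "address": "1 main st", "phone": "555-0100"},
    {"source": "permits", "name": "John Roe", "address": "9 Elm Ave", "phone": ""},
]
golden_records = build_golden_records(records)
for golden_record in golden_records:
    print(golden_record)
```

The two Jane Doe records collapse into one golden record that remembers both source systems and picks up the phone number that only one of them had.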

Data movement

Once we have made sure that we can manage the quality of the data, we can proceed to the next phase. Here we will move and transform the data into more useful formats. Different use cases, however, call for different ways of moving data. Sometimes it is fine to move it once a day, week or even month, but more often we want the data immediately.

Batch is the movement and transformation of large quantities of data from one form and place to another. A typical batch program is executed on a schedule and goes through a sequence of processing steps that transform the data from one form into another. It can range from simple formatting changes and aggregations to complex machine learning models. I should add that what is sometimes called Managed File Transfer, where a file is simply moved and not transformed, can be seen as a primitive form of batch processing, but in this context it is considered a way of accessing data and is described below.
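A batch job of the simple kind described above could look roughly like the following sketch, with a parsing step followed by an aggregation step. The input format, the function names and the idea that a scheduler such as cron triggers run_batch are assumptions made purely for illustration.

```python
from collections import defaultdict


def parse(lines):
    """Step 1: split raw comma-separated lines into (borough, count) pairs."""
    for line in lines:
        borough, count = line.split(",")
        yield borough.strip(), int(count)


def aggregate(pairs):
    """Step 2: sum the counts per borough."""
    totals = defaultdict(int)
    for borough, count in pairs:
        totals[borough] += count
    return dict(totals)


def run_batch(lines):
    """Run the steps in sequence; a scheduler would call this once per period."""
    return aggregate(parse(lines))


# Hypothetical input, standing in for a file delivered once a day.
raw_lines = ["Brooklyn, 3", "Queens, 5", "Brooklyn, 4"]
totals = run_batch(raw_lines)
print(totals)
```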

The Enterprise Service Bus (ESB) is a processing paradigm that lets different programmatic solutions interact with each other through messaging. A message is a small discrete unit of data that can be routed, transformed, distributed and otherwise processed as part of the information flow in the service bus. This is what we use when systems need to communicate across city agencies. It is a form of centralized orchestration.

But some data is not as nicely and easily managed. Sometimes we see use cases where the processing can’t wait for a batch run and the ESB paradigm does not scale to the quantities involved. Real-time processing works on data that arrives in continuous streams. It has limited routing and transformation capabilities, but is especially geared towards handling large amounts of data that come in continuously, in order to store, process or forward them.
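The difference from batch can be sketched with a generator that processes each reading as it arrives instead of waiting for a complete file. The rolling average, the window size and the sensor_readings stand-in are hypothetical choices for the example; a production setup would typically read from a streaming platform rather than a Python list.

```python
from collections import deque


def rolling_average(stream, window_size):
    """Yield the average of the last `window_size` readings as each one arrives."""
    window = deque(maxlen=window_size)  # old readings fall out automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)


def sensor_readings():
    """Stand-in for a continuous stream of device readings."""
    yield from [10, 20, 30, 40]


averages = list(rolling_average(sensor_readings(), window_size=2))
print(averages)
```

Only the last few readings are held in memory at any time, which is what lets this style of processing keep up with data that never stops coming.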


Data storage

Moving the data naturally requires places to move it to. Different ways of storing data have different properties, and we want to optimize utility by choosing the right way to store the data.

One of the most important and widespread ways to store data is the Data Warehouse. This is a structured store that contains data prepared for frequent ad hoc exploration by the business. It can contain pre-aggregated data and calculations that are often needed. Schemas are built in advance to address reporting needs. The Data Warehouse focuses on centralized storage and consequently on data that has utility across different city agencies.

Whereas Data Warehouses are central stores of high-quality validated data, Data Marts are local data stores. They are similar to Data Warehouses in that the data is prepared to some degree, but the scope is more local, for an agency to do analytics internally. Frequently the data schemas are also of a more ad hoc character and may not be designed for widespread consumption. A Data Mart also serves as a user-driven test bed for experiments. If an agency wants to create a data source and figure out whether it has any utility, the Data Mart is a great way to create value quickly, in a decentralized and agile manner.

Where Data Warehouses and Data Marts store structured data, a data lake is primarily a store for unstructured and semi-structured data, like CSV, XML and log files, as well as binary formats like video and audio. The data lake is a place to throw data first and then think about how to use it later. There are several zones within the data lake with varying degrees of structure, such as the raw, analytical, discovery, operational and archive zones. Some parts, like the analytical zone, can be as structured as Data Marts and be queried with SQL or a similar syntax such as HiveQL, whereas others, like the raw zone, require more programming to extract meaning. The data lake is a key component in bringing in more data and transforming it into something useful and valuable.

The Operational Data Store is in essence a read replica of an operational database. It is used in order not to unnecessarily tax an operational, transactional database with queries.

The City used to have real warehouses filled with paper archives, which burned down every now and then. The reason for keeping such archives is that all data has a retention policy that specifies how long it should be stored. This need is still there when we digitize data. Consequently we need to be in complete control of the lifecycle of all data assets. The archive is where data is moved when there is no longer a need to access it frequently, so data access can have a long latency. Archives are typically used in cases where regulatory requirements call for data to be kept for a specific period of time.
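To illustrate how such a lifecycle could be expressed as a rule, here is a small sketch. The thresholds, the function name lifecycle_action and the choice to delete data once the retention period has passed are assumptions made for the example, not actual City policy.

```python
from datetime import date, timedelta


def lifecycle_action(created, last_accessed, today,
                     archive_after_days=365, retention_days=7 * 365):
    """Decide what to do with a data asset under a simple retention policy."""
    # Assumption: data older than the retention period may be deleted.
    if today - created > timedelta(days=retention_days):
        return "delete"
    # Data that has not been accessed for a while moves to the archive.
    if today - last_accessed > timedelta(days=archive_after_days):
        return "archive"
    return "keep"


today = date(2024, 1, 1)
# Hypothetical assets as (created, last accessed) pairs.
assets = [
    (date(2015, 1, 1), date(2015, 1, 1)),
    (date(2022, 1, 1), date(2022, 6, 1)),
    (date(2023, 6, 1), date(2023, 12, 15)),
]
actions = [lifecycle_action(created, accessed, today) for created, accessed in assets]
print(actions)
```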


Data discovery

Now that we have ways to control the quality, move the data and store it, we also need to be able to discover it. Data that cannot be found is useless. Therefore we need to supply a number of capabilities for finding the data we need.

If the user is in need of a particular data asset, search is the way to locate it. Using familiar query functions, the user can search on single words or strings. We all know this from online search engines. The need is the same here: to be able to intelligently locate the right data asset based on an input string.

When the user does not know exactly what data asset he or she is looking for, we want to be able to supply other ways of discovering data. In a data catalog the user can browse existing data sources and locate the needed data based on tags or groups. The catalog also allows previews as well as additional metadata about the data source, such as descriptions, data dictionaries and experts to contact.
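The two ways of locating data, searching on a string and browsing by tag, can be sketched against a tiny in-memory catalog. The catalog entries, field names and matching logic are all hypothetical; a real catalog product would add ranking, previews and access control on top.

```python
# Hypothetical catalog entries with a description and a set of tags.
catalog = [
    {"name": "restaurant_inspections",
     "description": "Results of restaurant health inspections",
     "tags": {"health", "inspections"}},
    {"name": "street_tree_census",
     "description": "Location and species of street trees",
     "tags": {"environment"}},
    {"name": "building_permits",
     "description": "Permits issued for construction work",
     "tags": {"housing"}},
]


def search(catalog, query):
    """Return names of entries whose name or description contains every query word."""
    words = query.lower().split()
    results = []
    for entry in catalog:
        text = (entry["name"] + " " + entry["description"]).lower()
        if all(word in text for word in words):
            results.append(entry["name"])
    return results


def browse(catalog, tag):
    """Return names of entries carrying a given tag."""
    return [entry["name"] for entry in catalog if tag in entry["tags"]]


print(search(catalog, "health inspections"))
print(browse(catalog, "environment"))
```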

In some cases a user group knows exactly what subset of data is needed, but the data may not all reside in the same place or format. By introducing a virtual layer between the user and the data sources, it is possible to create durable semantic layers that remain even when data sources are switched. It is also possible to create specific views of the same data source tailored to a particular audience. This way the view of the data will cater to the needs of individual user groups rather than being a catch-all, lowest-common-denominator version, which is particularly convenient since access to sensitive data is granted on a case-by-case basis. Data virtualization will make it possible for users to discover only the data they are legally permitted to view.


Data access

Now that we are in control of the quality of data and who can use it, we also need to think about how we can let users consume the data. Across the city there are very different needs for consuming data.

Access by applications is granted through an API and supplies a standardized way for programmatic access by external and internal IT solutions. The API controls ad hoc data access and also supplies documentation that allows developers to interact with the data through a developer portal. Typically the data elements are smaller and involve a dialogue between the solution and the API.

When files need to be moved securely between different points without any transformation, a managed file transfer solution is used. This is also typically accessed by applications, but a portal also allows humans to upload or download files. This is to be distinguished from document sharing sites like SharePoint, WorkDocs, Box and Google Docs, where the purpose is for human end users to share files with other humans and typically cooperate on authoring them.

An end user will sometimes need to query a data source in order to extract a subset of the data. Query allows this form of ad hoc access to underlying structured or semi-structured data sources. This is typically done through SQL. An extension of this is natural language queries, through which the user can interrogate a data source through questions and answers. With the advent of conversational interfaces like Alexa, Siri and Cortana, this is something we expect to develop further.

A stream is a continuous sequence of data that applications can use. Applications subscribe to a stream and receive its data in real time. This is used when time and latency are of the essence. The receiving system will need to parse and process the stream by itself.

Contrary to this, events are already processed and are essentially messages that function as triggers from systems, indicating that something has happened or should happen. Other systems can subscribe to events and implement adequate responses to them. Like streams they are real time, but unlike streams they are not continuous. They also resemble APIs in that the messages are usually small, but differ in that they implement a push pattern.
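The push pattern behind events can be shown with a very small publish and subscribe sketch. The EventBus class, the event names and the payloads are hypothetical and stand in for whatever messaging product is actually used.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe mechanism."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register a handler to be called whenever `event_type` is published."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Push the payload to every subscriber of this event type."""
        for handler in self._subscribers[event_type]:
            handler(payload)


received = []
bus = EventBus()
bus.subscribe("permit_issued", received.append)

bus.publish("permit_issued", {"permit_id": 42})
bus.publish("inspection_failed", {"site": "A"})  # nobody subscribed, nothing happens
print(received)
```

Note that the subscriber never asks for data. It only states what it is interested in and is then called when something happens, which is the opposite of the request and response dialogue of an API.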

Implementing the refinery

Naturally some of this has already been built, since processing data is not something new. What we try to do with the Data as a Service program is to modernize existing implementations of the above-mentioned capabilities and plan for how to implement the missing ones. This involves a jigsaw puzzle of projects, stakeholders and possibilities. Like most other places we are not working from a green field, and there is no multi-million-dollar budget for creating all these interesting new solutions. Rather we have to continuously come up with ways to reach the target incrementally. This is what I have previously described as pragmatic idealism. What is important for us, as I suspect it will be for others, is to have a bold and comprehensive vision for where we want to go. That way we can hold up every project or idea against this target and evaluate how we can continuously progress closer to our goal. As our team’s motto goes: “Enterprise Architecture – One solution at a time”.

The Data Deluge, Birds and the Beginning of Memory

One of my heroes is the avant-garde artist Laurie Anderson. She is probably best known for the unlikely hit “O Superman” in the eighties and for being married to Lou Reed, but I think she is an artist of comparable or even greater magnitude. On one of her later albums is a typical Laurie Anderson song called “The Beginning of Memory”. As a data guy, I naturally found my interest piqued. It was sort of a win-win scenario. The song is an account of a myth from an Ancient Greek play by Aristophanes, “The Birds”. Here are the lyrics to the song:

There’s a story in an ancient play about birds called The Birds
And it’s a short story from before the world began
From a time when there was no earth, no land
Only air and birds everywhere

But the thing was there was no place to land
Because there was no land
So they just circled around and around
Because this was before the world began

And the sound was deafening. Songbirds were everywhere
Billions and billions and billions of birds

And one of these birds was a lark and one day her father died
And this was a really big problem because what should they do with the body?
There was no place to put the body because there was no earth

And finally the lark had a solution
She decided to bury her father in the back of her own head
And this was the beginning of memory
Because before this no one could remember a thing
They were just constantly flying in circles
Constantly flying in huge circles

While very few people believe myths to be literal truth, they usually point to some more abstract and deeper truth. It is rarely clear exactly what that truth is or how to read it. But I think I see a deeper point here that may actually teach us something valuable. Bear with me for a second.

The Data Deluge and The Beginning of Memory

The feeling I got from the song was eerily similar to the feeling I get from working with the Internet of Things. Our phones constantly track our movements; our cars record data on the engine and performance. Sensors that monitor us every minute of our lives are silently invading our world. When we go through the streets of Manhattan we are monitored by the NYPD’s system of surveillance cameras, Alexa is listening in on our conversations and Nest thermostats sense when we are home.

This is what is frequently referred to as the Internet of Things. The analogy to the story about the birds is that until now we have just been flying about in circles with no real sense of direction or persistence to our movement. What is often overlooked is that the fact that we can now measure the movement and status of things only amplifies the cacophony, the deafening sound of billions and billions of birds, sorry, devices.

This is where the birth of memory comes in. Because not until the beginning of memory do we gain firm ground under our feet. It is only with memory that we provide some persistence to our throngs of devices and their song. We capture signals and persist them in one form of memory or another.

The majority of interest in IoT is currently dedicated to exactly this process: how do we capture the data? What protocols do we use? Is MQTT better or does AMQP provide a better mechanism? What is the velocity and volume of the data? Do we capture it as a stream or as micro batches?

We also spend a great deal of time figuring out whether it is better to store data in HDFS, MongoDB or HBase. Should we use Azure SQL Data Warehouse or Redshift or something else? We read studies about performance benchmarks and guidelines for making these choices (I do at least).

These are all worthwhile and interesting problems that also take up a large part of my time, but they completely miss the point! If we refer back to the ancient myth, the lark did not want to remember and persist everything. She merely wanted to persist the death of her father; she only wanted to persist something because it was something that mattered!

What Actually Matters?

And this is where we go wrong. We frequently just persist the same incessant bird song without pausing to think about what actually matters. We should heed the advice of the ancient myth and reflect on what is important to persist. I know this goes against most received wisdom in BI and Big Data, where the mantra has been “persist as much as possible, you never know when you are going to need it”.

But actually the tide is turning on that view due to a number of limiting factors such as storage, processing and connectivity. Granted, storage is still getting cheaper and cheaper and network bandwidth more and more ample. Even processing is getting cheaper. However, if you look closely at the fine print of the cloud vendors, services that process data and move data are not all that cheap. And you do need to move the data and process it in order to do anything with it. Amazon will allow you to store anything at next to no cost in S3, but if you want to process it with Glue or query it with Athena it is not so cheap.

Another emerging constraining factor is connectivity. Many devices today still connect to the Internet through the cellular network. Now, cellular networks are operated by carriers that pay good money for the frequencies used. This money is passed on to the users. On average a device is not different from a cell phone, so naturally you have to pay something close to the price of a cell phone connection, around $30 to $40. I do get the enthusiasm around billions of devices, but if the majority of these are connecting to the internet through the cellular radio spectrum, then the price is also billions of dollars.

Suddenly, the bird song is not so pleasant to most ears and our ornithological enthusiasm is significantly curbed. These trends are sufficient to warrant us starting to think about persisting only what actually matters. That can be a lot if you really have a feasible use case, for example for storing all your engine data (which you might). But it could also be that the 120 data points per second from your connected toothbrush turn out not to matter that much.

And I haven’t even started to touch on how you would ever find sense in all the data that you persisted to memory. Most solutions do not employ adequate metadata management or data catalogs or other solutions that would tell anyone what a piece of data actually “means”. If we don’t know or have any way of knowing what a piece of data means there is absolutely no reason to store it. If you have a data feed with 20 variables but you don’t know what they are, how is it ever going to help you?

Store what matters

This can actually be turned into a rule of thumb about data storage in general: data should be stored only to the extent that someone feels it matters enough to describe what it actually is. If no one can be bothered to pin down a description of a variable and store that description somewhere, it is because the variable doesn’t matter.
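The rule of thumb is simple enough to be sketched in code: only fields that someone has bothered to describe are persisted. The data dictionary, the field names and the sample reading are hypothetical and only serve to illustrate the principle.

```python
# Hypothetical data dictionary: only fields that someone cared enough to describe.
data_dictionary = {
    "rpm": "Engine revolutions per minute",
    "oil_temp_c": "Oil temperature in degrees Celsius",
}


def keep_described_fields(reading, data_dictionary):
    """Drop every field that has no (non-empty) description in the data dictionary."""
    return {
        field: value
        for field, value in reading.items()
        if data_dictionary.get(field)
    }


# A device reading with one field nobody can explain.
reading = {"rpm": 2100, "oil_temp_c": 96.5, "x17": 0.3}
stored = keep_described_fields(reading, data_dictionary)
print(stored)
```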




A/B testing for product managers

Neil McCarthy is Director of Product Management at Yammer, where he has worked for the past three and a half years. Coming from an education in electrical engineering, he has worked for the past 10 years in enterprise software, in roles on the border between the business and the technical side.

At Yammer they decided early on to become a data informed company and invested heavily in an infrastructure to support this along with a team of data scientists. Today, no new feature is released without an A/B test.

Why A/B test your product?
I asked Neil what A/B testing can do that other methods for getting customer feedback, such as focus groups and surveys, can’t do.

“A/B testing helps product teams move faster by helping them build the right things and validate their assumptions along the way. A/B testing is a great way to test an idea you already have, but it’s not a great way to come up with new ideas. Gathering user feedback and thinking strategically about the future of the product and industry is a better way to come up with good ideas.”

At Yammer they also do qualitative and quantitative research after a project to figure out what people are actually doing. This plays a big part in figuring out what happened when a test fails.

One example of a test that turned out to be worse than baseline was when they decided to try to alter the sign-up flow. Conventional wisdom has it that the more friction you take out of the sign-up flow, the better the retention of the customer. So, Yammer hypothesized that by taking a few steps out of the sign-up flow and putting them into the product, they could increase long-term retention. But to their surprise it turned out that taking out these steps had the opposite effect. The sign-up flow was helping users understand what Yammer is. Therefore they did not keep the change and instead left the sign-up flow as it was. An example of a success was when they tested a module in the feed that suggested that the user follow other users that their friends followed. It turned out that a lot of users started to follow others, and this resulted in a lift in the core metric of days engaged.

How to test
Yammer is not Twitter or Facebook, which can run significant tests with only 1% of their users. Instead, Yammer usually tests on 50% of its users. Still, it takes a minimum of two weeks to run a test. The problem is that since you are testing hypotheses, some of which are proven incorrect, it feels like the advancement of the product is slower. In actuality, you’re moving faster because you eliminate a lot of waste and complexity by not implementing features that are unsuccessful.
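How users are split into groups is not described here, but one common approach is to hash the user id so that each user consistently lands in the same group. The sketch below assumes that approach; the function name in_treatment, the experiment name and the use of SHA-256 are illustrative choices and not a description of Yammer’s framework.

```python
import hashlib


def in_treatment(user_id, experiment, treatment_share=0.5):
    """Deterministically assign a user to the treatment group of an experiment."""
    # Hashing experiment name and user id together gives each experiment its own split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < treatment_share


# Hypothetical experiment on 50% of the users.
users = [f"user-{i}" for i in range(10000)]
treated = [user for user in users if in_treatment(user, "signup_flow")]
print(len(treated) / len(users))
```

Because the assignment depends only on the user id and the experiment name, a user sees the same variant on every visit without any assignment table having to be stored.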

The core of A/B testing is to have a hypothesis. At Yammer hypotheses are rigorously formulated into if/then statements. For example: “If we increase the priority of groups, then more users will get work done in Yammer”. This will be broken down into smaller hypotheses that can more easily be tested, like: “If we increase the prominence of the group join button, then more users will join groups and engage more frequently with Yammer”.

How to avoid local maximum
A well-known problem with A/B testing, and any other incremental test method, is the problem of the local maximum. This happens when a product reaches the point where small changes no longer significantly improve it. At Yammer they have avoided the local maximum problem by periodically taking big bets, where they work on really big features. Even for bigger features, they’ll break down the project into small pieces so they can execute incrementally.

Getting started with A/B tests
I also asked Neil what he thought the current best practice for A/B testing was. Here is a list of four key ingredients in successful A/B testing for product managers.
1) Having the right hypotheses is necessary. If you don’t have well informed hypotheses, A/B testing will not help you no matter what degree of technical perfection you have.
2) Log everything users do. This is not to help the A/B test in itself, but in order to understand, post hoc, what happened. Why did the test go wrong? Why did the users not react as expected?
3) Have a solid A/B testing framework in place. Without the technical framework to do it you won’t succeed.
4) Put statistical rigor into the guidelines for conducting A/B tests. You need to make sure you are considering statistical significance when looking at the results, so that you only draw conclusions from effects that are unlikely to be due to chance.
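As an example of what considering statistical significance can mean in practice, here is a sketch of a two-proportion z-test, one standard way to compare conversion rates between a control and a treatment group. The article does not say which test Yammer uses, so the choice of test, the function name and the numbers are assumptions for illustration.

```python
from math import erfc, sqrt


def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return the z statistic and two-sided p-value for a difference in rates."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical result: 10% of control users converted versus 13% of treatment users.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(round(z, 2), round(p_value, 4))
```

With a conventional threshold of 0.05, a p-value below that level would be read as a statistically significant lift. The threshold itself is a guideline the team has to agree on up front, which is the point of the fourth ingredient.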