Lately, there has been a lot of fuss about data lakes and data warehouses at my employer. What I noticed is that everyone defined a data lake in a way that suited their particular problem, whether that was data collection, transformation, scale, or analytic capabilities. It has become a silver bullet, a magical thing that will solve all our problems with data collection and analysis. But as soon as two people from different departments start talking to each other about a data lake, it becomes obvious that we all lack a common understanding of what it actually is.
About a year before all this fuss started, my team was responsible for maintaining one of the bigger databases in the company. Although most of our stakeholders called it a data warehouse, this was not a traditional data warehouse. It did not have a star schema and it was not used to deliver KPIs and financial business reports. It was a place where we combined data from many data sources in order to analyze it as a whole and provide insight into customer behavior.
From a technology point of view, we started by using what was already available and familiar to the company: a SQL Server instance.
As data literacy in the company matured and people started to realize that the only way to accomplish their goals was by combining data from multiple systems, our scope expanded and our database and user base grew. It became obvious that we were not going to be able to sustain this with our technology stack at the time, so we started exploring other options. We experimented with several technologies, but in the end decided to build what is now popularly called a data lake.
But, first things first. We need to get on the same page with other departments on what this data lake thing is and how it will change their day-to-day work. How do we show them that this magical thing is not that magical at all? Given that most of our users are familiar with the concept of a data warehouse and with the SQL language, can we relate the data lake to the data warehouse and help them understand it?
Obviously, the problems I'm trying to describe here are not new or unique; they surface in every company sooner or later. As companies grow, the number of data sources grows with them. At some point the need arises to combine those sources so that we can get a complete picture of what is going on.
I would not call myself an expert when it comes to data warehouses, so you should obviously take what I'm saying here with a grain of salt and do further investigation. I will also oversimplify some aspects of the problem, because my goal is not to explain technical details, but rather to relate the technologies used to build a data lake to a less technical audience that might be familiar with the concepts of data warehousing. With that said, let's dig in.
The Data Warehouse and the Data Lake
Traditionally, relational databases like SQL Server, Oracle, PostgreSQL, and MySQL were used for building data warehouses. When that was not enough, people used specialized Massively Parallel Processing (MPP) database solutions, but those came at a cost.
At the same time, companies also started collecting unstructured data and using it for analytical purposes. This is not something relational databases are well equipped to deal with. Because of the enormous growth of data collection within companies of every size, we needed a more cost-effective solution for building distributed data warehouses, one that would also allow us to handle unstructured data.
What did not grow at the same rate as the data is people's ability to adapt to change. This means that if we want our new systems to be usable and our users to be productive with them, we need to maintain a familiar interface. Simply put, we need to allow our users to write SQL.
If we were to decompose relational database products, we would find that they all have similar components tied together to provide a service for storing and retrieving data. The figure below shows a simplified diagram of relational database product components. We can see four main components: the storage engine, the schema, the query engine, and programmability. In a real product, this picture is much more complicated, but what remains is that all these components run in the same process.
How does this relate to a data lake?
This figure shows a common (and basic) implementation of a data lake. You can see that the colors match the ones in the previous figure. In a data lake you can find the same components that exist in a data warehouse. They both have a storage engine, a schema definition, some type of query engine, and some way to write more complex programming logic.
Let's compare these components.
In a traditional single-machine data warehouse, we will have a storage engine that stores data on some local volume. That volume can be truly local to the machine or can be part of a SAN. In any case, we store data to some file system. A data lake has the same concept. It has a storage layer where we store our data files. The difference is that this is usually a distributed file system like HDFS or, in the AWS cloud, S3.
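To make that concrete, the storage layer for a single table is often just a folder of files. A hypothetical events table stored as partitioned Parquet files on S3 might look something like this (the bucket name and layout are made up for illustration):

```
s3://example-bucket/lake/events/dt=2021-01-01/part-00000.snappy.parquet
s3://example-bucket/lake/events/dt=2021-01-01/part-00001.snappy.parquet
s3://example-bucket/lake/events/dt=2021-01-02/part-00000.snappy.parquet
```

There is nothing special about these files; any tool that can read Parquet from S3 can work with them directly.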
Going one level up in the data warehouse, we start dealing with the schema. The schema maps our logical data model to the physical storage. This sounds like a useful thing to have if we want to be able to quickly explore and analyze data. And, of course, we have it in the data lake too. In the data lake world, this is usually called a metastore or a data catalog. The most widely used implementation is the Hive metastore. In the AWS cloud we can use the AWS Glue Data Catalog, which is AWS's implementation of the Hive metastore. The metastore stores information about our databases and tables. It maps them to the files stored in the storage layer.
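As a sketch of what the metastore holds, here is roughly what registering that hypothetical events table with Hive-style DDL looks like (the table name, columns, and location are assumptions carried over from the example above):

```sql
-- Register a table in the metastore over files that already exist.
-- Nothing is copied; the metastore only records the schema, the file
-- format, and where the files live.
CREATE EXTERNAL TABLE events (
    user_id STRING,
    action  STRING,
    amount  DOUBLE
)
PARTITIONED BY (dt STRING)   -- maps to the dt=... folders on S3
STORED AS PARQUET
LOCATION 's3://example-bucket/lake/events/';
```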
Now that we have a schema, we need to be able to query it. In the data lake world there are many query engines to choose from, but in the end they all serve the same purpose: allowing end users to write SQL to query the data, just as a traditional data warehouse does.
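Continuing with the hypothetical events table, a query sent to an engine like Hive, Presto, or Athena looks exactly like the SQL a data warehouse user would write:

```sql
-- The engine asks the metastore for the schema and file locations,
-- then scans the Parquet files on S3 to answer the query.
SELECT dt,
       COUNT(DISTINCT user_id) AS active_users
FROM events
WHERE dt BETWEEN '2021-01-01' AND '2021-01-31'
GROUP BY dt
ORDER BY dt;
```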
What about stored procedures and programmability? Querying data is often not enough; we need to be able to transform the data and store it in a new table. In a data warehouse we might write a SQL procedure to do that, but in the data lake we would do it a bit differently, for example with a "create table as select" statement or a job written for an engine like Spark. In any case, there is a replacement. Since this is not something end users are usually doing, it should not impact their productivity.
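Sticking with the hypothetical table above, one common replacement is a "create table as select" (CTAS) statement: the query engine writes the query result out as new files and registers a new table in the metastore. Hive, Spark SQL, Presto, and Athena all support some variation of this:

```sql
-- Instead of a stored procedure, materialize a transformation as a
-- new table: new files in the storage layer plus a new metastore entry.
CREATE TABLE daily_active_users AS
SELECT dt,
       COUNT(DISTINCT user_id) AS active_users
FROM events
GROUP BY dt;
```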
As you can see, there are a lot of similar concepts between a data warehouse and a data lake. These concepts are useful for demystifying the data lake, but they are not what makes it compelling; they are what make it familiar. What makes it compelling is that these components are no longer coupled to a single machine and a single process. They are decoupled and can be evolved and scaled separately. They can also be used separately. Data scientists can connect directly to the storage layer with specialized tools and produce new data, while analysts can connect to the query engine to do analysis on a well-defined schema. We don't need to constantly export and import data from one store to another. We can focus on using the data to provide insights.
One thing to keep in mind is that even though there are many similarities between traditional data warehouses and data lakes, and even though they are adopting each other's features, they are still not the same. There are scenarios in which one is better than the other. Because of that, people often use both in their solutions.
I don't know what makes sense for your use case, but I do hope this post can help you explain and demystify the data lake, either to yourself or to your colleagues.