
Data in the cloud: Data Warehouse vs Data Lake vs Data Lakehouse



Now in 2022 there seem to be as many data storage solutions to choose from as there are data points to store. How does one choose between them? Is there a one-size-fits-all solution? I doubt there is a silver bullet, but if we break down the philosophies behind the different solutions we might be able to make a little more sense of it all.


The data warehouse architecture has been around since the 1990s. It is a system that stores highly structured data from different sources in a single repository. This allows business intelligence tools, such as Tableau or QlikView, to easily analyse the data and create reports.


Using a data warehouse can have some great benefits. It improves data quality and consistency within an organisation, which leads to higher-quality analytics and, in turn, better decision making. However, data warehouses are not without their drawbacks: a lack of support for unstructured data, along with high implementation and maintenance costs, can be major limiters for some organisations.
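To make the "highly structured" point concrete, here is a minimal sketch of the idea that a warehouse enforces its structure at load time. It uses Python's built-in sqlite3 module purely as a stand-in for a real warehouse; the sales table, its columns and the helper names are assumptions made up for illustration.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory database with a strict, predefined schema (hypothetical table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sales (
            sale_id INTEGER PRIMARY KEY,
            region  TEXT NOT NULL,
            amount  REAL NOT NULL CHECK (amount >= 0)
        )
        """
    )
    return conn


def load_rows(conn, rows):
    """Insert rows that satisfy the schema and return the ones that were rejected."""
    rejected = []
    for row in rows:
        try:
            # The connection context manager commits on success and rolls back on error.
            with conn:
                conn.execute(
                    "INSERT INTO sales (sale_id, region, amount) VALUES (?, ?, ?)", row
                )
        except sqlite3.IntegrityError:
            # A NOT NULL or CHECK constraint failed: the row never enters the store.
            rejected.append(row)
    return rejected


def revenue_by_region(conn):
    """A typical reporting query that can rely on the data being clean."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
```

Rows with a missing region or a negative amount are turned away at load time, which is roughly where the consistency benefit comes from, and also why data that does not fit the schema has no natural home here.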



Next we’ll look at the data lake. The term is believed to have been coined in 2010 by James Dixon, CTO of Pentaho, when he described the data lake on his blog:


“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

In essence, a data lake stores data in its raw form: unprocessed, unfiltered and unanalysed. This data could be relational, such as database tables; semi-structured, such as log files; unstructured, such as IoT data or social media data; or a mix of all three. Once the data is in the data lake, data scientists can crawl, catalogue, index and analyse it all from within a centralised repository.


Data lakes allow for great flexibility: all data, whether structured or unstructured, can be stored in the data lake. They are designed to use object storage, which allows for major cost savings compared with a data warehouse. Having a large amount of data in one place also makes it easier to apply machine learning algorithms to it. However, there are downsides. Data lakes can easily become disorganised, which can lead to performance degradation as the data lake grows in size.
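As a rough illustration of the "store it raw, make sense of it later" approach, the sketch below uses a temporary local folder as a stand-in for object storage. The key layout (sensors/, logs/), the temp_c field and all function names are hypothetical and chosen only for this example.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root, key, payload):
    """Store the payload bytes unchanged under an object-store-like key."""
    path = Path(lake_root) / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path


def read_sensor_temperatures(lake_root):
    """Apply structure only at read time, skipping anything that does not fit."""
    temps = []
    for path in sorted(Path(lake_root).glob("sensors/*.jsonl")):
        for line in path.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # raw data may contain junk; the lake accepted it anyway
            if isinstance(record, dict) and "temp_c" in record:
                temps.append(float(record["temp_c"]))
    return temps


def make_demo_lake():
    """Create a throwaway lake holding a mix of semi-structured and unstructured files."""
    root = Path(tempfile.mkdtemp())
    land_raw(root, "sensors/2022-01-01.jsonl",
             b'{"temp_c": 21.5}\nnot json\n{"humidity": 40}\n')
    land_raw(root, "logs/app.log", b"INFO started\n")
    return root
```

Nothing is rejected on the way in, which is the flexibility. The cost is that every reader has to decide what the data means and cope with whatever it finds, which hints at how a lake can become disorganised as it grows.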


The latest evolution in big data storage is the data lakehouse. As the name suggests, it attempts to combine the data structuring and data management of a data warehouse with the low-cost storage of a data lake. In 2019 AWS used the term ‘lake house’ in relation to Amazon Redshift Spectrum, a service that allows users to run SQL queries against data stored in Amazon S3 object storage.


Data lakehouses aim to give us the best of both worlds: we can run both business analytics workloads and machine learning workloads against the same data store. This removes the need for a separate data warehouse and data lake. We avoid the infrastructure cost of a data warehouse because we can use cost-effective object storage, just as with a data lake. As with any hybrid approach, a lakehouse adds a level of complexity, and because the approach is so new the available solutions aren’t as mature.
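The idea of running SQL over files that stay in cheap storage can be sketched in miniature. The toy below is not how Redshift Spectrum or any real lakehouse engine works internally: it simply loads JSON-lines files from a local folder into a throwaway in-memory sqlite3 table at query time. The readings table, its columns and the function names are assumptions for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def query_files(lake_root, sql):
    """Run SQL over raw JSON-lines files by applying a table schema at query time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device TEXT, temp_c REAL)")
    for path in sorted(Path(lake_root).glob("readings/*.jsonl")):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            conn.execute(
                "INSERT INTO readings VALUES (?, ?)",
                (record.get("device"), record.get("temp_c")),
            )
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def make_demo_files():
    """Write two small raw files into a temporary folder standing in for object storage."""
    root = Path(tempfile.mkdtemp())
    folder = root / "readings"
    folder.mkdir()
    (folder / "day1.jsonl").write_text(
        '{"device": "a", "temp_c": 20.0}\n{"device": "b", "temp_c": 30.0}\n'
    )
    (folder / "day2.jsonl").write_text('{"device": "a", "temp_c": 22.0}\n')
    return root
```

A call such as `query_files(make_demo_files(), "SELECT device, AVG(temp_c) FROM readings GROUP BY device")` gives a warehouse-style answer while the files themselves remain untouched and available to other tools, such as a machine learning pipeline.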


So which is the right approach to use? Well, that depends. The low cost of data lakes is appealing when data volumes get into the petabyte range, but the performance and consistency of a data warehouse are exactly what data intelligence professionals require. That leaves the lakehouse, which offers some of both, looking ever more tempting.



 
 
 



© 2022 IOT-WORX 

IOT-WORX is a tradename of Think Robotics Ltd.
