What is a Data Lake? A Super-Simple Explanation For Anyone
September 6, 2018 Bernard Marr
James Dixon, the CTO of Pentaho, is credited with naming the concept of a data lake. He uses the following analogy:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
A data lake holds data in an unstructured way, with no hierarchy or organization among the individual pieces of data. It holds data in its rawest form: not processed or analyzed. Additionally, a data lake accepts and retains all data from all data sources and supports all data types. Schemas (the way the data is organized in a database) are applied only when the data is ready to be used.
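To make the idea of applying a schema only when the data is used more concrete, here is a minimal sketch. Everything in it is hypothetical: the `raw_lake` list stands in for a real storage system, and the `read_with_schema` helper and field names are made up for illustration.

```python
import json

# Hypothetical "lake": raw records kept exactly as they arrived, with no schema enforced.
raw_lake = [
    '{"user": "ana", "amount": "19.99", "ts": "2018-09-01"}',
    '{"user": "bo", "clicks": 7}',
    'free-text note: customer called about an invoice',
]

def read_with_schema(lake, schema):
    """Apply a schema at read time only.

    Keeps records that parse as JSON objects and contain every field in the
    schema, casting each value to the requested type. Nothing in the lake
    itself is modified or discarded.
    """
    rows = []
    for raw in lake:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unstructured text stays in the lake, just not in this view
        if not isinstance(record, dict):
            continue
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows

# The schema is chosen by whoever reads the data, at the moment they need it.
sales_schema = {"user": str, "amount": float}
sales = read_with_schema(raw_lake, sales_schema)
```

A different question would simply use a different schema (for example `{"user": str, "clicks": int}`) over the same untouched raw data.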
What is a data warehouse?
A data warehouse stores data in an organized manner with everything archived and ordered in a defined way. When a data warehouse is developed, significant effort goes into the initial stages to analyze data sources and understand business processes.
Data lakes retain all data: structured, semi-structured and unstructured/raw. It’s possible that some of the data in a data lake will never be used. A data warehouse includes only data that has been processed (structured), and only the data needed for reporting or for answering specific business questions.
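The difference in what each store keeps can be sketched in a few lines. This is an illustration under stated assumptions, not a real architecture: an in-memory SQLite table plays the role of the warehouse, a plain Python list plays the role of the lake, and the `sales` table and its columns are invented for the example.

```python
import sqlite3

# Hypothetical incoming data of mixed shape.
incoming = [
    {"customer": "ana", "amount": 19.99},
    {"customer": "bo", "clicks": 7},       # no amount: not needed for sales reporting
    "free-text note from a support call",  # unstructured
]

# The lake keeps everything exactly as it arrived.
lake = list(incoming)

# The warehouse has a structure defined up front and only accepts data that fits it.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
for item in incoming:
    if isinstance(item, dict) and item.keys() >= {"customer", "amount"}:
        warehouse.execute(
            "INSERT INTO sales (customer, amount) VALUES (?, ?)",
            (item["customer"], item["amount"]),
        )

warehouse_rows = warehouse.execute("SELECT customer, amount FROM sales").fetchall()
```

Here the lake holds all three items, while the warehouse holds only the one record that matches the structure designed for the sales report.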
Since a data lake lacks structure, it’s relatively easy to make changes to models and queries.
Data scientists are typically the ones who access the data in data lakes because they have the skill set to do deep analysis.
Since data warehouses are more mature than data lakes, the security for data warehouses is also more mature.