
Data Lake Architecture: How to Prepare a Business for the Challenges of the Data-Driven World

In today’s world, every modern business operates in the digital space. Every organization has to go through various stages of digitization – from basic bookkeeping of financial transactions to advanced analytics, predictive modeling, and the development of data-driven digital services.

For most businesses, it all starts with Excel spreadsheets. However, as the number of data sources within the organization grows, so does the need to use various business applications, such as CRMs, ERPs, or WMSs. Yet, in most cases, reports and data analysis are still done in Excel.

Eventually, this leads to a proliferation of reports built on data exported from various systems. Preparing the data for reporting takes manual effort every day, and merging data from different systems can be a challenge. When it comes to updating data from past periods, it can turn into a nightmare. Finding errors is equally complicated, as the data behind the reports has to be processed by hand; in many cases, it would actually be easier to rebuild the whole report from scratch.

“Even Excel is better than nothing”

As a first step, it is always necessary to interview the departments within the company where data is generated, where reports are created, and where reports are used. In the case described in the cited article, this is exactly what they did, and it turned out that employees used Excel workbooks as their source, loading them into Power BI to create reports and visualizations.

Of course, this is not the most efficient method: in most cases, it is possible to connect directly to the data in the ERP system, and the calculations themselves are also much easier to do in Power BI.

They then reviewed the existing reports and compared them with the data sources. It turned out that data was imported not only from Excel workbooks but also from other sources: files containing design data and reports from external cloud-based SaaS services. They even used PDF files exported from their vehicle fleet tracking provider, containing reports on the movements of the company's own vehicles.

For this reason, they created an Excel file listing every source and its properties (access method, granularity, refresh frequency, change history, etc.). This is essentially metadata, and purpose-built software exists for managing it: the data catalog. However, to establish a data culture within a company, it is best to start small, and the Excel file was a perfect first step towards a future catalog.
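As an illustration, the kind of per-source metadata such a catalog tracks might look like the following Python sketch; the field names and example entries are our own assumptions, not taken from the project.

```python
from dataclasses import dataclass

# Hypothetical illustration of the metadata the Excel "catalog" tracked for
# each source; the field names and sample entries are our own, not the project's.
@dataclass
class SourceEntry:
    name: str             # e.g. "ERP general ledger"
    access_method: str    # e.g. "direct DB connection", "Excel export", "PDF report"
    granularity: str      # e.g. "one row per posting"
    refresh_cadence: str  # e.g. "daily", "ad hoc manual export"
    keeps_history: bool   # whether changes to past periods are retained

catalog = [
    SourceEntry("ERP general ledger", "direct DB connection", "one row per posting", "daily", True),
    SourceEntry("Fleet tracking reports", "PDF export", "one row per trip", "weekly", False),
]

for entry in catalog:
    print(f"{entry.name}: {entry.access_method}, refreshed {entry.refresh_cadence}")
```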

Architecture

The data used had to be regularly extracted from sources and manually updated to ensure that analysts had access to the most detailed level of data possible. Reports had to be generated on an ad hoc basis, as it was impossible to predict when specific reports would be needed.

Based on this information, the classic approach of building a corporate data warehouse was discarded during the survey: the decision was made not to create a unified data model with facts, dimensions, and predetermined formulas for calculating the reports. The conclusion was that the most appropriate architecture in this case was the Data Lake. Of course, the Data Lake architecture has its advantages and disadvantages, and it is always worth examining whether it is the best solution for a given situation.

The Data Lake architecture can be implemented using various software tools, such as an open-source stack that uses Apache components. In the example given in the article, the client already had a subscription to Microsoft Azure, and some parts of their business applications were using cloud services such as MS Dynamics 365. It should be noted that MS Azure has undergone significant development in recent years and now includes all the necessary tools for developing enterprise analytics and Big Data systems. Therefore, it was decided to set up the system using MS Azure.

A Serverless SQL Pool was used for the detailed data, a Dedicated SQL Pool for the data marts, and Synapse Pipelines for data extraction, transformation, and loading (ETL). Power BI can connect to both storage layers in import mode and in DirectQuery mode, so users are not limited in further data processing, and it remains possible to write DAX measures.
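To give a feel for how the detailed layer is accessed, here is a hedged sketch of querying Parquet files in the lake through the Serverless SQL Pool endpoint from Python; the workspace, storage, and folder names are placeholders, not the project's actual resources.

```python
import pyodbc

# Hypothetical connection to a Synapse serverless SQL pool endpoint; the
# workspace, storage account, and folder names below are placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Serverless SQL pools query files in the lake directly with OPENROWSET,
# so detailed data stays in cheap storage until a report actually needs it.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/lake/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.execute(query):
    print(row)
```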

Let’s eat the cake slowly

There are many ways to implement a project, and different methodologies can be applied during the implementation. However, nobody likes to compile lengthy technical requirements, and everyone wants quick results. Therefore, for the project discussed in this article, they decided to implement it using short iterations and an agile development style.

A project aimed at building a BI system can be implemented in short sprints. For example, one can start with processing all data sources and add a new processing stage in each iteration. A sprint can be focused on the initial load into the operational data store, another on loading into the data warehouse, and further sprints can be designated for creating reports in the Data Mart.

In this case, a Data Lake architecture was used, with only two layers but several dozen data sources. It was therefore logical to divide the work into sprints, bringing a few sources in per sprint and completing the entire development process for them, from ingestion to the finished Data Mart.

Additionally, this is very convenient for users, as the data for a report often comes from 2-3 sources. The results of each iteration therefore give analysts data they can use to move one or more existing reports to the Data Lake. This allows a gradual transition to the new data platform and avoids pulling analysts away from their work to verify a large amount of data in a short time.

Finally, each iteration consisted of the following:

• Developing an adapter to extract data

• Initial loading

• Configuring the data update process

• Calculating the Data Mart

• Checking the Data Mart data by users (analysts)

• Transitioning reports from Excel to Azure Synapse

In this project, a total of 37 data source systems were connected. These included Excel spreadsheets, relational databases, external services with APIs, and other sources.
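The per-source adapters mentioned in the first step of each iteration could be sketched as follows; this is a hypothetical Python illustration of the pattern (the real project built these steps with Synapse Pipelines), and the class and function names are our own.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional

# Hypothetical sketch of a per-source adapter mirroring the iteration steps
# above: develop an adapter, run the initial load, then refresh incrementally.
class SourceAdapter(ABC):
    @abstractmethod
    def extract(self, since: Optional[str] = None) -> Iterable[dict]:
        """Return rows from the source; `since` limits the pull to newer data."""

class ExcelWorkbookAdapter(SourceAdapter):
    def __init__(self, path: str):
        self.path = path

    def extract(self, since: Optional[str] = None) -> Iterable[dict]:
        import openpyxl  # assumed to be available for reading .xlsx workbooks
        sheet = openpyxl.load_workbook(self.path, read_only=True).active
        rows = sheet.iter_rows(values_only=True)
        header = next(rows)
        for row in rows:
            yield dict(zip(header, row))

def initial_load(adapter: SourceAdapter) -> list:
    # Full load on the first run; later runs would pass `since` for incremental refresh.
    return list(adapter.extract())
```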

Learning from mistakes

Of course, no agile development process is free of mistakes and stabilization work. Rather than chasing immediate results, they opted for in-depth analysis, and it became clear that placing all the data from the Data Lake into the Dedicated SQL Pool would be costly.

To save the client money, they suggested loading the Power BI reports in import mode instead of DirectQuery. The prepared data is then stored in the Power BI data model, so it does not consume Dedicated SQL Pool resources and remains available regardless of whether the Pool is running.

Furthermore, they configured dataset refresh for all reports, with the underlying data refresh driven by Synapse Pipelines. Finally, they had to arrange the scheduled pausing and resuming of the Dedicated SQL Pool. As a result, service costs decreased nearly fourfold.
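A scheduled pause and resume of the pool could be scripted roughly like this with the Azure SDK for Python; the resource names are placeholders, and the exact operation names may differ between azure-mgmt-synapse versions, so treat this as a sketch rather than the project's actual setup.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

# Hedged sketch: resource names are placeholders, and the operation names may
# differ between azure-mgmt-synapse versions (pause/resume vs. begin_pause/begin_resume).
credential = DefaultAzureCredential()
client = SynapseManagementClient(credential, subscription_id="<subscription-id>")

def pause_pool(resource_group: str, workspace: str, pool: str) -> None:
    # Pausing stops compute billing while keeping the data marts stored in the pool.
    client.sql_pools.begin_pause(resource_group, workspace, pool).result()

def resume_pool(resource_group: str, workspace: str, pool: str) -> None:
    client.sql_pools.begin_resume(resource_group, workspace, pool).result()

# In practice this would run on a schedule (e.g. an Azure Functions timer),
# resuming the pool before the working day and pausing it afterwards.
```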

The second problem involved the external SaaS from which data on the company's own vehicle fleet movements was extracted. As is often the case, the service provided its data through a JSON API; however, the JSON it returned was not entirely well formed. With smaller payloads this caused no problems, but as soon as the client requested the detailed GPS coordinates for each trip, it became clear that the standard Synapse Pipelines tools could not work with the improperly formatted JSON. The solution was to bulk load the JSON texts into SQL and parse them there.
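The article does not specify how the JSON was malformed, so the following Python sketch simply assumes one common failure mode (records concatenated without a wrapping array) and otherwise keeps the raw text so it can be loaded and parsed on the SQL side; the function and URL handling are our own illustration.

```python
import json
import requests

# Assumption: the API sometimes returns one JSON object per line instead of a
# proper array. The raw text is always kept so it can be bulk loaded into SQL.
def fetch_trips(url: str) -> tuple:
    raw = requests.get(url, timeout=60).text
    try:
        parsed = json.loads(raw)
        records = parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError:
        records = []
        for line in raw.splitlines():
            line = line.strip().rstrip(",")
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                pass  # leave bad fragments to the SQL-side parsing of the raw text
    return records, raw
```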

They also ran into another problem, again with an external service. For some internal reason, all of its access keys were revoked, so the ETL processes began receiving error messages (HTTP 403) instead of the usual data. This in itself would not have been a big problem, but the API also proved to be quite unstable, so the configuration had to be changed. They solved the problem by capping requests at 30 per minute.
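A minimal sketch of that kind of throttling in Python might look like the following; the 30-requests-per-minute cap comes from the article, while the retry behavior and back-off are our own assumptions.

```python
import time
import requests

# Cap calls to the external API at 30 per minute and retry on HTTP 403.
MIN_INTERVAL = 60.0 / 30  # 30 requests per minute -> at least 2 seconds apart
_last_call = 0.0

def throttled_get(url: str, retries: int = 3) -> requests.Response:
    global _last_call
    response = None
    for attempt in range(retries):
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        response = requests.get(url, timeout=60)
        if response.status_code != 403:
            return response
        time.sleep(2 ** attempt)  # back off before retrying a rejected call
    response.raise_for_status()
    return response
```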

Conclusion

The most obvious benefit of the system is the saving of man-months, or even years, of work previously spent on manual data processing. That alone can cost a business a great deal, but the main advantage lies elsewhere: the time between a request from top management and the completed data analysis is drastically reduced. Almost any analysis request can now be fulfilled within a working day, and this time saving is very difficult to overestimate from the perspective of corporate decision-making.

Of course, there may be cases where there is not enough data in the data model to answer a management request. Fortunately, Power BI provides the ability to connect new sources to the data model with just a few clicks.

Overall, we can say that Data Lake analysis systems represent a new step in digitization. From now on, analysts can work ever closer to the self-service principle, and top management always has access to up-to-date information for quick and accurate decision-making.