The client, a leading company in the marketing and digital advertising sector, has started an IT transformation program to build a new data platform on Google Cloud Platform and migrate its existing data warehouse and reporting applications to it, with the following objectives:
- Adopting a new flexible, modular and scalable data platform, simplifying the current architecture by introducing a cloud-based technology stack
- Improving data quality
- Creating a unified data view that includes all the entities needed to describe the company's business and monitor its performance
- Anonymizing sensitive data in compliance with the GDPR
For the implementation of the data platform, Google Cloud Storage was used as a data lake and Google BigQuery as a data warehouse. The data was organized into zones:
- The Raw Zone holds the raw data, divided by status into landing (not yet processed), archived (successfully processed and archived), and invalid (processing failed), in addition to log data in the Logging zone.
- The Processed Zone, on BigQuery, holds structured data that has passed a first level of cleaning and normalization.
- The Refined Zone contains the transformed data in the unified view, available for analysis and reporting needs.
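The movement of files between the Raw Zone areas can be sketched as follows. This is a minimal illustration only: the prefix names and the `next_location` helper are assumptions, since the source names the areas but not their paths or the code that moves the data.

```python
# Hypothetical object-name prefixes for the three Raw Zone areas.
LANDING = "landing/"
ARCHIVED = "archived/"
INVALID = "invalid/"


def next_location(object_name: str, processed_ok: bool) -> str:
    """Return where a landing file should go after a processing attempt.

    Successfully processed files go to the archived area,
    files that failed processing go to the invalid area.
    """
    if not object_name.startswith(LANDING):
        raise ValueError(f"not a landing object: {object_name}")
    relative = object_name[len(LANDING):]
    return (ARCHIVED if processed_ok else INVALID) + relative
```

In a real pipeline the returned name would be used as the target of a copy or rename in the storage bucket; that step is left out here.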
The data loading system has been implemented in a flexible, modular and scalable way, allowing the addition of new flows and/or source systems with minimal impact.
The same flexibility characterizes the construction of the reporting views, where new KPIs can be added parametrically: the structures are updated automatically during the release phase, without manual changes to the code.
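One way to picture parametric KPIs is a view definition generated from a configuration mapping, so that adding a KPI means adding one entry rather than editing SQL by hand. The sketch below is an assumption about how this could work, not the project's implementation: `build_view_sql`, the KPI names, and the table and view names are all hypothetical.

```python
def build_view_sql(view: str, source: str, dimensions: list[str],
                   kpis: dict[str, str]) -> str:
    """Build a CREATE OR REPLACE VIEW statement from a KPI configuration.

    `kpis` maps each KPI name to the SQL aggregate expression that computes it.
    """
    select_items = list(dimensions) + [
        f"{expr} AS {name}" for name, expr in kpis.items()
    ]
    sql = (
        f"CREATE OR REPLACE VIEW `{view}` AS\n"
        "SELECT\n  " + ",\n  ".join(select_items) + "\n"
        f"FROM `{source}`"
    )
    # GROUP BY is only needed when there is at least one dimension.
    if dimensions:
        sql += "\nGROUP BY " + ", ".join(dimensions)
    return sql


# Hypothetical configuration: a new KPI is one more entry in this mapping.
example_sql = build_view_sql(
    view="refined.sales_kpis",
    source="refined.sales",
    dimensions=["month"],
    kpis={
        "total_revenue": "SUM(revenue)",
        "active_customers": "COUNT(DISTINCT customer_id)",
    },
)
```

The generated statement would then be executed at release time, which is how the view structure could follow the configuration without code changes.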
In the Refined Zone there are also predictive models, the results of which are in turn available for further analysis (e.g. customer churn prediction).
Processes are scheduled with Cloud Composer. For monitoring, including data quality checks, alerting thresholds have been defined that trigger automatic emails via Cloud Logging; a dashboard built with Google Data Studio is available for more in-depth analysis.
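The idea of alerting thresholds on data quality can be illustrated with a small check that compares measured metrics against configured limits. This is only a sketch under assumptions: the `Threshold` class, the metric names, and the limits are hypothetical, and the actual email delivery is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str       # name of the data quality metric (hypothetical)
    max_value: float  # an alert is raised when the metric exceeds this


def breached(metrics: dict[str, float],
             thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per metric that exceeds its threshold."""
    messages = []
    for t in thresholds:
        value = metrics.get(t.metric)
        # Metrics that were not measured are skipped rather than alerted on.
        if value is not None and value > t.max_value:
            messages.append(f"{t.metric}={value} exceeds {t.max_value}")
    return messages
```

Each returned message would correspond to one alert; an empty list means every measured metric is within its limit.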
To support governance and ensure that information can be traced from the source systems to the reporting views, a data lineage function has been implemented. It is accessible from the Google Data Catalog service, which also exposes the description and other metadata of all the entities (tables, fields, files…) in the data platform.
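Conceptually, tracing a reporting view back to its source systems is a walk over a dependency graph. The sketch below shows that idea in isolation; it does not use the Data Catalog API, and the `upstream_sources` function and the entity names are hypothetical.

```python
def upstream_sources(entity: str, parents: dict[str, list[str]]) -> set[str]:
    """Return the root entities (those with no parents) that feed `entity`.

    `parents` maps each entity to the entities it is directly derived from.
    An entity with no parents is returned as its own source.
    """
    seen: set[str] = set()
    roots: set[str] = set()
    stack = [entity]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited, also guards against cycles
        seen.add(node)
        direct = parents.get(node, [])
        if not direct:
            roots.add(node)
        stack.extend(direct)
    return roots


# Hypothetical lineage: a reporting view built from two processed tables.
lineage = {
    "reporting.sales_view": ["refined.sales"],
    "refined.sales": ["processed.orders", "processed.customers"],
    "processed.orders": ["raw/orders.csv"],
    "processed.customers": ["raw/customers.csv"],
}
sources = upstream_sources("reporting.sales_view", lineage)
```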
During implementation, the utmost attention was always paid to cost control, through continuous monitoring of the impact of the released code and verification of the application of best practices.