The tooling and infrastructure that help organisations extract value from their data assets have changed dramatically in the last decade. Among the key changes are:
- Move to the cloud: The cloud offers elastic computing resources and cheap storage for data assets. This in turn has driven powerful data warehousing technologies such as BigQuery, Snowflake, Amazon Redshift and Firebolt. The pay-per-use model lets you start small and quickly scale to massive volumes of data.
- ELT over ETL: The traditional (E)xtract, (T)ransform and (L)oad approach to data pipelines took weeks, if not months, to create any tangible business value. As warehouses became powerful enough to process data at scale, ELT emerged as a more flexible (or agile) and cost-effective way to create clean, analysis-ready datasets. It also makes it possible to outsource and standardise the process of getting data into the warehouse. Tools like Fivetran, Stitch, HevoData and StreamSets make it a breeze to aggregate data in one place for analysis at a later point in time.
- Deeper integration of data science and domain models: A modern data platform generates data once and allows it to be reused across business queries and machine learning models. Seamless integration with BI and data science tools lets consumers of data access it right in the tools of their choice.
- Drive to self-service: Mature data platforms allow users to easily discover and analyze data. Enterprises are building towards a future where non-technical business users and marketers can answer their business questions and derive insights independently. This lets the company move faster than ever before, without being bottlenecked by the IT team.
- Break data silos: In a world where software is radically changing every department of an enterprise, mature data organisations are breaking silos by moving data across systems easily. Centrally managed, gold-standard datasets are the norm for a mature data practice.
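The ELT pattern described above can be sketched in a few lines: raw records are loaded into the warehouse untouched, and the cleanup happens afterwards in SQL, inside the warehouse itself. A minimal toy sketch using SQLite as a stand-in warehouse (the table and column names are purely illustrative):

```python
import sqlite3

# Stand-in "warehouse": in a real platform this would be BigQuery, Snowflake, etc.
conn = sqlite3.connect(":memory:")

# (E)xtract + (L)oad: land the raw source rows as-is, with no cleanup yet.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
raw_rows = [("o1", "19.99", "us"), ("o2", "5.00", "US"), ("o3", "12.50", "de")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# (T)ransform: done later, inside the warehouse, in plain SQL.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount,
           UPPER(country)       AS country
    FROM raw_orders
""")

# Analysts query the cleaned table, not the raw landing zone.
revenue_by_country = conn.execute(
    "SELECT country, ROUND(SUM(amount), 2) FROM orders_clean"
    " GROUP BY country ORDER BY country"
).fetchall()
print(revenue_by_country)  # → [('DE', 12.5), ('US', 24.99)]
```

The key point is the ordering: because the load step does no transformation, new sources can start flowing immediately, and the SQL cleanup can evolve independently afterwards.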
This has led to the rise of a new era of infrastructure that supports a modern, data-driven enterprise. The components of such a modern data platform are:
- Data Integration: This is where you start building out the modern data platform. The goal of this component is to transport data from a variety of source systems into your data storage layer.
- Transformation: Once the data is loaded, this component transforms it into human-usable metrics and features that represent business processes. Two main approaches have emerged in this space: SQL-based transformations run inside the data warehouse, and custom transformations written in general-purpose programming languages.
- Orchestration: As the system grows, an orchestration system such as Apache Airflow or Argo becomes imperative for managing complex workflows and dependencies across the data platform.
- Data Catalog and Governance: To enable the use of data across your organisation, managing metadata is essential. There is a lot of innovation in the open-source space around data catalog tools that enable a highly collaborative environment for the users of your data platform.
- Advanced Analytics: Presenting business insights and operationalising predictions back into consumer applications enable higher revenue and more efficient business operations for an enterprise.
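The orchestration component's core job, running tasks only after their dependencies have finished, can be sketched with the Python standard library alone. This is a toy illustration of what tools like Airflow or Argo do at far larger scale (the task names are made up for the example):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on; names are illustrative.
dag = {
    "extract_orders":   set(),
    "extract_users":    set(),
    "transform_orders": {"extract_orders"},
    "build_dashboard":  {"transform_orders", "extract_users"},
}

# A topological sort yields each task only after all of its dependencies,
# which is the core scheduling guarantee an orchestrator provides.
order = list(TopologicalSorter(dag).static_order())

for task in order:
    print(f"running {task}")  # a real orchestrator would trigger an operator here
```

Real orchestrators add retries, scheduling, backfills and parallel execution on top of this ordering guarantee, but the dependency graph is the heart of it.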
Modern data platforms can’t be bought; they are built from plug-and-play blocks.