Navigating an Identity Crisis in the Data Industry

Many data teams are currently grappling with a set of pressing challenges in how they manage data. Even for technically strong organizations, the problem can feel daunting. The key to tackling it effectively is to prioritize the most valuable, highest-quality data and to focus quality efforts where they matter most.

Teams are using “data warehouses,” but not in the conventional sense of an integration layer. Data engineers find themselves responsible for business logic yet lack the time and space to fully understand it. Companies want AI, yet they are relying on data pipelines designed for analytics.

The primary culprit behind this situation is the pressure from businesses to decentralize data teams, while the tools and processes they employ remain centralized.

In the on-premises data world, cost constraints led data teams to act as gatekeepers: applying architectural principles, managing ETL processes, and modeling data early in the pipeline.

However, the shift to the cloud, with its separation of storage and compute resources, coupled with the adoption of Agile software development and microservices, empowered engineering teams to independently push vast amounts of data into data lakes from various sources. The assumption was that the traditional data warehouse would still serve as the integrator of this data into a unified, consumable format for the business. The problem, however, lies in the human cost of maintaining cloud infrastructure, which is neither inexpensive nor straightforward.

Data infrastructure teams now face the challenge of managing platforms like Snowflake, Databricks, and Redshift, handling access control, implementing ELT systems, deploying streaming solutions like Kafka or Kinesis, processing data upon arrival, and managing modern data stack tools like dbt and Airflow, among other responsibilities.
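To make that surface area concrete, here is a minimal sketch, assuming Airflow 2.4+ and a dbt project, of the kind of glue an infrastructure team ends up owning: a DAG that lands raw data and then rebuilds downstream models. The DAG id, file paths, and commands are hypothetical.

```python
# A minimal sketch of infrastructure glue: load raw data, then run dbt.
# Assumes Airflow 2.4+; paths, names, and commands are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="raw_orders_elt",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Land raw data in the warehouse (placeholder loader script).
    load_raw = BashOperator(
        task_id="load_raw_orders",
        bash_command="python /opt/pipelines/load_orders.py",
    )

    # Rebuild downstream models with dbt once the raw load finishes.
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="cd /opt/dbt/analytics && dbt build --select orders+",
    )

    load_raw >> run_dbt
```

Every one of these pipelines is another thing the infrastructure team has to keep running when something upstream changes.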

All of this must be managed in the context of a rapidly expanding tech business with new APIs and databases emerging daily. Constant changes in schema and business logic mean that existing pipelines break, and data engineers end up serving as intermediaries to resolve these issues.
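To illustrate that intermediary work, here is a hedged sketch in plain Python of the kind of schema-drift check an engineer bolts onto a pipeline to catch upstream changes before they break downstream models. The table columns and types are hypothetical.

```python
# A minimal schema-drift check: compare incoming columns against an
# expected schema and report anything that would break downstream models.
# Column names and types are hypothetical.
EXPECTED_COLUMNS = {
    "order_id": "string",
    "customer_id": "string",
    "amount": "double",
    "created_at": "timestamp",
}

def check_schema(actual_columns: dict[str, str]) -> list[str]:
    """Return human-readable descriptions of schema drift."""
    problems = []
    for name, dtype in EXPECTED_COLUMNS.items():
        if name not in actual_columns:
            problems.append(f"missing column: {name}")
        elif actual_columns[name] != dtype:
            problems.append(
                f"type change on {name}: expected {dtype}, got {actual_columns[name]}"
            )
    for name in actual_columns.keys() - EXPECTED_COLUMNS.keys():
        problems.append(f"unexpected new column: {name}")
    return problems

if __name__ == "__main__":
    # Simulate an upstream team renaming a column and changing a type.
    incoming = {
        "order_id": "string",
        "customer_uuid": "string",   # renamed from customer_id
        "amount": "decimal(10,2)",   # type changed
        "created_at": "timestamp",
    }
    for problem in check_schema(incoming):
        print(problem)
```

Checks like this catch the breakage, but someone still has to chase down the producing team and negotiate the fix, which is where the intermediary role comes from.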

Once the core data products are set up, most data engineers lack the capacity to truly understand the business. Simultaneously, the surge in AI and machine learning has led to data scientists and researchers leveraging large volumes of raw and processed data for model training. These AI/ML teams operate as product teams, adhering to the Agile manifesto for quick iteration and short deployment cycles. Data scientists build features on top of early ad hoc analytics pipelines, release them into production, and eventually encounter significant data quality problems at scale.

The outcome? No one is content. Analytics pipelines go unmaintained, AI/ML pipelines break and cause disruptions, data engineers are overwhelmed, there is no genuine data warehouse, and data quality issues persist even as more data gets dumped into the data lake every day.

Addressing this issue requires a reset:

  1. Develop a clear data strategy that distinguishes the technical requirements for Business Intelligence (BI) and AI.
  2. Establish a practical ownership model that holds both data producers and consumers accountable for data quality (a sketch of what this can look like follows the list).
  3. Embrace an iterative development model that emphasizes value and collaboration.
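On the second point, one common mechanism is a lightweight data contract that the producing team checks before publishing and the consuming team re-checks before training or reporting. The sketch below assumes nothing beyond the Python standard library; the dataset name, required fields, and null-rate threshold are hypothetical.

```python
# A minimal data-contract sketch: the same checks run by the producer
# before publishing and by the consumer before using the data.
# Dataset name, fields, and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Contract:
    dataset: str
    required_fields: set[str]
    max_null_rate: float  # allowed fraction of null values per field

ORDERS_CONTRACT = Contract(
    dataset="analytics.orders",
    required_fields={"order_id", "customer_id", "amount"},
    max_null_rate=0.01,
)

def validate(rows: list[dict], contract: Contract) -> list[str]:
    """Return a list of contract violations for a batch of rows."""
    if not rows:
        return [f"{contract.dataset}: no rows delivered"]
    violations = []
    for field in contract.required_fields:
        nulls = sum(1 for row in rows if row.get(field) is None)
        rate = nulls / len(rows)
        if rate > contract.max_null_rate:
            violations.append(
                f"{contract.dataset}.{field}: null rate {rate:.2%} "
                f"exceeds {contract.max_null_rate:.2%}"
            )
    return violations

if __name__ == "__main__":
    sample = [
        {"order_id": "1", "customer_id": "a", "amount": 10.0},
        {"order_id": "2", "customer_id": None, "amount": 12.5},
    ]
    print(validate(sample, ORDERS_CONTRACT))
```

The specific checks matter less than the shared ownership: the producer cannot publish a batch that fails the contract, and the consumer does not silently build on data that does.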

Regrettably, it’s primarily large tech companies like Google and Facebook that successfully implement these three points, often due to their ability to allocate significant resources to the problem. These companies have built effective data development environments through substantial investment in internal tools and processes.

A potential solution to many of these issues lies in the hands of an executive with a technical background. This individual can introduce the necessary frameworks to address the scalability challenges facing all data projects. The C-suite may not necessarily understand the intricacies of AI/ML or data pipelines, but they do care about efficiency, scalability, profitability, and other key performance indicators (KPIs). Until such individuals are included in strategic discussions, chaos may continue to reign.

Businesses have grown accustomed to the “move fast and break things” approach, and it may take a high-profile failure in the industry to drive reform. That approach is not the worst strategy in itself; the real problem is the lack of connection between data products and the actual data, and the absence of a strong link between data and code. Without those links, it all feels somewhat meaningless: data product owners have little incentive to go through the registration process unless it serves a tangible purpose beyond documentation.
