
A single data quality issue cost a team I worked with roughly 50 hours last quarter.
Not because the transformation logic was complex. Not because the platform failed. But because everything around the issue (detection, triage, validation, coordination, and trust recovery) was fragmented and reactive. The actual code fix took about 6 hours. The other 44 were never attributed to the incident in any sprint board, retro, or capacity plan.
According to Gartner, poor data quality costs enterprise organizations an average of $12.9 million per year. That figure comes from a survey of large enterprises already investing in data quality solutions, so the number skews toward organizations with the infrastructure to measure the problem. For mid-market teams, the dollar amount is smaller but the pattern is identical. IBM has estimated that bad data costs the U.S. economy $3.1 trillion annually, a figure highlighted by Thomas C. Redman in Harvard Business Review.
These numbers feel abstract until you see how they materialize inside a single engineering team over one quarter. What follows is the cost of poor data quality at the ground level, measured in engineering hours, and an explanation of why most teams underestimate the true impact by 5 to 8x.
Where the 50 hours actually go
The incident was a composite of recurring issues observed across a media analytics pipeline. The pipeline combined data from campaign managers (paid media) and business managers (organic data) into weekly snapshots and monthly aggregates. The reports were used for budget decisions and campaign analysis.
Two problems surfaced. First, the same date range produced different numbers on different days, because weekly data was refreshed in place and historical values changed with each refresh. Second, the sum of weekly impressions did not match monthly impressions. In a stable system, neither should happen.
The root causes were layered: late-arriving API data, attribution window changes, source corrections, missing snapshot and versioning logic, and aggregation inconsistencies across time windows.
Here is how the 50 hours actually broke down:
Detection took about 4 hours. Debugging the source versus aggregation mismatch consumed 12. The actual pipeline and logic fix required 6 hours. Reprocessing took 8. Validation across weekly and monthly views took another 8. Stakeholder communication, including incident calls, metric explanations, and reassurance loops, accounted for 8 hours. Prevention work, adding checks and snapshots to stop recurrence, took the remaining 4.

The fix and reprocessing were trackable: 14 hours that showed up in sprint records. The other 36, scattered across detection, debugging, validation, stakeholder communication, and prevention, were never attributed to the data quality incident. That is the gap. Teams track the resolution. They miss the investigation, the coordination, and the trust repair.
The pattern across incidents
The 50-hour breakdown above is one incident, but the pattern is consistent across data quality issues I have observed over multiple engagements. Roughly 70% of total effort goes to understanding and validation. About 20% goes to coordination and alignment. Only 10% is the actual code fix.
This leads to a critical conclusion. You do not have a development bottleneck. You have a data reliability and observability bottleneck.
The build versus maintenance reality
Most engineering leaders believe their teams spend about 70% of their time building and 30% on maintenance. In my experience, the actual split is closer to 40% building and 60% debugging, validating, and reworking.
This is not just an internal observation. DORA's State of DevOps research shows that even high-performing teams spend only about half their time on new feature work, with lower performers dropping to 30 to 40 percent. Ascend.io's 2023 State of Data Engineering report found that the majority of data engineers allocate 50% or more of their time to maintaining existing programs and infrastructure. Fivetran research found similar numbers, with about half of engineering effort going to maintenance rather than new value creation.
Poor data quality is the primary driver of this imbalance. It silently converts engineering teams into maintenance organizations. Hiring more engineers does not solve this. More engineers create more pipelines, which create more dependencies. Without strong foundations, complexity grows faster than team capacity.
What I built to fix it
I led the stabilization of the reporting layer in Databricks by addressing the inconsistencies across time windows and rebuilding the pipeline architecture from ingestion through reporting.
The first step was replacing overwrite patterns with append-only Delta Lake tables and implementing snapshot versioning. This eliminated the core problem: historical data changing on refresh. I partitioned data by event date and ingestion timestamp so every snapshot was reproducible. Delta time travel became the primary debugging tool, allowing us to compare any two points in the pipeline's history without reprocessing.
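The production version relied on Delta Lake, but the core idea, append-only writes keyed by event date and ingestion timestamp with reproducible as-of reads, can be sketched in plain Python. The schema and values below are illustrative, not the production tables:

```python
from datetime import datetime

# Append-only log: corrections arrive as new rows, never in-place updates.
# Each row carries its event_date and ingestion timestamp, mirroring the
# partitioning scheme described above (illustrative schema).
log = [
    {"event_date": "2024-03-04", "impressions": 1000, "ingested_at": datetime(2024, 3, 5)},
    # A late correction from the API arrives two days later:
    {"event_date": "2024-03-04", "impressions": 1150, "ingested_at": datetime(2024, 3, 7)},
]

def snapshot_as_of(log, as_of):
    """Reconstruct the table as it looked at a past point in time by keeping
    the latest ingested row per event_date (a toy stand-in for Delta time
    travel)."""
    latest = {}
    for row in sorted(log, key=lambda r: r["ingested_at"]):
        if row["ingested_at"] <= as_of:
            latest[row["event_date"]] = row
    return latest

before = snapshot_as_of(log, datetime(2024, 3, 6))
after = snapshot_as_of(log, datetime(2024, 3, 8))
print(before["2024-03-04"]["impressions"])  # 1000: what the report showed then
print(after["2024-03-04"]["impressions"])   # 1150: what it shows after the correction
```

Because nothing is ever overwritten, any two historical states can be compared without reprocessing, which is what made time travel usable as a debugging tool.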
Next, I implemented data quality validation using Delta Live Tables. DLT expectations enforced checks at the ingestion and transformation stages: null checks on critical fields like campaign ID and date, uniqueness constraints for deduplication, and conditional checks to catch metrics outside valid ranges. Critical failures halted the pipeline. Non-blocking anomalies were logged for review. This shifted data quality left. Issues were caught during pipeline execution, not after dashboards broke.
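The expectation syntax itself is DLT-specific, but the gating logic is easy to approximate in plain Python. Field names and thresholds here are hypothetical; critical failures halt processing while non-blocking anomalies are collected for review:

```python
# Plain-Python approximation of the validation gates described above; the
# production version expressed these as Delta Live Tables expectations.

def validate(records):
    warnings = []
    seen = set()
    for r in records:
        # Critical: null checks on fields the reporting layer depends on.
        if r.get("campaign_id") is None or r.get("event_date") is None:
            raise ValueError(f"critical: missing campaign_id/event_date in {r}")
        # Critical: uniqueness constraint for deduplication.
        key = (r["campaign_id"], r["event_date"])
        if key in seen:
            raise ValueError(f"critical: duplicate record for {key}")
        seen.add(key)
        # Non-blocking: metrics outside a plausible range are logged, not fatal.
        if not (0 <= r.get("impressions", 0) <= 10_000_000):
            warnings.append(f"impressions out of range for {key}")
    return warnings

rows = [
    {"campaign_id": "c1", "event_date": "2024-03-04", "impressions": 1200},
    {"campaign_id": "c2", "event_date": "2024-03-04", "impressions": -5},
]
print(validate(rows))  # one out-of-range warning; the pipeline continues
```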
I refactored the pipeline into a medallion architecture: Bronze for raw ingestion that could handle late and mutable API data, Silver for cleaned and deduplicated records validated through DLT, and Gold for aggregated reporting tables where weekly and monthly views were built from the same source logic.
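A toy sketch of that flow, with illustrative field names rather than the production schema: Bronze keeps everything as received, Silver cleans and deduplicates, and Gold aggregates from the same Silver source so the views cannot diverge by construction.

```python
# Hypothetical medallion flow in miniature.
bronze = [
    {"campaign_id": "c1", "event_date": "2024-03-04", "impressions": 1000, "ingested_at": 1},
    {"campaign_id": "c1", "event_date": "2024-03-04", "impressions": 1150, "ingested_at": 2},  # late correction
    {"campaign_id": None, "event_date": "2024-03-04", "impressions": 50, "ingested_at": 3},    # invalid row
]

def to_silver(bronze_rows):
    """Clean and deduplicate: drop invalid rows, keep the latest ingested
    row per (campaign_id, event_date) key."""
    latest = {}
    for r in sorted(bronze_rows, key=lambda r: r["ingested_at"]):
        if r["campaign_id"] is not None:
            latest[(r["campaign_id"], r["event_date"])] = r
    return list(latest.values())

def to_gold(silver_rows):
    """Aggregate for reporting: every view derives from the same cleaned
    rows, so weekly sums and monthly totals share one source of truth."""
    return {"monthly_impressions": sum(r["impressions"] for r in silver_rows)}

print(to_gold(to_silver(bronze)))  # {'monthly_impressions': 1150}
```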
To address the weekly versus monthly mismatch directly, I built a reconciliation framework as a scheduled Databricks job. The job compared summed weekly metrics against monthly aggregates, applied threshold-based alerting for mismatches, and stored results in a reconciliation audit table. Finally, I introduced data finalization logic: a 48 to 72 hour latency window for late API updates, after which records were marked as final. The Gold layer was built only from finalized data, which eliminated metric drift.
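The reconciliation step can be sketched as a single comparison function. The tolerance, metric values, and audit-record shape below are illustrative; the production version ran as a scheduled Databricks job and wrote its results to an audit table:

```python
# Sketch of the weekly-vs-monthly reconciliation check described above.

def reconcile(weekly_totals, monthly_total, tolerance=0.001):
    """Compare summed weekly metrics against the monthly aggregate and
    return an audit record; drift beyond the tolerance raises an alert."""
    weekly_sum = sum(weekly_totals)
    drift = abs(weekly_sum - monthly_total) / max(monthly_total, 1)
    return {
        "weekly_sum": weekly_sum,
        "monthly_total": monthly_total,
        "drift": drift,
        "alert": drift > tolerance,
    }

# Four finalized weeks that should roll up to the monthly number:
audit = reconcile([10_000, 12_500, 9_800, 11_700], 44_000)
print(audit["alert"])  # False: within tolerance, no escalation
```

Running the comparison only on finalized data is what makes the check meaningful; before the finalization window closes, late-arriving records would trigger false alerts.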
The results
The impact was measurable across every category that had consumed time in the original incident.
Weekly versus monthly mismatches, which had been frequent and required manual investigation after every refresh, were eliminated. Reconciliation effort dropped from 6 to 10 hours of manual validation to less than 1 hour with automated jobs and DLT checks. Debugging time fell from 10 to 15 hours to 2 to 4 hours using Delta time travel and DLT data quality logs. Pipelines that previously required multiple reruns per week became stable with built-in validation gates. Data quality issues that had been detected late through dashboards and analyst escalations were now caught early at the ingestion and transformation stages. Stakeholder escalations, which had been frequent, dropped to near zero.
The inefficiencies compound beyond any single incident. Every unnecessary reprocessing cycle, every duplicated dataset, and every unoptimized pipeline consumes compute, storage, and engineering time that could go toward building something new. This is the same pattern behind the hidden cost of wasteful data exports, where a single oversized weekly report was consuming resources almost nobody used.
The missing layer: data contracts
The deeper issue behind most data quality incidents is system ambiguity. If you cannot answer the question "when is this data final and immutable?" then you do not have a pipeline. You have continuous uncertainty. The finalization logic I built, a defined latency window after which records become immutable, is one form of a data contract.
A data contract is a formal agreement between a data producer and a data consumer that defines schema expectations, update frequency, finalization rules, and allowed mutations. It shifts the responsibility for data quality upstream, where problems are cheaper to catch and fix.
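As a concrete illustration, a minimal contract for a snapshot feed might be a structured spec that both producer and consumer validate against. Everything here, the field names, the 72-hour window, the dict format, is a hypothetical example; real-world contracts are typically YAML specs enforced by tooling such as dbt model contracts:

```python
# A minimal, hypothetical data contract expressed as a plain dict.
contract = {
    "producer": "campaign_api_ingest",
    "consumer": "weekly_reporting",
    "schema": {"campaign_id": str, "event_date": str, "impressions": int},
    "update_frequency": "daily",
    "finalization_hours": 72,        # records become immutable after this window
    "allowed_mutations": "append-only",
}

def conforms(record, contract):
    """Check a record against the contract's schema expectations."""
    schema = contract["schema"]
    return set(record) >= set(schema) and all(
        isinstance(record[field], expected) for field, expected in schema.items()
    )

ok = {"campaign_id": "c1", "event_date": "2024-03-04", "impressions": 1200}
bad = {"campaign_id": "c1", "event_date": "2024-03-04", "impressions": "1200"}
print(conforms(ok, contract))   # True
print(conforms(bad, contract))  # False: impressions arrived as a string
```

The value is less in the check itself than in where it runs: the producer validates before publishing, so schema drift is caught upstream rather than in a consumer's dashboard.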
Data contracts are still an emerging practice, but adoption is accelerating. Organizations like Miro have moved from embedding contracts in pipeline code to expressing them as structured specifications, and platforms including dbt, Confluent, and Databricks are building native contract support into their tooling. The 2024 State of Data Engineering reported that real-world success stories around data contracts are emerging even as the practice remains early-stage. Most teams have not implemented them yet, and that gap is where the hidden hours accumulate.
Data is where the work begins
At Tenjumps, our work begins at the foundational level, and that foundation is an organization's data. The cost of poor data quality is not just lost time. It is lost engineering momentum, reduced team morale, slower decision-making, and the gradual erosion of data as a strategic asset.
In our experience, one hour of prevention consistently saves 5 to 10 hours of downstream response. Yet most teams allocate almost nothing to proactive data quality work.
If your numbers change after a refresh, if weekly does not match monthly, if your team spends more time explaining data than using it, the problem is not the pipeline. It is a decision reliability problem.
