How Delta merge fixed duplicate insurance claims in Databricks pipelines

Databricks Delta merge strategy

Learn how a Databricks Delta merge strategy eliminated 90%+ of duplicate claims and restored reporting trust.

By Bhavya Venu, Data Engineering

When the numbers changed and no one could explain why

Picture this: a finance team at an insurance company pulls their Monday morning claims report. The total looks right. Two days later, a colleague runs the same report for the same date range and gets a different number. No one touched the data. No one changed the filters. The report just disagrees with itself.

This wasn't a one-time glitch. It was a recurring pattern. And with each unexplained discrepancy, the damage compounded, not just to the numbers, but to something harder to rebuild: trust in the data platform itself.

Business users began questioning every dashboard. QA teams were spending hours running manual SQL comparisons just to understand what had changed. UAT and SIT sign-offs stalled. The platform wasn't broken in any way that monitoring could detect: jobs completed successfully and pipelines ran on schedule, but the data coming out the other end was quietly, consistently wrong.

The pipeline was healthy. 
The data wasn't. And the gap between those two realities is where the real problem lived.

The root cause: Late adjustments in an append-only world

In insurance, data doesn't arrive all at once. Daily files carry original policy and claims records, but corrections and adjustments follow from a separate upstream feed, sometimes two to four days after the original submission.

The pipeline hadn't been designed with this reality in mind. It was built on a simple append model: each day's file loaded into Delta tables as new rows. There was no mechanism to recognize that an incoming record was a correction to something already loaded. When a corrected claim arrived on Wednesday with the same claim_id as a record loaded on Monday, the pipeline treated it as a new claim and inserted it again.

Over time, the Delta tables accumulated multiple rows for the same claim. Report totals climbed. Record counts inflated. And because every job completed without errors, nothing in the monitoring layer raised an alarm.

The technical problem: Append-only loads and late claim adjustments

This section walks through the engineering problem in detail: how it manifested, why it was difficult to detect, and what made it a particularly common pitfall in insurance data pipelines.

Pipeline architecture: What was in place

The pipeline ingested insurance policy, claims, and member eligibility data from daily source files into Databricks Delta tables. Two upstream systems fed the pipeline: one delivered original records, the other delivered adjustment files, on a separate, asynchronous schedule.

The load logic was append-only. Each daily file was processed and inserted as a batch of new rows. The assumption baked into this design, that each daily file represented the final version of that day's records, turned out to be fundamentally incorrect for insurance data.

Why late adjustments break append-only pipelines

Insurance claims frequently require correction after initial submission. Billing adjustments, eligibility changes, and coding corrections are normal operational events. When these arrive two to four days after the original record, an append-only pipeline has no way to reconcile them. The corrected record simply becomes a second row.

The composite business key — claim_id plus claim_line_number — was not being used as a deduplication mechanism. The pipeline didn't know these two rows represented the same real-world event.
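To make the failure concrete, here is a minimal, hypothetical illustration in plain Python of how an append-only load double-counts a corrected claim (the claim values are invented for the example):

```python
# Hypothetical daily batches: Monday's original claim and Wednesday's correction,
# both carrying the same composite business key (claim_id, claim_line_number)
monday_batch = [{"claim_id": "C100", "claim_line_number": 1, "claim_amount": 500.0}]
wednesday_batch = [{"claim_id": "C100", "claim_line_number": 1, "claim_amount": 450.0}]

# Append-only load: every incoming row becomes a new row, no key matching
table = []
for batch in (monday_batch, wednesday_batch):
    table.extend(batch)

total = sum(row["claim_amount"] for row in table)
print(len(table), total)  # 2 rows, 950.0 — one real claim counted twice
```

The correct answer is one row worth 450.0; the append-only table reports two rows worth 950.0, which is exactly the inflation the reports showed.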

Why standard monitoring missed it

Job-level monitoring confirmed that pipelines completed successfully. Row counts appeared healthy at the point of ingestion. The degradation was in the semantic integrity of the data, a category of quality issue that infrastructure monitoring is not designed to catch. Detecting it required comparing source record counts to target table counts and validating claim amount totals against control files, checks that weren't built into the pipeline's automated testing layer.

Impact: What broke and what it cost

The downstream consequences of the duplicate data problem touched every layer of the organization that relied on claims reporting:

  • Financial claim totals were inconsistent between report runs for the same date range, making it impossible to trust dashboard figures

  • Manual reconciliation work increased significantly, requiring QA and data engineering teams to run ad hoc SQL comparisons to diagnose discrepancies

  • Business users lost confidence in the data platform, escalating to questions about the reliability of the entire reporting infrastructure

  • UAT and SIT approval cycles were delayed as testers could not sign off on reports showing inconsistent totals

  • Support ticket volume related to claim mismatches increased, diverting engineering time from development to triage

The solution: Four engineering changes that fixed the pipeline

The fix wasn't a single patch; it was a redesign of how the pipeline handled incoming data, with four coordinated changes that addressed the root cause and added structural safeguards against recurrence.

1. Replace append-only loads with Delta Merge

The most important change was replacing the append-only INSERT pattern with a MERGE operation on the claims table. The merge uses the composite business key — claim_id and claim_line_number — to determine whether an incoming record is new or a correction to an existing one. If the record already exists, it updates the relevant fields. If it's genuinely new, it inserts.

MERGE INTO claims_target t
USING claims_staging s
  ON t.claim_id = s.claim_id
 AND t.claim_line_number = s.claim_line_number
WHEN MATCHED THEN
  UPDATE SET
    t.claim_amount          = s.claim_amount,
    t.claim_status          = s.claim_status,
    t.last_col_updated_date = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT *

This single change eliminated the primary duplication vector. Corrected records now update existing rows rather than creating new ones.
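The upsert semantics of that MERGE can be sketched in plain Python, keying the table on (claim_id, claim_line_number). This is an illustration of the logic, not Databricks code, and the field values are invented:

```python
def merge_batch(table, batch):
    """Upsert incoming rows by composite business key (claim_id, claim_line_number)."""
    for row in batch:
        key = (row["claim_id"], row["claim_line_number"])
        if key in table:
            # WHEN MATCHED: update the existing record in place
            table[key].update(claim_amount=row["claim_amount"],
                              claim_status=row["claim_status"])
        else:
            # WHEN NOT MATCHED: insert the new record
            table[key] = dict(row)
    return table

target = {}
merge_batch(target, [{"claim_id": "C100", "claim_line_number": 1,
                      "claim_amount": 500.0, "claim_status": "OPEN"}])
merge_batch(target, [{"claim_id": "C100", "claim_line_number": 1,
                      "claim_amount": 450.0, "claim_status": "ADJUSTED"}])
print(len(target))  # 1 — the correction updated the row instead of duplicating it
```

Running the same two batches as before now leaves exactly one row, carrying the corrected amount.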

2. Add a 7-day reprocessing window

Even with merge logic in place, the pipeline still needed a mechanism to capture adjustments that arrived outside the original load window. The solution was to extend each daily run to reload the previous 7 days of data, not just the current day's file.

This reprocessing window ensured that late-arriving adjustments would be picked up and merged into the correct records automatically, without requiring manual intervention or backfill jobs.
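One way to express the rolling window, assuming source files are partitioned by a load date, is to recompute the cutoff on every run. This is a sketch with an invented function name, not the pipeline's actual code:

```python
from datetime import date, timedelta

def reprocessing_window(run_date, days=7):
    """Return the inclusive list of dates to reload on this run, newest first."""
    return [run_date - timedelta(days=d) for d in range(days)]

window = reprocessing_window(date(2024, 3, 20))
print(window[0], window[-1])  # reloads 2024-03-20 back through 2024-03-14
```

Each daily run then stages all files whose load date falls inside the window and merges them, so a correction arriving four days late is still picked up automatically.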

3. Build pre-publication data quality checks

Before final tables were promoted for reporting consumption, the pipeline now executes three automated validation checks:

Duplicate key detection:

SELECT claim_id, claim_line_number, COUNT(*) AS cnt
FROM claims_target
GROUP BY claim_id, claim_line_number
HAVING COUNT(*) > 1;

Source-to-target record count comparison:

SELECT
  (SELECT COUNT(*) FROM claims_staging) AS source_count,
  (SELECT COUNT(*) FROM claims_target)  AS target_count;

Claim amount total validation:

SELECT SUM(claim_amount) FROM claims_target;

These checks run automatically after each load. If any check fails a threshold, the pipeline halts before data reaches reporting tables, preventing bad data from propagating downstream.

4. Audit table for run-level traceability

A lightweight audit table was added to track key metrics for every pipeline run: run date, source record count, rows inserted, and rows updated. This made debugging significantly faster by providing a queryable history of what happened in each execution.

-- Example audit log insert (values sourced from pipeline variables)
INSERT INTO pipeline_audit (run_date, source_count, inserted_rows, updated_rows)
VALUES (current_date(), 12000, 300, 150);
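In practice the inserted and updated counts come from the merge operation's own metrics; a sketch of assembling the audit row before the insert above (the helper name is hypothetical, the fields match the example table):

```python
from datetime import date

def build_audit_row(source_count, inserted_rows, updated_rows, run_date=None):
    """Assemble one audit record per pipeline run for the pipeline_audit table."""
    return {
        "run_date": run_date or date.today(),
        "source_count": source_count,
        "inserted_rows": inserted_rows,
        "updated_rows": updated_rows,
    }

row = build_audit_row(12000, 300, 150, run_date=date(2024, 3, 20))
print(row["inserted_rows"] + row["updated_rows"])  # 450 rows touched this run
```

Querying this table across runs makes it immediately visible when a run's insert count spikes, which is exactly the signature of the original duplication bug.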

Results: What changed after the fix

The impact of the changes was measurable and immediate following deployment into the test environment:

  • Duplicate claims reduced by more than 90%, confirmed through the duplicate key validation queries

  • Report totals stabilized: the same report run on Monday and Wednesday now returned the same result for the same date range

  • Manual SQL reconciliation work dropped significantly, as the automated checks replaced the ad hoc investigation process

  • Business teams regained confidence in dashboard data, with stakeholders able to rely on published report figures

  • QA defects related to claim total mismatches declined noticeably, accelerating UAT approval cycles

  • Support tickets for claim data discrepancies decreased as the source of recurring mismatches was eliminated

Key lesson for data engineers working with insurance data

In insurance data pipelines, late corrections are not edge cases; they are expected behavior. Never design a pipeline that assumes daily source files are final.

The append-only pattern is appropriate for immutable, event-driven data. It is not appropriate for insurance claims, policy records, or member eligibility data, where corrections and adjustments are a routine part of the data lifecycle.

Building merge logic and reprocessing windows into the pipeline from the start, and adding pre-publication data quality checks, is not over-engineering. It is the minimum viable architecture for insurance data that needs to be trusted.

Engineering checklist for insurance claims pipelines

  • Use MERGE (upsert) logic based on composite business keys, not append-only inserts, for claims and policy tables

  • Include a rolling reprocessing window (5–7 days minimum) to capture late-arriving adjustments automatically

  • Add pre-publication duplicate key checks as a pipeline gate — do not allow duplicate keys to reach reporting tables

  • Validate source-to-target record counts and financial totals as part of every pipeline run

  • Implement an audit table to support debugging and provide a run-level history of insertions and updates

  • Treat data quality monitoring as a first-class engineering concern, separate from job-level infrastructure monitoring

Contact our Data Engineering team to explore how Databricks can support your transformation.

Frequently Asked Questions

Why do my insurance reports show different totals for the same date range?

This is almost always a sign that your data pipeline is using append-only load logic for tables that receive late corrections. When adjustment records arrive days after the original claim and are inserted as new rows rather than used to update existing ones, the same claim gets counted multiple times. Each time the pipeline runs, the total climbs even though the underlying business data has not changed. Switching to a merge-based load strategy using a composite business key (such as claim_id plus claim_line_number) resolves this at the source.

When should a data pipeline use MERGE instead of append (INSERT) for claims data?

Use append-only loads only for data that is genuinely immutable: event logs, audit trails, or timestamped transactions that are never corrected after the fact. For insurance claims, policy records, and member eligibility data, corrections are a routine part of the data lifecycle. Any table that receives updates or corrections from upstream systems should use MERGE logic with a reliable business key. The cost of implementing merge upfront is far lower than the cost of remediating duplicate data after it propagates into reporting tables.

How does Databricks Delta Lake handle late-arriving data?

Delta Lake supports ACID transactions and MERGE operations natively, making it well-suited for handling late-arriving data. The MERGE INTO command lets you define exactly what happens when an incoming record matches an existing one: update specific columns, leave others unchanged, or apply conditional logic. Combined with a rolling reprocessing window (reloading the last 5–7 days on each pipeline run), Delta Lake provides a robust foundation for managing data that does not arrive in perfect chronological order.

Can a data pipeline appear healthy in monitoring while still producing bad data?

Yes, and this is one of the most common and costly misunderstandings in data engineering. Infrastructure monitoring confirms that jobs completed and pipelines ran on schedule. It does not confirm that the data produced is semantically correct. Duplicate records, incorrect totals, and stale values can all accumulate silently while every job status shows green. Data quality monitoring (validating record counts, checking for duplicate keys, and comparing financial totals against source control files) is a separate discipline that must be built into the pipeline as a first-class concern.

What is the right reprocessing window for insurance claims pipelines in Databricks?

A 7-day rolling window is a practical starting point for most insurance claims pipelines. In the case documented here, late adjustment files typically arrived 2–4 days after the original submission. A 7-day window provided a sufficient buffer to capture the majority of corrections automatically, without requiring manual backfill jobs. The right window for your environment depends on the SLA of your upstream adjustment feed; work with source system owners to understand the maximum lag before setting this parameter.

How do you rebuild business trust after a data quality incident in reporting?

Trust is rebuilt through consistency and transparency, not just through a technical fix. After resolving the underlying pipeline issue, the most effective steps are: making report totals demonstrably stable across multiple runs, communicating clearly to business stakeholders what caused the discrepancies and what changed, and introducing visible data quality indicators, such as audit logs and validation run summaries, that give non-technical users ongoing confidence in the numbers they see. A single clean report does not restore trust. A sustained pattern of accurate, stable output does.