Responsible data engineering

Cloud cost optimization: the 50GB report nobody reads

Written by

Bhavya Venu
Data Engineering

Every Monday morning, a healthcare system generates a 50GB claims export. It gets emailed to 47 people. Three of them open it. One person actually uses it.

The pipeline runs successfully. Monitoring shows green across the board. But the compute, storage, network bandwidth, and human time spent managing this process are all spent producing something almost nobody needs. Multiply this by every scheduled export in your organization, and the waste adds up fast.

This is the first article in a series on Responsible Data Engineering: how Tenjumps thinks about building systems that respect financial resources, environmental impact, and the communities where data infrastructure operates.

The anatomy of report waste

Cloud cost optimization starts with understanding what "mass volume data exports" actually looks like in practice. In healthcare and insurance environments, the patterns are remarkably consistent.

Weekly or daily full-database dumps run when incremental processing would yield the same result. "Just in case" reports execute on schedule indefinitely because no one remembers who requested them or why. Excel exports containing millions of rows crash before anyone can open them. Distribution lists that haven't been reviewed in years continue receiving files no one asked for.

These reports accumulate like technical debt. Nobody turns them off because nobody owns them. From an engineering perspective, the jobs look healthy. They complete on time, monitoring shows no errors, and the pipeline moves on to the next scheduled run.

After more than four years working with insurance and healthcare data systems, I have seen this pattern repeatedly. A significant share of automated reports is either unused or dramatically oversized for their actual purpose. In one reporting pipeline, a weekly job generated a 40-50GB claims export every Monday morning, containing detailed records: member information, claim lines, procedure codes, and billing amounts. The export was automatically distributed to a large list across finance, operations, and analytics. Very few people actually needed the full dataset. Many preferred dashboards or smaller extracts. Some couldn't even open the file.

The real cost of a wasteful export

Cloud cost optimization becomes tangible when you break down where the waste occurs in a single report.

Compute costs cover the cluster runtime to generate the export. Storage costs accumulate as historical exports are archived for months. Network and bandwidth costs grow with every distribution cycle. Human time goes toward managing, debugging, and explaining the reports. And the opportunity cost is real: those same resources could power something useful.
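That breakdown can be turned into a rough back-of-the-envelope model. Every rate below is a placeholder assumption, not real cloud pricing; substitute your own compute, storage, egress, and payroll figures.

```python
# Rough monthly cost model for one scheduled export.
# All rates are placeholder assumptions -- plug in your own figures.

def monthly_export_cost(
    runs_per_month: int,
    compute_hours_per_run: float,
    gb_stored_cumulative: float,
    gb_transferred_per_run: float,
    human_hours_per_month: float,
    compute_rate: float = 5.0,    # $/cluster-hour (assumed)
    storage_rate: float = 0.023,  # $/GB-month (assumed)
    egress_rate: float = 0.09,    # $/GB transferred (assumed)
    hourly_wage: float = 60.0,    # $/hour of human time (assumed)
) -> dict:
    costs = {
        "compute": runs_per_month * compute_hours_per_run * compute_rate,
        "storage": gb_stored_cumulative * storage_rate,
        "network": runs_per_month * gb_transferred_per_run * egress_rate,
        "human": human_hours_per_month * hourly_wage,
    }
    costs["total"] = sum(costs.values())
    return costs

# Example: a weekly 50GB export with ~45 minutes of cluster time per run,
# a year of archived copies, 47 recipients, and 2 hours of monthly upkeep.
cost = monthly_export_cost(
    runs_per_month=4,
    compute_hours_per_run=0.75,
    gb_stored_cumulative=50 * 52,
    gb_transferred_per_run=50 * 47,
    human_hours_per_month=2,
)
print({k: round(v, 2) for k, v in cost.items()})
```

Even with conservative rates, the network and human-time lines usually dominate, which is why trimming the recipient list and the file size pays off faster than tuning the cluster.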

When we redesigned a claims reporting process that followed this pattern, the results were measurable:

  • Data volume processed dropped from 40-50GB per run to 5-8GB

  • Compute time fell from 40-45 minutes to 10-12 minutes, a reduction of roughly 70-80 percent

  • Storage impact decreased because only curated datasets were retained instead of raw full exports

  • Users received smaller, relevant datasets or dashboards instead of massive files that they couldn't open

The techniques that drove those results are straightforward. Delta Lake enables incremental processing so pipelines handle only changed data. Parameterized reporting replaces blanket exports with targeted queries. Usage analytics identify which reports are actually opened. Data lifecycle policies in Unity Catalog automatically enforce retention rules. None of this is exotic engineering. It is disciplined engineering.
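In Delta Lake, incremental processing is typically driven by the change data feed or a MERGE statement; the underlying idea is a high-water mark. A minimal plain-Python sketch, with hypothetical record and column names:

```python
# High-water-mark incremental processing: each run handles only rows
# changed since the last successful run, instead of the full table.
# (Delta Lake's change data feed serves the same purpose at scale.)

def incremental_extract(rows, last_watermark):
    """Return only rows newer than the watermark, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark

claims = [  # hypothetical claim records
    {"claim_id": "A1", "updated_at": "2024-01-01"},
    {"claim_id": "A2", "updated_at": "2024-01-08"},
    {"claim_id": "A3", "updated_at": "2024-01-09"},
]

# A previous run covered everything through Jan 7, so this run
# touches only the two newer rows, not the whole table.
changed, watermark = incremental_extract(claims, "2024-01-07")
print(len(changed), watermark)  # 2 2024-01-09
```

Persisting the watermark between runs is what turns a full-database dump into a pipeline that processes only changed data.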

The environmental impact nobody talks about

Data has a physical footprint. Every query runs on a server. Every server requires electricity and cooling. Every data center uses water. Every unnecessary export, every stored-forever dataset, every full database dump that could have been incremental contributes to that demand.

Healthcare data compounds the problem because it is frequently duplicated across systems for reporting, analytics, and operational workflows. When those datasets are not managed carefully, the infrastructure footprint grows quickly.

Communities near data centers feel this directly through increased power demand, increased water use, and strained infrastructure. Responsible data engineering means designing pipelines that process only the data that is actually needed and avoid unnecessary duplication or long-term storage of unused data.

Reducing a 50GB weekly export to 5GB is not just a budget win. It is a measurable reduction in energy consumption. Multiplied across every wasteful process in an organization, the environmental impact is significant.

What responsible data management looks like in practice

The Tenjumps approach to responsible data engineering starts with a simple question: Does this report need to exist?

From there, the framework is consistent. Right-size everything by querying only what is needed and storing only what is necessary. Build in accountability through usage tracking and automated lifecycle policies. Design for efficiency by defaulting to incremental over full and on-demand over scheduled.

In practice, this means reviewing query history and pipeline logs in Databricks to identify jobs processing large datasets without clear justification. It means checking BI usage analytics to see how often reports are actually opened. It means auditing scheduled pipelines and distribution lists that have run unchallenged for years.
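One way to surface those jobs is to aggregate run logs by pipeline and sort by total compute. A minimal sketch over hypothetical log records; in Databricks the same data would come from query history or billing system tables:

```python
from collections import defaultdict

# Hypothetical pipeline run log: (job_name, runtime_minutes, gb_processed)
runs = [
    ("claims_full_export", 45, 50), ("claims_full_export", 42, 48),
    ("member_dashboard_refresh", 8, 3), ("finance_summary", 5, 1),
]

# Total compute and data volume per job.
totals = defaultdict(lambda: {"minutes": 0, "gb": 0})
for job, minutes, gb in runs:
    totals[job]["minutes"] += minutes
    totals[job]["gb"] += gb

# Heaviest first -- these are the jobs to question with stakeholders.
for job, t in sorted(totals.items(), key=lambda kv: kv[1]["minutes"], reverse=True):
    print(f"{job}: {t['minutes']} min, {t['gb']} GB")
```

Cross-referencing this list against BI open rates is what separates a heavy job that earns its cost from one that runs for nobody.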

And it means talking to stakeholders. These conversations often reveal that users only need aggregated summaries or filtered subsets, not the full raw dataset. When we suggested reducing a large claims export, some teams initially pushed back, worried about losing access. We addressed this by showing usage data and working with them to understand what they actually needed. Once users saw that their reporting needs were still met through curated datasets and dashboards, resistance decreased.

This is not just cost-cutting. It is a cultural shift from data hoarding to data stewardship.

How to audit your own data waste

If you are a data engineer reading this, start here:

  1. Inventory your scheduled reports and exports. What is running automatically right now?

  2. Track actual usage. Which reports are opened? Which queries are executed? Tools like Power BI provide usage metrics that answer this directly.

  3. Identify the heavy hitters. What consumes the most compute and storage? A simple query against your execution history will surface the worst offenders.

  4. Question longevity. What data is stored "just in case" versus actually accessed?

  5. Calculate the real cost. Compute + storage + human time spent managing the process.

Start with the worst offenders. The 80/20 rule applies: a small number of wasteful processes likely account for the majority of unnecessary spend. In healthcare organizations, the first place to look is any scheduled claims or census export that distributes full datasets to a broad list of recipients.
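The 80/20 cut above can be computed directly: sort processes by monthly cost and take the smallest prefix that covers 80 percent of total spend. The cost figures here are made up for illustration:

```python
def top_offenders(costs, threshold=0.80):
    """Smallest set of processes accounting for `threshold` of total spend."""
    total = sum(costs.values())
    cumulative, selected = 0.0, []
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        cumulative += cost
        if cumulative / total >= threshold:
            break
    return selected

monthly = {  # hypothetical monthly cost per scheduled process, in dollars
    "claims_full_export": 1040, "census_dump": 610,
    "ops_excel_report": 120, "finance_summary": 45, "audit_snapshot": 30,
}
print(top_offenders(monthly))  # ['claims_full_export', 'census_dump']
```

In this sketch, two of five processes account for roughly 89 percent of spend, so fixing just those two captures most of the savings.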

The bigger picture

Data engineering decisions have ripple effects: financial, environmental, and social. Every pipeline we build either respects those resources or wastes them. Responsible data engineering is not a separate initiative at Tenjumps. It is how we build.

Next in this series, we will look at data governance as infrastructure, not overhead, and why treating governance as an afterthought costs more than building it in from the start.

What is the most wasteful data process you have encountered? Let's talk about fixing it.
