How to audit six months of return fraud without hiring a data team
You have a hunch. Return rate is creeping up, refund spend is higher than last quarter, and two chargebacks landed last month that felt a lot like coordinated attempts. You want to know: has this been happening for a while, and you're only noticing it now?
The honest answer most merchants arrive at is that Shopify doesn't help you look backwards. The native return view is a rolling forward feed. Fraud detection apps score new returns as they come in. Neither one tells you what your last six months actually looked like if you'd been scoring all along.
Running that retrospective yourself is a real engineering project. Below is what it actually takes, and why most small teams quietly give up two days in.
What a proper retrospective fraud audit requires
If you wanted to recreate six months of risk scores by hand, the shopping list looks like this.
1. Every return, with full order context
Not just the return row. You need the order it came from, the customer who placed it, the line items, the discount codes, the fulfillment timeline, and the shipping address. A return in isolation tells you almost nothing. A return on a discounted order from a brand-new account shipping to an address that appears on four other accounts is a different story entirely.
In practice that means pulling from at least four Shopify Admin API endpoints (returns, orders, customers, fulfillments) and joining them in something that isn't a spreadsheet. Sixty thousand rows of returns with their related orders is perfectly normal for a mid-market store, and it will not load cleanly into Excel.
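To make the join concrete, here's a minimal sketch in Python. Everything is illustrative: it assumes the returns, orders, and customers have already been pulled from the Admin API into plain dicts, and field names like `order_id` and `shipping_address` are placeholders, not the API's actual shapes.

```python
def join_return_context(returns, orders, customers):
    """Attach order and customer context to each bare return row.

    Assumes records were already fetched into plain dicts keyed by
    their Shopify IDs; field names are illustrative placeholders.
    """
    orders_by_id = {o["id"]: o for o in orders}
    customers_by_id = {c["id"]: c for c in customers}
    joined = []
    for r in returns:
        order = orders_by_id.get(r["order_id"])
        if order is None:
            continue  # orphaned return: log it rather than guess
        customer = customers_by_id.get(order.get("customer_id"), {})
        joined.append({
            **r,
            "discount_codes": order.get("discount_codes", []),
            "shipping_address": order.get("shipping_address"),
            "customer_created_at": customer.get("created_at"),
        })
    return joined
```

The in-memory dict lookups are the point: at sixty thousand rows, anything that re-scans a list (or a spreadsheet) per return is quadratic and will crawl.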
2. Every chargeback in the same window
Chargebacks are the ground truth of return fraud. A chargeback means a customer disputed a charge they probably shouldn't have, and the bank agreed. But Shopify stores chargebacks in a separate domain (shopifyPaymentsAccount.disputes) with its own pagination, its own ID format, and its own status machine. You pull them, then link them back to orders by matching the order reference inside the dispute payload.
Two common gotchas. Disputes don't carry the return ID directly, so a dispute on an order that had both a partial refund and a return needs careful joining to attribute it to the right one. And you'll find disputes on orders that had no return at all. Those are worth looking at, but they don't belong in the return-fraud audit itself.
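Since the only bridge is the order reference inside the dispute payload, the linking step reduces to a join through the order plus a split for the no-return disputes. A sketch, again with illustrative field names:

```python
def link_disputes(disputes, returns):
    """Split disputes into (linked to a return, no return on the order).

    Disputes carry only an order reference, not a return ID, so we
    join through the order. Field names are illustrative.
    """
    returns_by_order = {}
    for r in returns:
        returns_by_order.setdefault(r["order_id"], []).append(r)
    linked, no_return = [], []
    for d in disputes:
        rs = returns_by_order.get(d["order_id"], [])
        if rs:
            linked.append((d, rs))  # may be several returns on one order
        else:
            no_return.append(d)  # interesting, but outside this audit
    return linked, no_return
```

Keeping the no-return bucket separate instead of discarding it means you can still eyeball those disputes later without contaminating the return-fraud numbers.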
3. Customer-level aggregates across the whole window
The interesting signals only show up when you look at a customer's full pattern, not a single return. "How many times has this address been used by different accounts?" or "how long between this customer's account creation and their first refund?" are questions you have to build aggregates for. For a six-month window with a few thousand customers, that's hundreds of thousands of aggregated facts.
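A stripped-down version of that aggregation pass, assuming the joined rows from earlier. Only two signals are shown (return count and how many distinct customers share a shipping address); a real profile would carry many more:

```python
from collections import defaultdict

def build_customer_profiles(joined_returns):
    """Roll a window of joined returns up into per-customer aggregates.

    Illustrative signals only: return count, total refunded, and the
    number of distinct customers seen on each shipping address.
    """
    profiles = defaultdict(lambda: {"returns": 0, "refunded": 0.0})
    addr_to_customers = defaultdict(set)
    for r in joined_returns:
        p = profiles[r["customer_id"]]
        p["returns"] += 1
        p["refunded"] += r.get("refund_amount", 0.0)
        if r.get("shipping_address"):
            addr_to_customers[r["shipping_address"]].add(r["customer_id"])
    # second pass: attach the shared-address count to each profile
    for r in joined_returns:
        shared = len(addr_to_customers.get(r.get("shipping_address"), set()))
        p = profiles[r["customer_id"]]
        p["address_shared_by"] = max(p.get("address_shared_by", 1), shared)
    return dict(profiles)
```

Note the two passes: shared-address counts can't be known until every return has been seen, which is exactly why this signal is invisible in any per-return view.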
4. A scoring model that weights the signals sensibly
Raw counts aren't a score. "Customer has returned 3 times" is meaningless without context. Is that 3 out of 4 orders (80% return rate, probably abuse) or 3 out of 30 (healthy)? You need a model that weighs signals against each other, caps extreme values, and produces a number you can sort on.
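A toy version of what "weighs signals against each other and caps extreme values" means in practice. The weights here are placeholders, not tuned values; picking real ones is exactly the week-long argument described below:

```python
def risk_score(orders_count, returns_count, address_shared_by):
    """Toy weighted score: normalize each signal to 0..1, cap it,
    then combine. Weights are illustrative, not tuned values."""
    return_rate = returns_count / max(orders_count, 1)
    rate_signal = min(return_rate, 1.0)
    # sharing an address with 5+ other accounts saturates the signal
    shared_signal = min((address_shared_by - 1) / 4, 1.0)
    return round(0.7 * rate_signal + 0.3 * shared_signal, 3)
```

This is the context point from above made literal: 3 returns out of 4 orders scores far higher than 3 out of 30, even though the raw count is identical.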
Most teams who try to build this themselves end up with a brittle rules engine and a week-long argument about threshold values.
5. A way to keep it current as new returns come in
One-time audits are useful. But the minute you finish, the next week of returns hasn't been scored. Either you accept that the audit is a snapshot that goes stale immediately, or you build the whole pipeline to run continuously. Which is a second project, usually bigger than the first.
Why most teams quietly give up
The tools exist. Shopify's APIs are well-documented. Postgres is free. You can hire a contract data engineer for three weeks and have all of this running.
In practice, here's what usually happens:
- Week 1. CSV exports from Shopify admin, a spreadsheet full of VLOOKUP formulas, and a growing sense that the scoring logic is going to need its own document.
- Week 2. A proof-of-concept script that pulls from the API, joins the data, and calculates one or two signals. It works on fifty rows and times out on five thousand.
- Week 3. You realize the chargeback API is a separate integration, the rate limits are tighter than you thought, and nobody on the team knows how to pick thresholds that aren't just a guess.
- Week 4. The project gets shelved "until we have someone on staff who can own this."
The part everyone underestimates isn't the code. It's the judgment calls about which signals to trust, how to weight them, and how to avoid reacting to statistical noise from small samples. Those calls are worth hiring for if you're going to make them constantly across every merchant. For a single store, it's rarely the best use of three weeks of engineering time.
What the automated version looks like
Here's the shape of the answer when a tool handles it for you, so you can judge whether it's worth building vs. buying.
On install, the tool should do the following in order, without you touching anything:
- Pull every return in a configurable window (three, six, or twelve months) in paginated batches, storing the original Shopify GIDs so duplicate webhooks never create duplicates later.
- Run the same scoring model against those historical returns that will score new returns going forward. If the model changes the day after install, the old scores are stale, so the store keeps every input snapshot, not just the final score, and can rescore on demand.
- Pull every Shopify Payments dispute in the same window, link each one to the return it came from where possible, and mark the linked return as confirmed fraud (or closer to it). Those confirmed outcomes are the training signal for every weight adjustment that happens later.
- Aggregate the historical returns into per-customer profiles so the very first new return that lands post-install already has months of context attached to the customer. Repeat rate, average refund amount, velocity, shared-address clusters, all of it.
- Run aging inference. If a return has been sitting with no refund for long enough that a chargeback is effectively impossible, classify it as legitimate without asking you. That one pass alone gives the scoring model thousands of labels it would otherwise never get.
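The aging-inference pass in that last bullet is simple to state precisely. A sketch, where the 120-day window is an assumption standing in for the card networks' outer dispute deadline, not a number from any vendor:

```python
from datetime import datetime, timedelta, timezone

DISPUTE_WINDOW_DAYS = 120  # assumption: past the networks' dispute deadline

def label_by_aging(returns, disputed_return_ids, now):
    """Label returns old enough that a chargeback can no longer arrive.

    Sketch only: returns linked to a dispute are confirmed fraud,
    returns past the window with no dispute are labeled legitimate,
    and everything in between stays unlabeled.
    """
    cutoff = now - timedelta(days=DISPUTE_WINDOW_DAYS)
    labels = {}
    for r in returns:
        if r["id"] in disputed_return_ids:
            labels[r["id"]] = "confirmed_fraud"
        elif r["created_at"] < cutoff:
            labels[r["id"]] = "legitimate"
    return labels
```

One pass like this is what turns an unlabeled historical pile into training signal: the disputes give you positive labels, the aged-out returns give you negatives.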
The whole chain should be observable. You should see "backfill scored 14,283 returns in 6m 12s, linked 47 disputes, auto-labeled 11,204 as legitimate based on aging" in a log you can grep.
And critically, none of this should consume your live scoring quota. Backfill is an install-time operation, not a metered one.
What you can actually do with the output
Once you have six months of scored returns plus the disputes linked up, a few questions become easy to answer that were previously impossible.
"Which customers have we refunded the most, ranked by risk score rather than refund amount?" Refund amount alone surfaces your best customers too. Score-weighted refund spend is the actual fraud loss metric.
"How many returns in the last quarter shared a shipping address with three or more other accounts?" The signature of coordinated fraud rings, and invisible in any per-return view.
"What's our chargeback rate on returns the model would have flagged as HIGH risk, if we'd had it running?" If the answer is meaningfully above the unfiltered rate, you've just validated the model's calibration on your own data.
"Are there specific discount codes that correlate with elevated return fraud?" Promotion abuse is one of the hardest patterns to see without aggregating across a long window.
These aren't weekly reports. They're the ones you look at once per install, to understand the shape of the problem you're solving. After that, forward-looking scoring takes over.
Takeaway
The reason retrospective fraud audits don't happen isn't that merchants don't care. It's that the manual version is a three-week project and the payoff is a snapshot that goes stale. If you're going to bring the data together anyway, the right moment is install day. The scoring model, the customer aggregates, and the chargeback sync can all happen in a single coordinated pass.
If you're evaluating fraud tools, one question worth asking every vendor: "on day one, what do I know about my last six months of returns that I didn't know before I installed you?" If the answer is "nothing, we start scoring the next new return," you'll spend your first quarter with the tool effectively blind to patterns that have been active for a year.
RefundSentry runs the backfill automatically the moment you install. Historical scoring, dispute sync, customer aggregates, aging inference, the full chain. You can watch it progress in a log, and every historical return is scored with the same model that scores your next new one. No SQL, no data team, no three-week project.
Worth trying if you've been putting off the audit you know you should run.