Return reason clustering: why your return data is hiding the real problems

You opened your returns dashboard this morning. 47 returns waiting. You scan the reasons column: "too small", "didn't fit", "runs small", "sizing was off", "not as expected", "wrong size", "smaller than pictured."

Seven different reasons. One underlying problem: your size chart is wrong.

Your reporting tool shows you seven separate data points. You chase seven separate tickets. You never fix the size chart.

This is why most Shopify merchants are flying blind on returns.

The flat-text problem

Shopify gives you a dropdown of return reasons (Defective, Wrong Item, Doesn't Fit, a few others) plus a free-text field where customers can write anything. In practice, many customers ignore the dropdown and type whatever they like. Or they fill in both, and the free text contradicts the dropdown.

The result is a returns dataset that's technically structured but semantically fragmented. The same issue gets written a dozen different ways:

"too small" / "runs small" / "sizing off" / "didn't fit my usual size": your size chart is misleading
"stitching came apart" / "broke on first use" / "poor quality" / "fell apart after one wash": a manufacturing defect in a specific product batch
"item not as described" / "color looks different" / "not what I expected": could be bad product photos, could be deliberate misrepresentation (fraud)

Looking at these as individual strings, you see noise. Grouped by meaning, you see signal.

How semantic clustering works

Reason clustering uses language model embeddings to measure the semantic similarity between return reasons. The test is not whether the words match but whether the meaning matches.

With a multilingual embedding model, "Runs small" and "Taille trop petite" flag as the same issue even though they share no words. "Defective zipper" and "zipper broke immediately" cluster together even though an exact-match search would treat them as different reasons.

Three stages:

Embed each return reason into a vector space. Each piece of text becomes a point in a high-dimensional space, where similar meanings sit close together.
Cluster nearby points. A density-based algorithm such as DBSCAN groups reasons that sit within a distance threshold of one another; k-means instead asks you to pick the number of clusters up front. A tight threshold surfaces only very similar reasons. A loose threshold groups broader themes.
Label each cluster. A language model reads a sample from each cluster and writes a plain-English label: "Sizing/Fit issues", "Quality defect, stitching", "Did not arrive (potential fraud)".
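The clustering stage can be sketched in a few lines of plain Python. This is a minimal illustration, not RefundSentry's implementation: it assumes the embeddings already exist (the 3-dimensional vectors below are invented for readability; real embeddings have hundreds of dimensions), and `cluster_by_threshold` is a hypothetical helper that behaves like DBSCAN with a minimum cluster size of one. The labeling stage is left out.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def cluster_by_threshold(vectors, max_distance):
    """Group points connected by a chain of neighbours within max_distance.

    Simplified relative of DBSCAN (every point counts as a core point).
    Returns one integer cluster label per input vector.
    """
    labels = [-1] * len(vectors)
    current = 0
    for start in range(len(vectors)):
        if labels[start] != -1:
            continue  # already assigned to an earlier cluster
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(vectors)):
                if labels[j] == -1 and cosine_distance(vectors[i], vectors[j]) <= max_distance:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy vectors, made up for this example (not output from a real model).
reasons = ["runs small", "too small", "zipper broke",
           "defective zipper", "stitching came apart"]
vectors = [
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.20],
    [0.15, 0.85, 0.30],
    [0.10, 0.50, 0.80],
]

tight = cluster_by_threshold(vectors, max_distance=0.05)
loose = cluster_by_threshold(vectors, max_distance=0.25)
for reason, t, l in zip(reasons, tight, loose):
    print(f"{reason:22} tight={t} loose={l}")
```

With the tight threshold, "stitching came apart" stays in its own cluster. With the loose one, it merges with the two zipper reasons into a broader quality theme, while the sizing reasons stay separate either way.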

What you get is a live map of your return problems, not a list of individual complaints.

Three use cases that actually move the needle

Sizing and fit issues pointing at bad size charts

The most common and most fixable problem clustering surfaces.

A merchant selling women's dresses sees returns trending up month over month. The raw data has 40 different reason strings. Clustering shows 62% of all returns for their "Linen Midi" SKU fall into a single "Runs small / sizing discrepancy" cluster.

The fix is obvious once you can see it. Update the size chart, add a "size up" note to the product page, or recut the pattern. Merchants who act on this data typically see 20% to 35% return-rate reductions on affected SKUs within two months.

Without clustering you'd see "too small" (12 returns), "sizing off" (9 returns), "runs small" (8 returns), and treat them as three separate low-volume complaints.
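The arithmetic is simple once the strings map to a cluster. A small sketch, assuming a hypothetical `cluster_of` mapping that the clustering step would produce:

```python
from collections import Counter

# Counts per raw reason string, taken from the example above.
raw_counts = Counter({"too small": 12, "sizing off": 9, "runs small": 8})

# Hypothetical output of the clustering step: raw string -> cluster label.
cluster_of = {
    "too small": "Runs small / sizing discrepancy",
    "sizing off": "Runs small / sizing discrepancy",
    "runs small": "Runs small / sizing discrepancy",
}

# Roll the raw counts up to cluster level.
cluster_counts = Counter()
for reason, count in raw_counts.items():
    cluster_counts[cluster_of[reason]] += count

print(cluster_counts.most_common())
```

Three complaints of 12, 9, and 8 returns become one issue with 29 returns, which is much harder to dismiss as low volume.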

Quality defect clusters pointing at product problems

Quality issues are expensive to miss. A defective batch from a supplier can generate returns for months before anyone notices the pattern.

Clustering catches it early. If 15 returns over three weeks all land in a "Stitching/construction defect" cluster, and those returns concentrate on items from a specific restock date, you have enough to open a supplier conversation. Or pull the inventory.

One outdoor gear merchant caught a defective zipper batch on a bestselling jacket because clustering flagged an unusual spike in defect-related returns on one colorway. Manual review of the raw reasons would have taken weeks. The cluster appeared within days of the first returns landing.

Coordinated fraud, where reasons turn into scripts

The use case most merchants don't think about until it's too late.

Fraud rings coach their members on what to say. When 8 accounts submit returns within 48 hours and all 8 use near-identical language ("Item did not arrive, tracking shows delivered but nothing was at my door"), that's not coincidence. That's a script.

Real customers describing the same experience vary their language. Legit "didn't arrive" complaints look like:

"Package says delivered but I checked everywhere and it's not here"
"Tracking shows delivered but nothing was left"
"Says it was dropped at the door, not there"

Similar meaning, different phrasing.

A fraud ring looks like the same sentence, repeated across 8 accounts. The semantic distance between those reasons is near zero.

Clustering flags this automatically. When a cluster of near-identical reasons forms faster than organic complaints would explain, especially alongside account-age, velocity, and IP signals, that's a strong indicator of coordinated fraud.
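A rough sketch of the scripted-language check follows. Several things here are assumptions made for illustration: the account IDs and timestamps are invented, `find_scripted_groups` is a hypothetical helper, the thresholds are arbitrary, and `difflib.SequenceMatcher` from the Python standard library stands in for embedding similarity. It compares characters rather than meaning, which is good enough to show the difference between copied text and organic variation.

```python
from datetime import datetime, timedelta
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return " ".join(text.lower().split())

def find_scripted_groups(returns, min_accounts=3, min_similarity=0.9,
                         window=timedelta(hours=48)):
    """Return sorted account lists whose reasons are near-identical within the window."""
    ordered = sorted(returns, key=lambda r: r["submitted_at"])
    used, flagged = set(), []
    for i, anchor in enumerate(ordered):
        if i in used:
            continue
        group = [i]
        for j in range(i + 1, len(ordered)):
            other = ordered[j]
            if other["submitted_at"] - anchor["submitted_at"] > window:
                break  # sorted by time, so everything later is outside the window
            if j in used:
                continue
            ratio = SequenceMatcher(None, normalize(anchor["reason"]),
                                    normalize(other["reason"])).ratio()
            if ratio >= min_similarity:
                group.append(j)
        accounts = {ordered[k]["account"] for k in group}
        if len(accounts) >= min_accounts:
            used.update(group)
            flagged.append(sorted(accounts))
    return flagged

# Invented sample data: three scripted returns and three organic ones.
script = "Item did not arrive, tracking shows delivered but nothing was at my door"
returns = [
    {"account": "acct-A", "submitted_at": datetime(2024, 3, 1, 9, 0), "reason": script},
    {"account": "cust-1", "submitted_at": datetime(2024, 3, 1, 10, 0),
     "reason": "Package says delivered but I checked everywhere and it's not here"},
    {"account": "acct-B", "submitted_at": datetime(2024, 3, 1, 21, 30), "reason": script + "."},
    {"account": "acct-C", "submitted_at": datetime(2024, 3, 2, 14, 0), "reason": script.lower()},
    {"account": "cust-2", "submitted_at": datetime(2024, 3, 5, 8, 0),
     "reason": "Tracking shows delivered but nothing was left"},
    {"account": "cust-3", "submitted_at": datetime(2024, 3, 9, 16, 0),
     "reason": "Says it was dropped at the door, not there"},
]

flagged = find_scripted_groups(returns)
print(flagged)
```

Only the three accounts that reuse the script are grouped. The organic complaints describe the same experience but differ too much in wording, and arrive too far apart, to be flagged. A text match like this is a prompt for manual review, not proof of fraud on its own.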

How this differs from Shopify's built-in reason tracking

Shopify's native returns show you the reason enum (Defective, Wrong Item, Doesn't Fit) plus the customer note. You can filter and sort by enum. Useful for basic reporting.

What you can't do natively:

Group semantically similar free-text reasons that use different words
Detect when the same reason phrase appears across unrelated accounts (fraud signal)
See which SKUs have reason clusters growing unusually fast this week
Identify reason language that historically correlates with higher fraud rates

Clustering operates on the layer beneath the structured data. It reads the meaning, not the category.

What good cluster reporting looks like

A useful dashboard surfaces four things.

Cluster name and size. "Sizing/fit issues, 34 returns (28% of total)" tells you immediately where to focus.

SKU breakdown within each cluster. Which products drive each cluster? One SKU with 20 sizing returns is a size-chart problem. Twenty SKUs each with 1 or 2 sizing returns is probably normal variance.

Trend over time. Is a cluster growing? A defect cluster doubling week over week is an active problem, not historical noise.

Fraud risk signals. When the reasons in a cluster are unusually similar, reading like copies of each other, flag those returns for manual review before processing refunds.
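The first three of these can be computed from a simple table of labeled returns. The sketch below is illustrative only: the rows, SKU names, and week labels are invented, and `cluster_summary` and `growing_clusters` are hypothetical helpers, not part of any Shopify or RefundSentry API.

```python
from collections import Counter, defaultdict

# Invented rows: (week, cluster_label, sku), one row per return.
rows = [
    ("2024-W10", "Sizing/fit issues", "LINEN-MIDI"),
    ("2024-W10", "Sizing/fit issues", "LINEN-MIDI"),
    ("2024-W10", "Sizing/fit issues", "WRAP-TOP"),
    ("2024-W10", "Quality defect, stitching", "JACKET-01"),
    ("2024-W11", "Sizing/fit issues", "LINEN-MIDI"),
    ("2024-W11", "Sizing/fit issues", "LINEN-MIDI"),
    ("2024-W11", "Sizing/fit issues", "LINEN-MIDI"),
    ("2024-W11", "Quality defect, stitching", "JACKET-01"),
    ("2024-W11", "Quality defect, stitching", "JACKET-01"),
    ("2024-W11", "Did not arrive", "WRAP-TOP"),
]

def cluster_summary(rows):
    """Cluster size, share of all returns, and top SKUs per cluster."""
    total = len(rows)
    sizes = Counter(label for _, label, _ in rows)
    by_sku = defaultdict(Counter)
    for _, label, sku in rows:
        by_sku[label][sku] += 1
    return {
        label: {"returns": n, "share": n / total,
                "top_skus": by_sku[label].most_common(3)}
        for label, n in sizes.items()
    }

def growing_clusters(rows, prev_week, this_week, factor=2.0):
    """Clusters whose weekly count grew by at least `factor`.

    Clusters with no returns in prev_week are skipped here; a brand-new
    cluster would need its own rule.
    """
    prev = Counter(label for week, label, _ in rows if week == prev_week)
    curr = Counter(label for week, label, _ in rows if week == this_week)
    return sorted(label for label, n in curr.items()
                  if prev[label] > 0 and n >= factor * prev[label])

summary = cluster_summary(rows)
for label, info in summary.items():
    print(f"{label}: {info['returns']} returns ({info['share']:.0%}), top SKUs {info['top_skus']}")
print("Growing:", growing_clusters(rows, "2024-W10", "2024-W11"))
```

In this toy data the sizing cluster is the largest and is concentrated on one SKU, while the stitching cluster is smaller but doubled week over week, so both deserve attention for different reasons.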

Getting started

Return reason clustering is available on RefundSentry Pro. Once you connect your Shopify store, clustering runs on your historical returns within minutes and keeps updating in real time as new returns come in.

The dashboard lives under Analytics > Return Reasons: cluster labels, SKU breakdowns, trend lines, and fraud risk flags, all derived from the free-text reasons your customers are already writing.

If your return rate is above 5%, there is almost certainly a fixable root cause hiding in your reason data. Clustering is how you find it.

Reason clustering is one of five analytics capabilities Shopify doesn't offer natively. For the full breakdown, see The return analytics Shopify doesn't give you.



RefundSentry Team
