Commissions · CTV · Volume 04

A CTV commercial, cut to fifteen versions, stopped working at eleven.

An outdoor brand tested fifteen edited versions of a single hero commercial across CTV inventory. The eleventh version was the point at which the audience response collapsed. What the drop revealed about creative fatigue — and about the assumptions underneath modern creative testing.

By Marlow Osoye · 13 min read · Commission · Volume 04

The brand is a mid-market UK outdoor and adventure company with roughly £34m of annual revenue and a paid media programme that had, in Q1 of last year, made a substantial commitment to connected-TV as part of a broader awareness-and-conversion campaign. The head of paid media — call her Ingrid — commissioned a single hero commercial, produced in-house with a production budget of roughly £180,000, intended to run across CTV inventory in a variety of edited lengths.

The base hero was a 60-second piece with a strong central creative concept, competent production values, and — Ingrid would say — one of the best pieces of brand video work the company had produced in years. The plan, agreed with the brand's paid social and CTV planners, was to test fifteen edited versions of the hero — 60s, 45s, 30s, 20s, 15s, and 10s, in various framings — across CTV inventory over the course of a Q2 window. The intent was to identify which specific edited version produced the strongest response, and to concentrate spend against it for the balance of the campaign.

What they tested

The fifteen versions were, on Ingrid's own description of them, meaningfully distinct rather than superficial variants. They differed in the scenes emphasised, the voice-over, the call-to-action placement, the music and, critically, the narrative arc of the piece. Version one used the base 60-second cut. Version two was a 45-second cut that opened on the emotional midpoint of the base. Version three was a 30-second cut structured around the strongest single scene. Versions four through fifteen were, broadly, further variations on these three approaches, with progressively tighter edits and progressively different narrative emphases.

The testing methodology was, by industry standards, unusually careful. Each version was run for a defined period in a defined geographic split, against defined audience segments, with substantially matched control samples. The measurement was primarily audience response, captured by a brand-tracking survey delivered to viewers in the days after they had seen the version, supplemented by direct-response indicators (site visits, product searches, and downstream conversion attributable via cookie or IP-based matching).

The results, for the first ten versions, produced a fairly clean data set. Versions one and two — the longer, more atmospheric edits — produced the strongest brand-tracking response. Versions three through six — the shorter, more concentrated edits — produced somewhat weaker brand-tracking response but stronger direct-response indicators. Versions seven through ten — further variations on the tight cuts — produced results broadly consistent with three through six.

Then version eleven ran, and the data set changed.

The collapse

Version eleven, on paper, was not obviously different from the versions preceding it. It was a 20-second cut, structured around a scene that had performed well in the earlier tests, with a slightly modified voice-over emphasising a particular product feature. In commissioning it, the team felt they were producing a stronger version, not a weaker one.

The audience response was, on the brand-tracking data, materially worse than any previous version. Aided brand recall dropped approximately 22% relative to the median of versions one through ten. Unaided brand recall dropped roughly 30%. Purchase intent — the metric the brand-tracking work was ultimately intended to measure — dropped approximately 18%. Direct-response indicators were also weaker: site visits attributable to CTV exposure dropped roughly 15%; product searches dropped 12%; downstream conversions dropped approximately 20%.
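The "relative to the median" figures above are simple ratios. A minimal sketch of the arithmetic, assuming per-version scores are available as plain numbers; the function name and the scores below are illustrative placeholders, not the brand's data:

```python
from statistics import median


def relative_drop(baseline_scores, new_score):
    """Fractional drop of new_score relative to the median of baseline_scores."""
    base = median(baseline_scores)
    if base == 0:
        raise ValueError("baseline median is zero; relative drop is undefined")
    return (base - new_score) / base


# Placeholder aided-recall scores for versions one through ten (hypothetical).
illustrative_baseline = [50, 52, 48, 50, 51, 49, 50, 50, 47, 53]
illustrative_v11 = 39  # hypothetical score for version eleven

drop = relative_drop(illustrative_baseline, illustrative_v11)
print(f"{drop:.0%}")  # 22%
```

Using the median rather than the mean keeps one unusually strong or weak early version from shifting the baseline.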

Ingrid's initial reading, when the version eleven data landed, was that the version was defective — that the creative had, in some subtle way, misfired. The team ran version twelve as a diagnostic, using a substantially different structural approach specifically to test whether the version eleven weakness was a version-specific problem. Version twelve produced a similar collapse. Versions thirteen, fourteen, and fifteen — all subsequently produced with progressively different creative approaches — showed the same weakness relative to the earlier versions.

By version fifteen, the campaign's aggregate response metrics were substantially worse than they had been at version ten, despite the individual creative continuing to score well in internal review and despite no obvious change in the platform-side conditions or the target audience.

"The variable that had changed, between version ten and version eleven, was not the specific version. The variable that had changed was the audience's cumulative exposure to the underlying campaign. We had, without realising it, exhausted the campaign's productive reach roughly at the point where version eleven went live."

What this revealed

The team's post-campaign analysis, run in the two months after the campaign concluded, produced a specific reading of the collapse. The variable that best predicted the version-eleven weakness was not the version itself but the target audience's cumulative frequency of exposure to the campaign as a whole across the previous ten versions.

On the target segment the campaign was running against, a defined demographic in a defined set of CTV inventory pools, the median audience member had, by the time version eleven ran, seen the campaign approximately 14 to 18 times across the previous versions. This was, in the team's subsequent view, well past the point at which additional exposure was producing productive brand response. Once the campaign had accumulated that level of cumulative exposure against the target audience, the creative variation from version to version was, in effect, immaterial.

The implication, which the team have now embedded in their creative testing methodology, is that the standard industry practice of testing multiple creative variants against the same audience segment produces reliable data only until the audience has accumulated enough cumulative exposure to the underlying campaign to be, in effect, fatigued. Past that point, the specific variant tested is not the variable being measured; the fatigue is. The team's readings of variants eleven through fifteen were, in retrospect, not measuring the variants at all.
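One way to act on this reading is to tag each test result with the exposure the audience had already accumulated, and to set aside the results taken past a chosen threshold. A minimal sketch, assuming each reading carries a median prior-exposure figure; the class, field names, threshold, and numbers are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class VariantReading:
    version: int
    prior_exposures: float  # median cumulative campaign exposures before this variant ran
    purchase_intent: float  # brand-tracking score recorded for this variant


def split_by_fatigue(readings, threshold):
    """Separate readings taken below the exposure threshold from those at or above it."""
    usable = [r for r in readings if r.prior_exposures < threshold]
    confounded = [r for r in readings if r.prior_exposures >= threshold]
    return usable, confounded


# Placeholder numbers for illustration only.
readings = [
    VariantReading(version=9, prior_exposures=11, purchase_intent=0.31),
    VariantReading(version=10, prior_exposures=13, purchase_intent=0.30),
    VariantReading(version=11, prior_exposures=15, purchase_intent=0.25),
]
usable, confounded = split_by_fatigue(readings, threshold=14)
print([r.version for r in usable])      # [9, 10]
print([r.version for r in confounded])  # [11]
```

The confounded readings are kept rather than discarded: they say little about the variants, but they are evidence about the fatigue itself.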

What Ingrid does now

The operational change Ingrid has since made is to add a fresh audience segment to every creative test past the tenth variant. Continuing to test additional variants against the same target audience produces increasingly unreliable data as fatigue accumulates, so she now switches to a new, matched audience segment that has not been exposed to the earlier variants. The data from the new segment is more reliable than data from the fatigued segment would have been.
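The rule can be written down as a simple assignment of variants to segments. A minimal sketch, assuming segments are prepared in advance as matched, unexposed pools; the function and segment names are hypothetical, and rotating again after every further ten variants is an extension of the rule rather than something the team has described:

```python
def segment_for_variant(variant_number, segments, variants_per_segment=10):
    """Return the audience segment a given variant should be tested against."""
    if variant_number < 1:
        raise ValueError("variant_number starts at 1")
    index = (variant_number - 1) // variants_per_segment
    if index >= len(segments):
        raise ValueError("no unexposed segment left for this variant")
    return segments[index]


segments = ["segment_a", "segment_b"]  # hypothetical matched segments
print(segment_for_variant(10, segments))  # segment_a
print(segment_for_variant(11, segments))  # segment_b
```

A variant-count cutoff is only a proxy for exposure. A team that tracks cumulative frequency directly could switch segments on that measure instead.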

The broader implication, which the industry has been slower to internalise, is that the standard creative-testing methodology — which most modern paid media teams treat as a general-purpose optimisation tool — has a specific range of validity that is set by the target audience's cumulative campaign exposure. Past that range, the tool produces noise dressed as signal. The teams that recognise this range and design their testing methodology around it produce, on our audit sample, better creative decisions than the teams that do not.

The specific number — approximately fifteen to twenty cumulative exposures before the fatigue collapse becomes measurable, on our sample — will vary by category, by audience, by campaign concept, and by platform. The teams that measure their own version of the number and design around it are, on our observation, running more effective testing programmes than the teams that operate without knowing what their own threshold is. Most teams, currently, do not know.
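Measuring that threshold can start from a table of response by cumulative exposure. A minimal sketch, assuming a team can bucket its brand-tracking respondents by exposure count; the function name, the baseline window, the 10% drop criterion, and the data are illustrative assumptions, not a method taken from the audit sample:

```python
def estimate_fatigue_threshold(response_by_exposure,
                               baseline_max_exposure=5,
                               drop_fraction=0.10):
    """Return the lowest exposure count whose response falls more than
    drop_fraction below the low-exposure baseline, or None if none does."""
    baseline_values = [
        score for count, score in response_by_exposure.items()
        if count <= baseline_max_exposure
    ]
    if not baseline_values:
        raise ValueError("no readings at or below baseline_max_exposure")
    baseline = sum(baseline_values) / len(baseline_values)
    floor = baseline * (1 - drop_fraction)
    for count in sorted(response_by_exposure):
        if count > baseline_max_exposure and response_by_exposure[count] < floor:
            return count
    return None


# Placeholder mapping of exposure count to mean response (hypothetical).
illustrative = {2: 0.30, 5: 0.30, 10: 0.29, 15: 0.28, 20: 0.24}
print(estimate_fatigue_threshold(illustrative))  # 20
```

The estimate is only as fine-grained as the exposure buckets, and it moves with the drop criterion, so it is better treated as a range to design around than as a single fixed number.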