Incrementality has been the industry buzzword for a while now, and while methodologies have evolved over time, many questions remain when it comes to measuring how incremental a campaign really is. Which method is best? How often should marketers test for incrementality? Is it possible (and/or desirable) to compare incrementality results?
Our panel at App Growth Summit Berlin gathered top marketers from Omio, Shpock, Blinkist and DeliveryHero to discuss their experiences measuring incrementality in User Acquisition & App Retargeting.
There was a general consensus in our panel that it all started with Retargeting. Indeed, Lars Engelbracht from Omio (former GoEuro) went as far as saying the concept first came up with retargeting on desktop, where the “kick-off” question, so to speak, was “if I already have these users, they’re my users, why do I have to pay for them again?” Especially when, as Trixie Ann Garcia from Shpock pointed out, some retargeting channels like push and email are practically free.
As Marichka Baluk from Blinkist rightly phrased it: “is there an actual added value to running retargeting campaigns on mobile?” Basically, am I able to change user behavior by showing ads?
A few years (and many incrementality tests) later, the general question seems to have shifted from “can I impact user behavior through app retargeting?” to “what is the best way of measuring that impact?”
This is no minor challenge; indeed, there’s a trade-off. A 100% scientific test would require pausing all other marketing activity to completely isolate the effect of the mobile display ads. This is not always possible, and even when it is, it’s hardly something marketers can do all the time. As Marichka Baluk from Blinkist pointed out, stopping all marketing activity is less than ideal:
“… in one of my previous jobs, we were actually trying to pause all the marketing activity in smaller markets, where we could afford it, for a certain period of time. [The idea was to] then start again to basically measure against the baseline (…) the issue there is that most of the channels do not recover so well (…) and then you cannot really ramp up to the level you were, and if you do that a couple of times you can kill all the campaigns and the optimizations.”
Even after cutting other marketing channels such as Facebook and Google, for big brands there are still other variables that can impact the campaign and its data.
“There’s so much background noise, you’d really need to cut all marketing activity, but even then, I think offline channels such as TV will have an impact on branding awareness and top funnel.” Tom Brooks, DeliveryHero
At the end of the day, it does seem to be about finding the best compromise between the most scientific set up, and the set up that supports the business goals. That said, there are some basic parameters that need to be considered for the test to be as scientific as possible.
It’s important to understand that incrementality tests answer only ONE question: am I able to change user behavior by showing a specific type of ad (in our context, mobile ads)?
Then there are different types of tests and analyses, which are useful for marketers seeking to compare the performance of certain creatives, or to better understand the performance of different segments. There are also experiments and predictive models to evaluate the impact of seasonality. But incrementality tests? Incrementality tests are only useful if you are trying to measure the impact of showing an ad to your users. So what goes into a typical, robust incrementality test?
There are different ways of defining the audience size to make sure the test is statistically significant. At Jampp, we estimate the number of exposed users needed using the organic event rate of the audience to be tested, together with statistical parameters like the Confidence Level and the MDE (Minimum Detectable Effect).
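To make the idea concrete, here’s a minimal sketch of that kind of sample-size estimate using the standard two-proportion formula. The function name, defaults, and example numbers are ours for illustration, not Jampp’s actual implementation:

```python
from math import sqrt, ceil
from statistics import NormalDist

def users_per_group(organic_rate, mde, confidence=0.95, power=0.80):
    """Rough sample size per group for a two-proportion test (normal approximation).

    organic_rate: baseline (organic) event rate of the audience, e.g. 0.02
    mde: minimum detectable effect, as a relative lift, e.g. 0.10 for +10%
    Returns the number of exposed users needed in EACH group.
    """
    p1 = organic_rate                      # Control: organic rate only
    p2 = organic_rate * (1 + mde)          # Treatment: organic rate + MDE
    z_a = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    pbar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * pbar * (1 - pbar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p2 - p1) ** 2)

# A 2% organic rate with a 10% relative MDE already requires on the order
# of tens of thousands of exposed users per group.
```

Note how quickly the required audience grows as the detectable effect shrinks — which is exactly why small channels and small markets struggle to produce conclusive results.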
“If the total share of installs coming from the mobile channel is 2–5%, any variable we could see, maybe it’s not going to be enough to indicate definitely that it has been a huge impact or not.” — Tom Brooks, DeliveryHero
To this point, we might add (to the delight of the Data Scientists in our team 😉) that daily or partial results don’t make sense for incrementality tests.
While different methodologies will tackle this in different ways, the key is to compare exposed users vs. non-exposed users. One way of achieving that, as explained by Trixie on the panel, is to divide the randomized audience into two groups: Test & Control. The first group is exposed to the marketing ads, while the second sees PSAs (Public Service Ads, i.e. unrelated ads).
As Lars Engelbracht pointed out, the PSAs do make the test more costly than on other channels. However, using PSAs for the control group (along with View-Through Attribution) allows marketers to compare the exposed users of both groups, thus ensuring that two “similarly reachable universes” are compared and eliminating all users that were out of reach.
Others go the way of Ghost Bidding. This method also compares a Treatment Group against a Control Group, but it doesn’t serve ads to the latter. Instead, it simulates auctions for the Control Group, taking into consideration the probabilities of winning and rendering. A machine-learning model is applied to the bids recorded for both groups in order to obtain predicted impressions in each set. These predicted impressions are used to build a Control of users who would have been exposed, so their organic behaviour can be tracked. That organic behaviour is then compared against the behaviour of the Treatment population, and the difference between the two is what is called “incrementality”.
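Conceptually, the bookkeeping looks something like this. This is a toy sketch with made-up names and data: a real system replaces the random probabilities with a trained win/render model over recorded bids:

```python
import random

random.seed(7)  # fixed seed so the toy data is reproducible

def predicted_exposed(bids, threshold=0.5):
    """Keep the users whose simulated auctions are predicted to win and render."""
    return {user for user, p_win in bids if p_win >= threshold}

# Hypothetical bid logs: (user_id, model-predicted win/render probability).
treatment_bids = [(f"t{i}", random.random()) for i in range(1000)]
control_bids = [(f"c{i}", random.random()) for i in range(1000)]

exposed = predicted_exposed(treatment_bids)  # users who actually saw ads
ghost = predicted_exposed(control_bids)      # users who *would have* seen ads

# Downstream, you compare the conversion rate of `exposed` against the
# organic conversion rate of `ghost`; the difference is the incrementality.
```

The point of the predicted-impression filter is symmetry: both sides of the comparison contain only users the campaign could realistically have reached.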
Facebook tests Conversion Lift with an “intent-to-treat” methodology. Everyone on Facebook, Instagram, and Audience Network is randomized into two groups: a test group (which will see the advertiser’s ads) and a control group (which will be held back from seeing any of the advertiser’s ads). The campaign includes people who are eligible to see the advertiser’s ads in both groups. In the test group, ads run as they normally would. In the control group, people don’t see the ads (but would have, had they not been part of the holdout). The study captures sales outcomes for both groups, and when it is complete, Facebook calculates the incremental impact by comparing the two.
Trixie Ann was the first to highlight the importance of ensuring the randomization of the sample.
“One of the most important things that you should consider when you do incrementality tests is the randomization of the audiences, because it could well be that you have very high incrementality, but there was a certain bias to the two groups that you have. This is one of the problems that we’ve come across.”
There are indeed different ways of ensuring the sample is randomized, such as splitting users with a last-character logic or using open source randomization code.
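As a sketch of the second approach, a deterministic hash-based split is a common open-source-style technique (function name and threshold are ours, for illustration). Hashing the full user ID avoids the subtle biases a naive last-character split can pick up if IDs aren’t uniformly distributed:

```python
import hashlib

def assign_group(user_id: str, test_share: float = 0.5) -> str:
    """Deterministically assign a user to the Test or Control group.

    The same user_id always lands in the same group, and the SHA-256
    digest spreads users uniformly regardless of how IDs were generated.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "test" if bucket < test_share else "control"
```

Because the assignment is a pure function of the ID, any system (bidder, attribution, analytics) can recompute it independently without sharing state.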
Different channels will likely have different audiences. Additionally, depending on the platform or service you use to purchase traffic, it will probably be hard to run tests with the same (or even a similar) methodology and variables, rendering the results incomparable.
“I don’t think it makes much sense to compare across channels, as you have different objectives for each, and it depends a lot on your KPIs” — Tom Brooks, DeliveryHero
As Lars from Omio rightly pointed out, a lot of variables go into the mix. Their incrementality results varied from market to market and with the time of year. Their app is highly influenced by seasonality, and user behavior in their home market (Europe) is not the same as in other markets. It’s important to consider these characteristics before running any test.
This is a great point to highlight, especially since so many marketers are keen to know how their incrementality results compare to others’. While the curiosity is certainly understandable, it’s next to impossible to provide a benchmark considering the number of variables that influence performance. Going back to Lars’ observation: results can and do vary for the same app, so it’s very difficult to compare across apps.
So what’s a good uplift? If the test is conducted with a robust methodology, and the result is positive, it’s a good uplift.
Why? The Control Group shows the Organic Conversion Results (users who triggered the event without being prompted by an in-app ad), while the Treatment Group shows both Organic and Attributed conversions. Comparing the Treatment Group against this baseline lets you see the incremental value of the campaign. Organic data is also used to estimate the number of exposed users needed in each group to guarantee a statistically valid test.
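In other words, the uplift is simply the Treatment conversion rate measured against the organic baseline. A minimal sketch (the function name and example figures are ours):

```python
def uplift(treatment_conversions, treatment_users,
           control_conversions, control_users):
    """Relative incremental lift of Treatment over the organic baseline (Control)."""
    treatment_rate = treatment_conversions / treatment_users  # organic + attributed
    control_rate = control_conversions / control_users        # organic only
    return (treatment_rate - control_rate) / control_rate

# e.g. uplift(1300, 50_000, 1000, 50_000) ≈ 0.30, i.e. a +30% incremental lift
```

Note that if the two rates are equal, the uplift is zero — the ads changed nothing, however many conversions were attributed to the campaign.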
As technology and consensus among marketers evolve, we’ll likely continue to see the development of new and improved methodologies to facilitate this analysis. Either way, what’s common to all methodologies? Apples to apples: your campaign doesn’t have 100% reach, so you need a Treatment/Test group and a Control/Holdout group to ensure you are comparing exposed users against would-be-exposed users for a sound test.
We are always happy to chat with marketers about performance and growth marketing. Events like these offer a great opportunity to share experiences, learnings and challenges. Big thanks to the marketers that joined us on stage! 👏