How To Get Started In A/B Testing: 6 Tips For Success
By Steve Collins
“One accurate measurement is worth 1,000 expert opinions”
- Admiral Grace Murray Hopper
Those words are probably the single most succinct argument for A/B testing we have. They sum up why any game studio might want to explore A/B testing as an aid to design and development – namely a desire to move beyond guessing what might work, and towards knowing what will.
Anyone who has experienced design by committee, or simply HiPPO (highest paid person’s opinion), should by now be aware that there’s a better way to make those decisions: using the real data supplied by real players within a data-driven decision-making process built on A/B testing.
But it’s one thing to be committed to the concept of testing. It’s quite another to know where to start. On that basis, this article provides 6 tips for successful testing that can help deliver real change to your business, moving the needle on the metrics we all care about: retention, conversion, and of course revenue. So without further ado…
Start Simple
It sounds obvious, but it’s worth stating – if you want to embed a testing culture in your organization, you’re looking for quick wins and a demonstrated ability to get results. That means adopting a ‘lean’ approach, getting quick validation of the concept, rather than building a large, complex structure before a single test has taken place.
The good news is that’s easy to do. Whatever approach you take technically, there is no serious difficulty in rolling out simple tests. The single largest objection is likely to be fear of ‘changing the game’ – but your game changes (or should change) on a regular basis anyway. Often on the basis of little more than a hunch.
We can also allay that fear by testing relatively uncontroversial aspects of the player experience that can nevertheless have big effects. These could include:
• The delay time before a first interstitial ad is shown to players. We can quickly establish what effect this change has on click-through rate, retention, and other KPIs.
• Alternative guide characters for tutorials. I’ve seen real differences in tutorial completion rates based on which character is used. Don’t agonize in design meetings – test it!
• Whether to show a registration screen or not (if appropriate). Is it smart to register players in order to deepen the relationship? Or does trying to do so lead to churn? Find out.
As we’re starting simple, getting true statistical significance in terms of your results may take a little extra work (although that won’t be the case if you’re using an A/B testing platform) – but it’s worth taking the time to do this. One group outperforming another means nothing in and of itself. Don’t take shortcuts here.
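If you are not using a testing platform, one common way to check significance for a conversion-style metric is a two-proportion z-test. The sketch below is only an illustration under stated assumptions: the function name and the player counts are hypothetical, the test assumes large, independent groups, and it is not presented as the method any particular platform uses.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and players in the control group.
    conv_b / n_b: conversions and players in the variant group.
    Assumes large samples and an overall rate strictly between 0 and 1.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: 20% vs 23% tutorial completion, 1,000 players each.
z, p = two_proportion_z_test(conv_a=200, n_a=1000, conv_b=230, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up numbers the variant looks three points better, yet the p-value is above the conventional 0.05 threshold. That is exactly the situation in which one group outperforming another means nothing in and of itself.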
Minimize Test Latency
We want to encourage frequent and fast testing. In fact, it’s vital that we do, because if an individual test takes too long to set up, requires code changes, and can only be pushed ‘live’ with a new app build, then the sunk cost of a test failure is too high to offset the benefits of our test wins. We end up in a situation in which it’s too risky, expensive or time-consuming to really embrace testing.
Thankfully, it’s easy to tackle the latency issue. The key to doing so is to get engineering out of the process. If every test requires re-coding, QA and sign-off from engineering, we aren’t going to get much done. And besides – they have better things to do.
By decoupling the testing framework from engineering release cycles, we can create short test cycles, often run by product management or even marketing teams. In order to do this, we need to build our game to be as ‘data-driven’ as possible, with a clear understanding and agreement as to which data points are ‘open’ to testing. When we’ve reached that point, setting up a test can become as simple as changing a value in a spreadsheet or testing platform (which is why marketing can do it!).
Even better, when we’ve established a winning variant, pushing that change out to the live environment is just as simple as setting up the initial test was. We can see test results making an impact quickly – and move on to the next challenge.
This approach also enables us to ‘lock down’ (for the time being) parts of the game which are not open to testing – thus reducing the risks associated with the program.
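To make the idea of a ‘data-driven’ game concrete, here is a minimal sketch in which tunable values live in data rather than code, only agreed parameters are open to testing, and each player is assigned to a variant deterministically. Every name here (DEFAULTS, OPEN_TO_TESTING, EXPERIMENTS, config_for) is hypothetical, and in practice the experiment table would be fetched from a server or testing platform rather than hard-coded.

```python
import hashlib

# Default game values; in a real game these would be loaded from data files.
DEFAULTS = {
    "first_ad_delay_seconds": 60,
    "tutorial_guide": "wizard",
    "item_price": 100,
}

# Parameters the team has agreed are 'open' to testing. The rest are locked.
OPEN_TO_TESTING = {"first_ad_delay_seconds", "tutorial_guide"}

# Experiment definitions: editing this table is all a new test requires.
EXPERIMENTS = [
    {
        "name": "ad_delay_test",
        "parameter": "first_ad_delay_seconds",
        "variants": {"control": 60, "longer_delay": 120},
    },
]


def assign_variant(player_id, experiment):
    """Pick a variant by hashing, so a player always sees the same one."""
    names = sorted(experiment["variants"])
    key = f"{experiment['name']}:{player_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(names)
    return names[bucket]


def config_for(player_id, experiments=EXPERIMENTS):
    """Return the defaults with any active test values laid over them."""
    config = dict(DEFAULTS)
    for experiment in experiments:
        parameter = experiment["parameter"]
        if parameter not in OPEN_TO_TESTING:
            continue  # locked-down parameters are never overridden
        variant = assign_variant(player_id, experiment)
        config[parameter] = experiment["variants"][variant]
    return config


print(config_for("player-42"))
```

Rolling out a winning variant then amounts to changing the default value in the same data, which is why it can be as simple as setting up the test was.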
Implement A ‘Kill Switch’
Sometimes things go wrong. In fact, when we’re encouraging ‘failure’ and creating an agile and adaptive data-driven culture, they often do. So make sure to minimize the impact of failure whilst maximizing learnings from negative results.
Your system must release your team from the ‘fear of failure’ – and the simplest way to do that is the kill switch. You want to have complete control of the test whilst it is live, and you want to be able to disable that test at any moment in time – without waiting for engineering input or a subsequent app release.
Your tests should be seen as overlays over the core game experience. The latter is the default setting, and it is one that ‘works’ – which is not to say it couldn’t use improvement (why are we testing, after all?). But it is essential that our kill switch can at any time return all players to this default state.
The good news is that with the correct A/B test QA procedures (which are relatively straightforward to set up), you will very rarely find yourself needing to use your kill switch. But the knowledge that it is there will free you up to experiment with greater purpose and freedom – which is the attitude you will need to make testing a success.
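One way to picture the kill switch is as a single flag checked every time a player’s experience is resolved. The sketch below assumes a hypothetical LiveTest record whose enabled flag is controlled remotely; the names and values are illustrative only.

```python
from dataclasses import dataclass, field

# The default experience: the state every player returns to when a test is killed.
DEFAULT_EXPERIENCE = {
    "show_registration_screen": False,
    "first_ad_delay_seconds": 60,
}


@dataclass
class LiveTest:
    name: str
    overrides: dict = field(default_factory=dict)
    enabled: bool = True  # in practice this flag would be set server-side

    def kill(self):
        """Disable the test without touching the app build."""
        self.enabled = False


def experience_for(test, in_variant_group):
    """Overlay the test on the defaults only while the test is enabled."""
    experience = dict(DEFAULT_EXPERIENCE)
    if test.enabled and in_variant_group:
        experience.update(test.overrides)
    return experience


test = LiveTest("registration_test", {"show_registration_screen": True})
print(experience_for(test, in_variant_group=True))   # variant experience
test.kill()
print(experience_for(test, in_variant_group=True))   # back to the default
```

Because the test is only ever an overlay, killing it needs no cleanup: the default experience was never modified.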
Isolate Your Variables
OK, this sounds obvious. Or rather, it should be obvious. But when tests are being designed it can be very easy to forget this essential principle, leaving you with a positive result but little information as to why you’ve seen success in a particular case.
So it bears repeating: always design your tests to test one thing and one thing only.
As an example of how easily we can forget this principle, consider the creation of a specific offer for a particular player segment – designed to test the effect of changing the price of a specific item in virtual currency. We might decide to ‘support’ that test by delivering an interstitial to the relevant player group. But as soon as we do that, we are now testing two things simultaneously – the price change itself and the use of the in-game message. And we haven’t even discussed the design and content of that message.
The correct way to proceed in this case would be to show interstitials to all groups (variant and control), pitching both prices as an offer and then establishing which one drove greater in-game purchases of the item. If we see an uplift, we can be reasonably confident that it was the price change alone that drove that result – although it would still remain to be tested that this result would hold with no interstitial at all.
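A simple guard against this mistake is to compare the control and variant definitions before launch and confirm that exactly one parameter differs. The sketch below uses hypothetical parameter names for the pricing example above; it is one possible check, not a feature of any specific testing tool.

```python
def changed_parameters(control, variant):
    """Return the sorted names of parameters whose values differ."""
    keys = control.keys() | variant.keys()
    return sorted(k for k in keys if control.get(k) != variant.get(k))


# Confounded design: the variant changes the price AND adds an interstitial.
control = {"item_price": 100, "show_interstitial": False}
confounded_variant = {"item_price": 80, "show_interstitial": True}
print(changed_parameters(control, confounded_variant))

# Isolated design: both groups see the interstitial; only the price differs.
control_with_message = {"item_price": 100, "show_interstitial": True}
isolated_variant = {"item_price": 80, "show_interstitial": True}
print(changed_parameters(control_with_message, isolated_variant))
```

If the first list has more than one entry, the test is measuring several things at once and the result cannot be attributed to any one of them.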
Similarly, when testing content changes, such as how items are described, try to focus on clear, repeatable changes. It’s one thing to know that a certain description works more effectively than an alternative. It’s a more powerful result when we learn in a concrete way which type of descriptions work most effectively.
Check Longitudinal Impacts
When testing, you’ll normally define the criteria for success in advance. In fact this step (and recording it) is important. But whilst we should define a conversion event, such as completion of the tutorial or a specific purchase, that is closely linked to the test itself, we should also take the time to examine the longitudinal impacts of the test.
By that, I mean we look in a broader sense at how our variant and control groups perform against certain KPIs over an extended period of time. A moment’s reflection tells us why this ‘sanity check’ is necessary. It would be perfectly possible to design a test in which an aggressive in-game message drives players to a specific purchase. If that offer were deceptive or dishonest, it isn’t hard to imagine a decline in retention, and hence in long-term revenue, even whilst the core test result is positive.
With that in mind, it’s always smart to look at the ‘whole business’ experience of your variant and control groups.
A word of caution. Always understand what you are looking for, and note in advance any KPIs you fear may be adversely affected. If we look at multiple metrics for multiple tests, it stands to reason that sooner or later we will see what appear to be significant results. Chances are, however, that these lie within normal, expected variation, and on that basis are not meaningful. Limiting ourselves to a specific set of KPIs that we might expect to change minimizes the risk of a ‘false positive’.
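The arithmetic behind that warning is easy to sketch. Assuming each comparison is independent and uses the same significance threshold (a simplification, since real KPIs are usually correlated), the chance of at least one false positive grows quickly with the number of metrics checked. The Bonferroni threshold shown is a common, conservative correction, named here as an illustration rather than as the article’s recommendation.

```python
def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_threshold(alpha, comparisons):
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / comparisons


for metrics in (1, 5, 20):
    rate = family_wise_error_rate(0.05, metrics)
    print(f"{metrics:>2} metrics: {rate:.0%} chance of a spurious 'win'")
```

At a 0.05 threshold, checking 20 independent metrics gives roughly a 64% chance that at least one appears significant by luck alone, which is why it pays to name the KPIs you care about before the test starts.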
Treat New and Existing Users Differently
Selecting the population of users to test will sometimes require a little planning. Having a testing framework that can limit the tests to sub-groups of your users is often very handy indeed. You might want to limit the test to users from a particular region, or to exclude certain users from ever being tested (e.g. your most loyal customers).
One segment of your users that you’ll often want to test separately is your new users. That is, you’ll want to apply tests only to users as they start to use the game for the first time.
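As a rough sketch of what such targeting might look like, the function below admits a player to a test only if they first played after the test began, are in an allowed region, and are not on an exclusion list. The field names and rules are hypothetical; a real framework would expose its own segmentation options.

```python
from datetime import datetime, timezone


def is_eligible(player, test_start, allowed_regions=None, excluded_ids=frozenset()):
    """Return True if this player may be entered into the test.

    player: dict with hypothetical keys 'id', 'region' and 'first_seen'.
    test_start: only players first seen at or after this time count as new.
    allowed_regions: optional set of regions; None means all regions.
    excluded_ids: players who must never be tested (e.g. most loyal customers).
    """
    if player["id"] in excluded_ids:
        return False
    if allowed_regions is not None and player["region"] not in allowed_regions:
        return False
    return player["first_seen"] >= test_start


test_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
new_player = {"id": "p1", "region": "EU",
              "first_seen": datetime(2024, 1, 5, tzinfo=timezone.utc)}
old_player = {"id": "p2", "region": "EU",
              "first_seen": datetime(2023, 6, 1, tzinfo=timezone.utc)}

print(is_eligible(new_player, test_start))  # first seen after the test began
print(is_eligible(old_player, test_start))  # existing user, left out
```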
Let’s imagine you are testing the layout of your in-game interstitials, which cross-promote other applications in your network. As part of the test, you are changing the button layouts within those interstitials in order to establish which drives a greater click-through rate.
The problem with this approach is that existing users will be used to the existing UI, will presumably have encountered the interstitials before, and their “learned behavior” will contaminate the test results. Existing users may, for example, click the wrong place by default, or may be frustrated by the change in UI. As a result you might see spurious results as existing users misclick or click in anger.
Instead, you should limit these types of tests to players who are new to the game and have not yet “learned” the UI, so that you get a valid assessment of how effective each UI variation is on fresh users. The underlying phenomenon is known as the “primacy effect” in psychology literature and relates to our natural predisposition to remember more effectively the first way we’ve experienced something. Your testing framework should allow you to restrict the test to new users only. (Source: GamesBrief)