
Avoiding Loss of Control in Mobile Game A/B Testing

Published: 2015-11-18 16:08:53

Author: Andrew Kope

I hear the phrase "A/B testing" almost every day. It is often touted as a cure-all for game design decision making: take human bias out of the equation and decide based on data, because "the numbers never lie." I am not arguing here that A/B testing does not work; rather, as with many of the things that cross my desk, the devil is in the details.

Consider a free-to-play racing game we have published. Users earn soft currency by completing races and spend it on new cars or upgrades. You can enter at most 10 races per day, each costing one action point, with the option to buy more action points or more soft currency via IAP. Retention is good, but user acquisition is on the pricey side given the game's relatively narrow target audience, so the executives are looking for a way to raise ARPU.

During a design meeting, someone suggests changing the UI so that the upgrade screen appears ahead of the race screen in the main menu. After some discussion, the team splits into two camps. One side thinks it is a great idea: making the upgrade screen more visible should raise ARPU, since that screen is a sink for in-game currency. The other side disagrees: making the race screen less visible will lead users to run fewer races and therefore spend fewer action points, another important sink for in-game currency.

AB Testing (image from kuqin)

How does the team resolve the split? How can we know which decision is right? Inevitably, someone suggests an A/B test. So the programmers go to work and the menu flow is changed via DLC: 50% of new and existing users now see the new menu highlighting the upgrade screen, while the other 50% still see the old menu highlighting the race screen. In three weeks the design team will have its answer, or maybe not.
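
As a rough illustration of that 50/50 split: the article does not describe the actual assignment mechanism, so the Python sketch below, including the function and experiment names, is purely hypothetical. It shows one common approach, hashing the user id so that each user always lands in the same variant for the life of the test:

import hashlib

def menu_variant(user_id: str, experiment: str = "menu_flow_test") -> str:
    """Deterministically assign a user to one of two menu variants.

    Hashing the user id together with an experiment name yields a stable
    50/50 split: the same user always sees the same menu, and renaming
    the experiment reshuffles the buckets for the next test.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "upgrade_first" if int(digest, 16) % 2 == 0 else "race_first"

# Example: the client asks which main-menu layout to show this user.
print(menu_variant("player-000123"))  # -> "upgrade_first" or "race_first"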

Two days before the DLC change, the marketing team changed its user acquisition strategy, switching to a different mix of advertisers and raising CPI bids in Tier 1 countries. On top of that, a week into the test the programming team fixed a server bug that had been slowing the download of new racetracks. Finally, a holiday such as Thanksgiving may have fallen within the test window, prompting an in-game Black Friday sale on all IAPs.

The assumption is that with enough users, the effects of everything except the menu change will even out: the so-called law of large numbers. However, both the Black Friday sale and the higher proportion of Tier 1 users may have affected which IAPs users were buying, and therefore which in-game currency sink was more accessible. And who can say how much shortening the download time of new racetracks improved engagement with that feature? It is now much harder than it was before the test to argue that the DLC menu change is the real cause of any change in ARPU or user behavior. In practice, there is almost no way to quantify individually how much effect these coincidental changes had on the game.
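
To make that worry concrete, here is a toy Python simulation with entirely invented spend figures (not data from any real game): the new menu has only a small true effect, but a concurrent sale interacts with the upgrade screen's visibility and inflates the lift measured during the test window:

import random

random.seed(42)

def cohort_arpu(n: int, upgrade_first: bool, sale_active: bool) -> float:
    """Average revenue per user for one cohort, using invented parameters.

    The menu change adds a modest lift on its own; the holiday sale
    interacts with it, boosting spend far more when the upgrade screen
    is the first thing users see.
    """
    base = 0.40                                # baseline mean spend per user
    lift = 0.05 if upgrade_first else 0.0      # "true" effect of the new menu
    sale = (0.20 if upgrade_first else 0.05) if sale_active else 0.0
    mean_spend = base + lift + sale
    return sum(random.expovariate(1.0 / mean_spend) for _ in range(n)) / n

n = 100_000
# During the test window the Black Friday sale is running for everyone.
old_menu = cohort_arpu(n, upgrade_first=False, sale_active=True)
new_menu = cohort_arpu(n, upgrade_first=True, sale_active=True)
print(f"measured lift during the sale: {100 * (new_menu - old_menu) / old_menu:.1f}%")

# In an ordinary week the same menu change would look much smaller.
old_quiet = cohort_arpu(n, upgrade_first=False, sale_active=False)
new_quiet = cohort_arpu(n, upgrade_first=True, sale_active=False)
print(f"lift without the sale:         {100 * (new_quiet - old_quiet) / old_quiet:.1f}%")

The toy numbers only illustrate the point: the difference measured during the test bundles the menu change together with everything else that happened that month.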

At its core, any A/B test of a new game feature is akin to running a psychological experiment with a treatment group and a control group, then trying to determine whether the experimental manipulation had a statistically significant effect. As a researcher in a psychology lab, you can take deliberate measures to ensure that your manipulation is the only difference between the treatment and control groups. As a game analyst, your experimental groups are the unfortunate victims of a whole host of factors outside your control.
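
For the "statistically significant effect" part, the analysis typically reduces to something like the two-sample comparison sketched below (a hypothetical Python example with invented payer counts, not figures from the game). A significant result only says the two groups differ; it does not say the menu change is the reason they differ:

from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(payers_a: int, users_a: int,
                          payers_b: int, users_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in payer conversion rates.

    Returns (z, p_value) for the null hypothesis that both menu
    variants convert users into payers at the same rate.
    """
    p_a, p_b = payers_a / users_a, payers_b / users_b
    pooled = (payers_a + payers_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented counts: payers / users in the race-first and upgrade-first halves.
z, p = two_proportion_z_test(1_180, 52_000, 1_295, 51_600)
print(f"z = {z:.2f}, p = {p:.4f}")
# Even a convincing p-value only shows the groups differ; it cannot
# untangle the menu change from the sale, the UA shift, or the bug fix.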

So where does that leave us? My point is not that A/B testing cannot work, nor that data-driven decisions have no place in the free-to-play mobile ecosystem. Rather, before rushing into another A/B test, analysts and designers should weigh the cost of achieving real experimental control (gamerboom note: namely, sacrificing the ability to make almost any other change to the game) against the risk of making decisions with bad data.

This article was compiled and translated by gamerboom.com. Reposting without retaining this copyright notice is prohibited; for reprint permission, please contact gamerboom.

Losing Control: A/B Testing in Mobile Games

by Andrew Kope

I hear the phrase “A/B testing” on an almost daily basis. It’s often touted as a cure-all for game design decision making – remove personal bias from the equation, and make data-driven decisions because “the numbers don’t lie” (like in Mark Robinson’s article here). Now I’m not saying that A/B testing can’t work, or can’t be effective… but as with a lot of things which cross my desk, the devil is in the details.

Consider the following: We have published a F2P racing game, where users earn soft currency by completing races, with new cars/upgrades costing soft currency to purchase. You can enter only 10 races per day each costing one ‘action point’, with the option to buy more action points or more soft currency via IAP. User retention is good, but maybe UA is a little pricey given the game’s relatively narrow target audience, so the execs are looking for a way to improve ARPU.

During a design meeting, the suggestion is made to change the UI so that the upgrade screen is visible ahead of the currently prominent race screen in the main menu… but after some discussion, the team divides. One side thinks this is a great idea; it will improve the ARPU by improving the visibility of the upgrade screen, a sink for in-game currency. The other side disagrees; downgrading the visibility of the race screen will make users run fewer races and therefore use up fewer of their action points, another important sink for IGC.

How does the team resolve this debate? How can we really know which decision is right? Invariably, the suggestion is made to A/B test. So, the programmers go to task and a change to the menu flow is made via DLC. 50% of new and existing users will now see the new menu highlighting upgrades, and the other half will see the old one highlighting races. In three weeks, the design team will have their answer… or maybe not.

Two days before the DLC change, the marketing team changed their UA strategy to use a different mix of advertisers and upped CPI bids in Tier 1 countries. Coupled with that, a week into the test the programming team fixed a server bug that was slowing down the download speed of new racetracks. Finally, during the two weeks since the change maybe we had a holiday like Thanksgiving, which prompted an in-game sale on all IAPs for Black Friday.

The assumption is that with enough users, the effects of everything but the menu change will even out: the so-called Law of Large Numbers. However, both the Black Friday sale and the higher ratio of Tier 1 users might have affected which IAPs users are making and thus which in-game currency sinks were most accessible. And who’s to say how much shortening the download time of new racetracks might have improved engagement with that feature? Now it seems a lot harder to argue that the DLC menu change is the real cause of any changes to ARPU or user behavior than it did before the test. In reality, there is almost no way to individually quantify how much of an effect these coincidental changes might have had on the game.

At its core, any A/B test of a new game feature is really akin to running a psychological experiment with a treatment and control group, and then trying to determine if there was a statistically significant effect of the experimental manipulation. As a researcher in a psychology lab, you can take deliberate measures to ensure that your manipulation is the only difference between your treatment and control groups. As a game analyst, your experimental groups are the unfortunate victims of a whole host of factors outside of your control.

So where does that leave us? My point here is not that A/B testing can’t be done, nor is it that data-driven decisions aren’t the way to go in the F2P mobile ecosystem. Instead, I’m suggesting that before rushing to suggest another A/B test, both analysts and designers should consider the cost of achieving real experimental control – namely sacrificing the ability to make almost any other changes to the game – or run the risk of trying to make good decisions with bad data. (source: Gamasutra)

 

