
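Explaining the Application of Statistics in Game Design

Published: 2012-10-03 08:29:02

Author: Tyler Sigman

This article covers a selection of statistics topics that game designers should understand. Statistics is genuinely useful and important for system designers, mechanics designers, balance designers, and similar design roles. (See the first article in this series.)

Although statistics is grounded in mathematics, it can be a bit odd. Seriously: if you have ever had to work heavily with two-sided confidence intervals, Student's t-tests, and chi-squared tests, you know it can be hard to digest at times.

Generally, I prefer physics and mechanics, because much of the time you can check your answer simply by looking at reality. When you calculate the speed and direction of an apple falling from a tree, and your result says the apple should shoot straight up at 1,224 miles per hour, you can catch the mistake in your head.

At its best, statistics is understandable and rational; at its worst, it is a little strange. In any case, the topics in this article are not strange at all. For the most part they are tangible, meaty pieces of statistics that you can develop a gut feel for.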

statistics(from wired.com)


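Statistics: The Dark Science

Of all the sciences, statistics is the one most prone to misuse by the forces of evil.

Statistics can be compared to villainy because, used improperly, this branch of science can be made to infer all sorts of relationships that are meaningless or simply untrue (see the example at the end of this article). In the hands of politicians and other non-experts, it can steer big decisions. Big decisions based on wrong conclusions are never good.

That said, statistics is extremely useful and helpful when used properly. But like any power, it can be applied in nefarious ways, or in ways that are just plain useless.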

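Statistics: What's All the Fuss About?

I was all set to write a tight summary, but then I noticed that Wikipedia already has a definition that is close to poetry. Here it is:

Statistics is a branch of applied mathematics concerned with the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used for making informed decisions in business and government. (Courtesy Wikipedia.org)

That is a genuinely moving passage, especially the last part: "used for making informed decisions."

Of course, the writer forgot to add "in game design," but we can forgive this oversight toward our burgeoning industry.

Here is my own attempt:

Statistics is a branch of applied mathematics that deals with collecting and analyzing data in order to determine past trends, forecast future results, and gain more confidence about the things we want to know. (Courtesy Tylerpedia)

Adapted for game design, it could be stated like this:

Statistics shines a light on your broken mechanics and shattered design dreams. It does so by giving you hard, scientific data to support meaningful design decisions.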

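What You Need to Know

Like any hard science, statistics is deep and complex. As in Part 1, this article touches only on a few selected topics that I consider important enough to know.

Pop Quiz Again

I am sorry to say I have resorted to another test. Don't hate the quiz writer; hate the quiz.

Q1a) Twenty testers have just finished a level in your new snail racing game "S-car GO!" You learn that lap times ranged from 1 minute 24 seconds at the low end to 2 minutes 32 seconds at the high end. You were expecting an average time of about 2 minutes. Was the test a success?

Q1b) You collect more data for the same level, analyze it, and find: mean = 2 min 5 sec, standard deviation = 45 sec. Should you be satisfied?

Q2) You design a casual game that is about to ship. In final QA you release a beta build and collect data from the trial sessions. More than 1,000 play sessions are recorded from more than 100 unique players (some players were allowed to play repeatedly). Crunching the data gives a mean score of 52,000 pts with a standard deviation of 500 pts. Is the game ready to release?

Q3) You design an RPG and collect data on how long new players take to progress from level 1 to level 5. The data comes in as follows: 4.6 hrs, 3.9 hrs, 5.6 hrs, 0.2 hrs, 5.5 hrs, 4.4 hrs, 4.2 hrs, 5.3 hrs. Should you calculate the mean and standard deviation?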

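Populations and Samples

The foundation of statistics is the analysis of data. When working with data, there are two terms you need to know:

1.Population:

The population is the entirety of the field you want to measure. The population is arbitrary and depends only on what you wish to measure. For example, say you want to know what people think about a particular issue. Your population could be everyone on Earth, everyone in Iowa, or just everyone on your street.

2.Sample:

A sample is the portion of the population for which measurements are actually taken. For obvious reasons, it is often too hard to gather data for an entire population, so you gather data for part of it instead. That is your sample.

Accuracy and Sample Size

The reliability of a statistical conclusion depends heavily on the size of the sample.

In a perfect world, your sample would be your entire population; that is, you would collect data on everything that matters to you. Anything less means you have to infer trends (a mathematical inference, but an inference nonetheless). Furthermore, the more data points the better; you would rather have a large data set than a tiny one.

For example, instead of polling 10,000 junior high school students about how they feel about Fruit Roll-Ups, imagine being able to poll every single one. Failing that, polling 1,000,000 would be great. Failing that, 100,000 would be nice. Failing that, fine, 10,000 will do.

Because of time and money, studies are usually performed on samples rather than on entire populations.

1.The Common Sense Rule of Statistics: more is better

You cannot predict a trend from one data point. If you know I like chocolate ice cream, you cannot conclude that all Sigmans like chocolate ice cream. If you ask many members of my family, though, you might be able to draw a reasonable conclusion about what the rest think, or at least know whether a reasonable conclusion can be drawn.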

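Wide Distributions (Important!)

For reasons that only The Big Guy can explain, many things in life tend to follow similar patterns, or distributions.

The most common one has a fitting name: the "normal distribution." That's right, anything that does not match it is abnormal, and therefore a bit weird (and should be shunned appropriately).

The normal distribution is also called the "Gaussian distribution," mainly because "normal" does not sound scientific enough.

The normal distribution is also called the "bell curve," because the curve is shaped like a bell.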

bell curve(from gamasutra)


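The distinguishing feature of a bell curve is that most of the population clusters around the mean, while comparatively few values are scattered at the extremes (high or low). The cluster in the middle forms the body of the bell; the highs and lows form its rim.

We see the normal distribution around us in a million different things. If you measured the height of everyone in your city, the results would probably be normally distributed. A tiny few would be unusually short, a tiny few would be Yao Ming tall, and the great majority would be within a few inches of the average.

The bell curve also typically holds for people's skill levels. Take sports: a tiny few are good enough to play professionally, the great majority are good enough to get by, and a few are so bad that they never get picked for teams (like me).

Other Distributions

The normal distribution, perfect as it may be, is not the only distribution around. It is just very common.

For example, some other distributions relate directly to gambling and game design. Just look at the probability distributions of dice throws, in this case a d6 and then 2d6: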

D6 distribution(from gamasutra)


2d6 distribution(from gamasutra)


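For now, all I will say is that the first distribution looks nothing like a bell curve, whereas the second is starting to resemble one.

The Mean

Consider this small section an intermission in an otherwise lengthy article. This small, self-referential section exists for only one purpose: to remind you what a "mean" is. This small, self-referential, and pedantic section would like to passively remind you that the mean is the mathematical average of a set of data.

Variance and Standard Deviation

Variance and standard deviation are important to understand, and they have a lot of tangible value. Besides helping us draw useful statistical conclusions, these terms let us talk about distributions more intelligently. Instead of saying "a great many data points cluster about the middle," we can say "68.2% of the sample falls within one standard deviation of the mean."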

sigman(from gamasutra)


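Variance and standard deviation are related, and both measure the same thing: data scatter. Intuitively, a high variance or standard deviation means your data is all over the place. When I throw darts, I get a high variance.

Variance and standard deviation can be calculated from any data set. I would put the equations here, but that would break my "don't sound like a textbook" rule. So instead of a formula, here is a description:

Standard deviation: the average amount by which data points in the sample or population differ from the mean. It is represented by the Greek letter σ (sigma).

For example, say you pick 100 people and measure how long each takes to complete the first level of your new game. Suppose the mean of all the data is 2 minutes 30 seconds and the standard deviation is 15 seconds. This standard deviation describes how tightly the play sessions are grouped. Here it says that, on average, play sessions fall within ±0.25 minutes of the 2.5-minute mean. That is quite consistent.

What does this mean, and why should you care? Simple. Suppose that instead of the results above you got these:

Mean = 2.5 minutes (same as above)

σ = 90 seconds = 1.5 minutes

Now we have the same mean but a very different standard deviation. These numbers show much more scatter in play times: on average, play times are about 90 seconds away from the mean. Given that the mean play time is only 2.5 minutes, that is huge! For various design reasons, that much scatter is usually not what a designer wants to see.

It would be a very different story with a standard deviation of 90 seconds (1.5 minutes) on play times of 15 minutes.

A small standard deviation indicates consistency. Divide the standard deviation by the mean to get a useful ratio. In the first example, 15 sec / 150 sec = 10%; in the second, 90 sec / 150 sec = 60%. A standard deviation of 60% is clearly very large!

This is not to say that a large standard deviation is always bad. Sometimes designers want a large standard deviation in whatever they are measuring. Most of the time, though, it is bad, because it means a lot of scatter and variability.

More importantly, calculating standard deviation tells you a lot about your game, mechanic, level, and so on. Here are some useful things to measure standard deviation for:

1.Play time per level

2.Play time for the whole game

3.Number of combat rounds needed to defeat a typical enemy

4.Number of coins collected (games with an Italian plumber)

5.Number of rings collected (games with a fast blue hedgehog)

6.Number of times the controller is thrown at the screen during the tutorial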

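Margin of Error

Margins of error go hand in hand with statistical conclusions. Every Gallup poll (Gamerboom note: a survey run by the American Institute of Public Opinion) comes with a margin of error, such as ±2.0%. Because polls use samples to estimate a population, they can never be 100% accurate. The margin of error indicates how accurate the results are. Whenever the population you are talking about is larger than your sample, you need to account for the margin of error.

If you take data from the entire population, you do not need a margin of error, because you already have all the data! For example, if I ask everyone on my street whether they prefer chess or Go, I do not need a margin of error as long as I am only reporting on the people on my street. But if I want to draw a conclusion about everyone in my town from the people on my street, I have to calculate a margin of error.

The bigger your sample, the smaller your margin of error. Mo data is bettuh (more data is better).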

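Confidence Intervals

You can use inferential statistics to draw conclusions about future data. One useful technique is calculating confidence intervals. Conceptually, confidence intervals are closely related to standard deviation; they are a mathematical way of saying how certain we are that a given piece of data falls within a specified range.

Confidence interval: a mathematical way of saying "we can guarantee with A% confidence that B% of the data will fall between values C and D."

That is a mouthful, but it is useful to know, with a stated level of confidence, what a value is likely to be. For a good example, let me step back into my previous, pleasant but ultimately unsatisfying career:

I used to do stress analysis and design of aircraft parts. If there is one thing you know, or need to know, about aircraft, and commercial aircraft in particular, it is that they are the most heavily regulated form of modern transportation. People do not like it when wings fall off planes.

One method we aircraft engineers use is designing to a very high confidence interval for material strength properties. A typical one in aircraft design is the "A-basis allowable," which means we are 95% confident that 99% of the values in any given shipment of a specified material fall above a certain value. We then design to that value against the worst possible flight conditions, and finally apply a large factor of safety.

Confidence intervals are very helpful when you really want to know what data values to expect. Fortunately, games are rarely a matter of life and death, but if you are trying to balance a console game, you probably want more than gut feel and intuition to go on. Calculating confidence intervals can give you a clearer picture of how people play your game and whether there are obvious exploits.

Whenever you want to calculate confidence intervals, the old standby rule of statistics still holds: more data is better. The more data points in your sample, the better your confidence interval will be.

You Can Never Be 100% Sure

This brings up another rule of statistics:

100% does not exist: you will never achieve a confidence interval of 100%. You can never guarantee through inferential statistics that a predicted data point will have a particular value.

The only sure things in life are death, taxes, and being unable to find the last Yeti Hide when doing a quest in World of Warcraft. Accept these facts and move on.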

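Misuse

I mentioned earlier that statistics is a villainous skill. To explain why, I wrote this bullet-point love poem:

Sonnet 1325: Beautiful statistics, let me count the ways I misuse you:

1.Misunderstanding

2.Not stating confidence intervals

3.Discarding valid conclusions because you don't like them

4.Drawing conclusions from flawed data

5.Sportscaster errors: mixing up errors of probability and statistics

6.Drawing conclusions from unrelated factors

Misunderstanding

People misunderstand statistical statements all the time. I know, it is hard to believe.

Not Stating Confidence Intervals or Margins of Error

Confidence intervals and margins of error are vital pieces of information. There is a huge difference between saying that 43% of PC owners bought a downloadable game in the past 30 days (margin of error 40%) and making the same statement with a margin of error of 2%. When the margin of error is left out, always assume the worst. Remember: small sample = high margin of error.

Discarding Valid Conclusions Because of Bias

Used properly, statistics do not lie. But people lie to themselves all the time. We see this often in politics, where statistical studies are ignored simply because the conclusions are not the ones that were hoped for. The same happens with focus groups. Of course, statistics are also badly misused in politics.

Drawing Conclusions from Flawed Data

This happens a lot, especially in market research. Your statistical conclusions are only as good as the data behind them. If the data is flawed, the conclusions are worth little. Flawed data has many causes, ranging from honest mistakes to serious manipulation. Asking loaded questions is an easy way to get flawed data that supports whatever conclusion you were hoping for. "Do you prefer Product X, or that awful Product Y?" quickly leads to seemingly bulletproof statements like "95% of consumers prefer Product X!"

Sportscaster Errors

Sportscasters are the witch doctors of our time. They gather a bit of statistics, a bit of probability, and a bit of gut feeling, then mix them into something terrible. If you want to see statistics thrown around in support of groundless conclusions, just watch a football game.

For example, an announcer might say, "Team A has not blocked a kick against Team B in the last 5 games." The implied conclusion is that Team A is less likely to block a kick than if it had blocked one in the last 5 games against Team B. But you could just as well argue the reverse: maybe they are more likely to, since they have not blocked one in a while!

The truth is that there is not enough information to support either claim. It is probably more a matter of probability anyway. Does the chance of blocking a kick really depend on whether one was blocked in an earlier game? They are probably independent events, unless there are factors linking them.

This is not to say that all sports conclusions are flawed. Statistics is very important in baseball, for instance, where statistical analysis sometimes guides which pitch is thrown or what the batting lineup will be.

It all comes down to data: the more data you have, the better your statistical conclusions. Baseball supplies plenty of data, with almost 200 games per season. Football has far fewer games, so the margins of error are larger. I am not saying statistics is useless for football, only that it is harder to mine useful, contextual data.

Drawing Conclusions from Unrelated Factors

People misunderstand statistical statements all the time. By comparing two trends, it is easy to infer deeper relationships that do not actually exist. My favorite example is the famous "Pirates vs. Global Warming" graph in the Open Letter to the Kansas School Board from the Church of the Flying Spaghetti Monster (Gamerboom note: a satirical, fictional religion):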

http://www.venganza.org/about/open-letter/

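Can We Get to the Quiz Answers Now?

Q1a Answer: Level Times

The answer is simple: you do not have enough information to calculate the average. Just because the values ranged from 1:24 to 2:32 does not mean they average out to 2 minutes. (Those two values alone average 1.97 minutes, but we do not know the other 18 results!) You need all 20 results to calculate the mean, and you should calculate the standard deviation as well.

Q1b Answer: Level Times, Part Two

Here you probably should not be satisfied, because the standard deviation is quite high, roughly 36% of the mean (45 sec / 125 sec). That suggests too much variation in your level. There may be a sizable exploit that skilled players are using to their advantage. Alternatively, you may be punishing less-skilled players too harshly. As the game designer, you ultimately have to judge whether these results (high variation) are what you intended.

Q2 Answer: Standard Deviation

Statistics only gets you part of the way; you still need game design sense. In this case the scores are grouped far too closely. A standard deviation that low (500 / 52,000 ≈ 1%) means there is hardly any variation in scores, which in turn means that differences in player skill barely affect the final result. Players who see that their scores do not improve as they get better at the game will most likely quit.

This is a situation where you would much rather see a higher standard deviation, so that scores rise as skill improves.

Q3 Answer: Play Times

This one is a bit of a trick question, but it illustrates an important point about data collection: watch out for data that looks wrong. The value 0.2 hrs looks suspicious. It could be a typo or an equipment malfunction; who knows. Either way, before doing any calculations you should either convince yourself beyond doubt that 0.2 hrs is a valid data point, or throw it out and calculate from the remaining data points.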

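Other Interesting Topics

To keep this article to a reasonable length, I have had to skip many interesting topics. Suffice it to say that understanding statistics helps not only with game design but also with consumer decisions, voting decisions, and financial decisions. I am 23.4% sure that at least 40% of what I just said is true.

For designers, statistics is most useful for crunching data from recorded play sessions (the sample) and drawing conclusions about the larger field of unrecorded play sessions (the population).

Learn by Doing

For example, in the game I just finished, we recorded data from play sessions and set the game's challenge levels based on the mean and standard deviation of that data. We set Medium difficulty equal to the mean, Easy equal to the mean minus a certain number of standard deviations, and Hard equal to the mean plus a certain number of standard deviations. The more data we had collected, the more accurate the results would have been.

As with probability theory, statistics becomes more useful as the scope of your project grows. Much of the time you can fumble your way through without applying any formal theory. But as your game, your audience, and your budget grow, so does the risk from flaws embedded in an unbalanced, seat-of-the-pants design.

Remember that statistics and probability will not do your game design for you. They just help you do it better!

Gamerboom note: the original article was published on January 24, 2007, and the events and data it mentions reflect that time. (Compiled by gamerboom.com. Reposting without the copyright notice is not permitted; to repost, please contact Gamerboom.)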

Statistically Speaking, It’s Probably a Good Game, Part 2: Statistics for Game Designers

by Tyler Sigman

If you’re reading this, then chances are you also read Part 1, “Probability for Game Designers.”

If you haven’t read it, you really should, and that’s not to say it is full of good stuff (the article is tripe, actually). I just recommend reading it because if you don’t, you might be unprepared for the silliness that may ensue during this serious *ahem* and erudite *cough* discussion of statistics.

This article focuses on a few select statistical topics that I believe should be understood by game designers. In particular, statistics really is useful and important for system designers, mechanicians, balancers, and other subclasses of designer that are usually relegated to steerage.

Disclaimer taken care of, let’s move on to the fizzy stuff!

Statistics: A Two-Drink Minimum Science

Although heavily grounded in mathematics, statistics is…well…weird! Seriously – if you ever have to start dealing heavily in two-sided confidence intervals and Student’s T-tests and chi-squared tests (or anything else squared, for that matter), it can get a little hard to digest at times.

You see, people like me really prefer physical metaphors. I’ve always liked physics and mechanics, because a lot of the time you can give yourself a reality check simply by analyzing reality. When you’re calculating the rate and direction at which an apple falls from a tree, you can reality check it in your head if your result says the apple should shoot off straight upward at 1,224 MPH.

At its best, statistics is understandable and rational; at its worst, it’s a little strange. Hence, I recommend libations and togas for any involved statistics discussion. I have asked the fine editors at Gamasutra to provide such togas and an open digital bar. What, didn’t you get your passcode? Hmmm, weird.

In any case, the topics in this article aren’t weird at all. For the most part, they are tangible, crunchy bits of statistics that you can develop gut feels for.

Statistics: The Dark Science

Statistics is, of all the sciences, the one that is very prone to misuse by the Forces of Evil. That is, if you had to attribute one science to the villain you are creating for your new book (you are writing a book, aren’t you?), you could do much worse than pick statistics. You could also give him a cape, dress him in black, and refer to him as “The Spider” or “Mr. Jones”, but I digress.

The reason that statistics can be loosely compared to villainy is that, used improperly, this branch of science can be called upon to infer all sorts of relationships that aren’t actually meaningful or even true (see the end of this article for an example of what I mean). When in the hands of politicians and other ne’er do wells, this can guide big decisions. Big decisions based upon inaccurate conclusions are never good.

All this is to say, statistics is incredibly useful and helpful when used properly. But like any superpower, it can be applied in nefarious ways, or even just plain dumb ways.

Statistics – What’s All The Fuss About?

I was going to crack my knuckles and write a tight summary, but then noticed that Wikipedia already had something that was darn near poetry. Here it is:

Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used for making informed decisions in all areas of business and government. (Courtesy Wikipedia.org)

That’s actually a very moving passage. In particular, the last bit is the tour de force of the paragraph:

…it is also used for making informed decisions…

Of course, the writer forgot to add “in game design,” but we can forgive him his condescension towards our burgeoning industry.

Here’s my own try:

Statistics is a mathematical science that deals with collecting and analyzing data in order to determine past trends, forecast future results, and gain a level of confidence about stuff that we want to know more about. (Courtesy Tylerpedia)

And if I were to modify it for Game Design, I would say (and am, in fact, saying):

Statistics can help you shine a flashlight upon your broken mechanics and shattered design dreams. It does this by giving you actual hard, scientific data to support meaningful design decisions.

What Do We Need to Know?

Statistics, like any hard science, is deep and complex. Like the tour of Probability in Part 1, this article only touches on a few selected topics that I, in my unlimited hubris, have deemed Important Enough to Know®. (Yep – unlike the many TMs I throw around, this one is so potent it’s registered!)

Pop Quiz Again

I’m sad to say that I have resorted to another test. Don’t hate the Quizza, hate the Quiz.

Q1a) Focus testers have just finished playing through a level in your new snail racing game “S-car GO!” Twenty testers played, and you are informed that the lap times came back in a range from 1 min 24 seconds at the low end to 2 min 32 seconds at the high end. You were expecting an average time of 2 minutes or so. Was the test a success?

Q1b) You collect more data for the same level, do some analysis, and find that the stats are: mean = 2 min 5 sec, standard deviation = 45 sec. Should you be satisfied?

Q2) You design a casual game that will surely soon be the talk of soccer moms everywhere (an admirable goal). In final QA, you release a beta build and then take data on a whole bunch of trial sessions. Over 1,000 play sessions are recorded, with over 100 unique players (some players were allowed to play repeated sessions). Crunching the data shows a mean score of 52,000 pts with a standard deviation of 500 pts. Is the game tuned up enough to release?

Q3) You design an RPG, and then collect data on how long it takes new players to progress from level 1 to level 5. The data comes in as follows: 4.6 hrs, 3.9 hrs, 5.6 hrs, 0.2 hrs, 5.5 hrs, 4.4 hrs, 4.2 hrs, 5.3 hrs. Should you calculate the mean and standard deviation?

Populations and Samples

The base of statistics is the analysis of data. When dealing with data, there are two main terms that you need to know:

1.Population: the entirety of a field for which measurements are to be taken. The population is arbitrary, and is dependent only on what you wish to measure. For example, say you want to know what people think about a particular issue. Your chosen population could be all of the people on earth, all of the people in Iowa, or just all the people on your street.

2.Sample: a portion of the population for which measurements are actually taken. For very obvious reasons, it’s often too hard to gather data for an entire population. Instead, you gather data for a portion of the population. This is your sample.

Accuracy and Sample Size

The strength of a statistical conclusion is extremely sensitive to the size of your sample.

In a perfect world, you’d always like your sample size to be equal to your population–that is, you want to collect data on the entirety of whatever matters to you! Because anything less means you have to infer trends (a mathematical inference, but an inference nonetheless). Furthermore, the more data points, the better; you’d rather have a giant population than a tiny one.

Marketers and politicians would give their left brains to get a sample that is equal to their (large) population of interest. For example, instead of polling 10,000 junior high school kids to get an idea of how they feel about Fruit Roll-Ups®, imagine if they could poll *every junior high school kid*. Failing that, polling 1,000,000 would be super. Failing that, 100,000 would be dang nice. Failing that…okay, 10,000 will do.

It is for reasons of time and money that studies are performed on samples rather than entire populations.

1. The Common Sense Rule of Statistics: mo is bettuh

You can’t predict a trend with one data point. If you know I like chocolate ice cream, you can’t draw any meaningful conclusions about what all Sigmans like. Now if you ask many members of my family, then you might be able to draw a reasonable conclusion about what the rest think…or at least know *whether* you can draw a reasonable conclusion. Ain’t stats fun?
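To see "mo is bettuh" in action, here is a minimal simulation sketch. Everything in it is made up for illustration: the population of 100,000 level times, its mean of 150 seconds, and the helper name `average_miss`. It repeatedly draws samples of different sizes and reports how far the sample mean typically lands from the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 level completion times in seconds.
population = [random.gauss(150, 30) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def average_miss(sample_size, trials=200):
    """Average distance between a sample mean and the population mean."""
    misses = []
    for _ in range(trials):
        sample = random.sample(population, sample_size)
        misses.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(misses)

for n in (1, 10, 100, 1000):
    print(f"sample size {n:>4}: off by about {average_miss(n):.1f} sec")
```

The miss should shrink steadily as the sample grows, which is the whole rule in one loop.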

Population Explosions and Wide Distributions (BEEP! BEEP!)

For reasons that only The Big Guy can explain, many things in life tend to follow similar patterns, or distributions.

One of the most common is the aptly-named “normal distribution.” That’s right, anything not matching this is abnormal, and therefore weird (and should be shunned appropriately).

The normal distribution is also known as a “Gaussian” distribution, primarily because “normal” doesn’t sound scientific enough.

The normal distribution is also commonly called a “bell curve” because, well, just look at the durned thing, will ya!?

The distinguishing characteristics of a bell curve distribution are that most of the population are clustered closely around the mean, or average, value, and comparatively few are scattered at the extremes (high or low). This middle-clustering leads to the bell-curve appearance; the highs and lows are the flange of the bell.

We see the bell curve around us in a million different things. If you measured the heights of all the people in your city, they’d probably match this distribution. That is, a tiny few would be super-abnormally short, a tiny few would be super-Yao Ming tall, and a great many would be within a few inches of the average.

The bell curve typically holds true whenever you are looking at people’s skill levels, too. Take sports – a tiny few are good enough to play professionally, a great many are good enough to get by, and a tiny few are so bad that they don’t get picked to be on teams (like me).

Other Distributions

The normal distribution, despite being swell, isn’t the only distribution around. It’s just amazingly common.

For examples of some additional distributions that are directly related to gaming and game design, just take a look at the probability distributions of dice throws, in this case a d6 and then a 2d6 throw:

In part 3 of this series, which should hit Gamasutra shelves around 2010, I’m going to spend a bunch more time talking about these dice distributions. For now, all I’m going to say is that the first one looks nothing like a bell curve, whereas the second throw is starting to resemble one (but still isn’t quite there yet).
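If you want to reproduce those two dice distributions yourself, a small sketch like the following enumerates every possible roll and counts the totals. The function name `dice_distribution` is just an illustrative choice, and the text bars are a crude stand-in for a chart.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def dice_distribution(num_dice, sides=6):
    """Exact probability of each total when rolling num_dice dice."""
    faces = range(1, sides + 1)
    counts = Counter(sum(roll) for roll in product(faces, repeat=num_dice))
    total = sides ** num_dice
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}

for num_dice in (1, 2):
    print(f"{num_dice}d6")
    for value, prob in dice_distribution(num_dice).items():
        # One '#' per 1/36 of probability, so both charts share a scale.
        print(f"  {value:>2}: {'#' * round(float(prob) * 36)} {prob}")
```

The single die comes out flat, while the two-dice totals pile up in the middle around 7.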

Means to an End

Consider this tiny section an intermission embedded within an otherwise tedious article. This tiny, self-referential section serves only one purpose in life: to remind you of what a “mean” is. This tiny, self-referential, and pedantic section would like to passively remind you that a mean is the mathematical average of a set of data.

This tiny, self-referential, pedantic, passive, and well-meaning section hopes that you take something meaningful away from reading it; for it is now that this tiny, self-referential, pedantic, passive, and pun-throwing paragraph must end.

Variance and Standard Deviation

Variance and standard deviation are very important to understand, and have a lot of tangible value. Aside from helping us draw valuable statistical conclusions, these terms enable us to speak a lot more intelligently about distributions. Instead of saying “a great many data points cluster about the middle”, we can say “68.2% of the sample falls within one standard deviation of the mean.” Chicks dig that speak; guys dig that speak; heck, who doesn’t dig that speak?
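That "within one standard deviation" figure holds for normally distributed data, and you can check it with Python's standard library. This is only a sketch; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The first line printed should be very close to the 68.2% quoted above; the others show how quickly the tails thin out.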

Variance and standard deviation are related to each other, and they both measure the same thing: data scatter. Intuitively, a high variance or standard deviation means your data is all over the place. When I play darts, I get a high variance in my throws.

Variance and standard deviation can be easily calculated from any set of data that you have. I’d put the equations in here, but that would break my “don’t sound like a textbook” rule. So instead of an equation, here’s a description:

Standard Deviation: the average amount by which data points in the sample or population differ from the mean. Standard deviation is represented by the Greek letter σ (sigma)
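No equations, as promised, but a few lines of code do not count as a textbook. The sketch below uses made-up level times and Python's `statistics` module. Strictly speaking, the standard deviation is the square root of the average squared distance from the mean, which lands close to the "average amount" described above; the library handles that detail for you.

```python
import statistics

# Hypothetical Level 1 completion times in seconds (made-up data).
times = [135, 150, 162, 148, 171, 139, 155, 140]

mean = statistics.fmean(times)
spread_population = statistics.pstdev(times)  # treats the list as the whole population
spread_sample = statistics.stdev(times)       # treats the list as a sample (divides by n - 1)

print(f"mean = {mean:.1f} sec")
print(f"population standard deviation = {spread_population:.1f} sec")
print(f"sample standard deviation     = {spread_sample:.1f} sec")
```

Use the sample version when your data is only part of the population you care about, which for playtests is nearly always.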

In other words, say you test 100 people on how long it takes them to complete Level 1 in your newest game. Let’s assume the average (mean) of all the data is 2 minutes 30 seconds. Now assume the standard deviation calculates out to be 15 seconds. This standard deviation indicates the grouping or “clumping” of the play sessions. In this case, it’s saying that on average, play sessions are within ±0.25 minutes of 2.5 minutes. That’s pretty consistent.

What does this mean and why do you care? Easy. Pretend that instead of the above results, you got these results:

Mean = 2.5 minutes (same as above)
σ = 90 seconds = 1.5 minutes

So here we have the same mean but a vastly different standard deviation. This set of numbers means that you have much more scatter in the play times. On average, play times are about 90 seconds off of the mean play time. Given that the mean play time is only 2.5 minutes, that’s huge! And it’s probably not good to have that much scatter, for various game design reasons.

It would be much different if you were talking about a standard deviation of 90 seconds (1.5 minutes) on play times of 15 minutes.

Consistency is measured by a small standard deviation. Ratio your standard deviation against your mean to get a good warm-fuzzy number. In the first example, 15 sec / 150 sec = 10%. In the second, 90 sec / 150 sec = 60%. A standard deviation of 60% is bigggggg with indulgently repeated g’s. In the third, 90 sec / 900 sec = 10% again…respectable.
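The ratio above is simple enough to wrap in a helper. Here is a minimal sketch that recomputes the three examples; the name `relative_spread` is an illustrative choice (statisticians call this ratio the coefficient of variation).

```python
def relative_spread(std_dev, mean):
    """Standard deviation as a fraction of the mean (both in the same units)."""
    return std_dev / mean

examples = {
    "15 sec on a 2.5 min level": (15, 150),
    "90 sec on a 2.5 min level": (90, 150),
    "90 sec on a 15 min level": (90, 900),
}
for label, (std_dev, mean) in examples.items():
    print(f"{label}: {relative_spread(std_dev, mean):.0%}")
```

Keeping both numbers in seconds before dividing avoids the classic minutes-versus-seconds slip.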

This is not to say that a large standard deviation is *always* bad. Sometimes as designers we want a large standard deviation in whatever we’re measuring. But a lot of times it’s bad, because it represents a lot of scatter and variability.

The important thing is that calculating standard deviation will tell you a lot about your game/mechanic/level/etc. Examples of useful things to measure standard deviation for:

1.Level play times

2.Whole-game play times

3.Number of combat rounds it takes to defeat a typical enemy

4.Number of coins collected (games with small Italian plumbers)

5.Number of rings collected (games with fast, blue hedgehogs)

6.Times controller is thrown at screen during your tutorial

Margins of Error

Margins of Error go hand in hand with statistical conclusions. Think of every Gallup Poll you’ve ever seen; there is always a margin of error expressed, such as ±2.0%. Because polls are using samples to estimate a population, there can never be 100% confidence (see later in the article). Margin of Error indicates how accurate the results are. It is absolutely vital to know Margin of Error whenever you are talking about a population bigger than your sample.

If you take data on your entire population, then theoretically you don’t need a Margin of Error – you already know all the data! For example, if I ask everyone on my street whether they prefer Chess or Go, then I don’t need a Margin of Error as long as I am just reporting about people on my street. But if I want to draw a conclusion about everyone in my town based upon the data points from my street, then I have to calculate Margin of Error.

The bigger your sample size is, the smaller your Margin of Error. Mo data is bettuh.
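For a rough feel of how sample size drives Margin of Error, here is a sketch using the common normal-approximation formula for a polled percentage. The formula and the 1.96 multiplier (for roughly 95% confidence) are standard textbook choices rather than anything from this article, and the sample sizes are made up.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate 95% margin of error for a polled proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# A hypothetical poll where 50% of respondents prefer Chess over Go.
for n in (100, 400, 1600):
    print(f"n = {n:>4}: 50% ± {margin_of_error(0.5, n):.1%}")
```

Notice the diminishing returns: you have to quadruple the sample to cut the margin in half.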

(Self-)Confidence Intervals

You can use inferential statistics to draw conclusions about future data. One useful trick is the calculation of confidence intervals. Conceptually, confidence intervals are closely related to standard deviation, and are basically a mathematical way of saying how certain we are that a given piece of data will fall in a specified range.

Confidence interval: a mathematical way of saying “we can guarantee with A% confidence that B% of the data will be between values C and D.”

That’s a mouthful. But it’s useful to know, with a specified amount of confidence, what a value is likely to be. For a good example, I’m going to step back into my previous career for a blissful yet ultimately unsatisfying moment:

I used to do stress analysis and design of aircraft bits and bobs. If you know, or need to know, anything about aircraft – and commercial aircraft in particular – it’s that it is the most regulated form of transportation that exists. People don’t like it when wings fall off of planes. ‘nuff said.

One of the methods we engineers use to keep said wings on said planes is designing to a very high confidence interval of material strength properties. A typical confidence interval used for aircraft design is the “A-basis allowable”, which means we are 95% confident that 99% of the values in any given shipment of a specified material fall above a certain value. Then, we design to that value against the worst possible air conditions, and then finally apply a big factor of safety on it. Gotta be sure.

Confidence intervals are very informative and useful whenever you *really want to know* what kind of data values to expect. Fortunately, games are not typically a matter of life and death, but if you are trying to balance an (unpatchable) console game, you probably want to have more than gut feel and intuition to go on. Calculating confidence intervals could be used to give you hard facts about how your game plays, and whether there are obvious exploits.

Whenever you want to calculate good confidence intervals, the ol’ standby rule of statistics still holds true: mo is bettuh. The more data points you have in your sample, the better your confidence interval calculation will be.
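As a minimal sketch, here is the most common textbook flavor of confidence interval: a range for the true mean of a data set. Note that this is narrower in scope than the "A% confident that B% of the data" statement above, and it leans on a normal approximation that gets shaky for very small samples. The data and the function name `mean_confidence_interval` are made up for illustration.

```python
import math
import statistics

def mean_confidence_interval(data, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    center = statistics.fmean(data)
    half_width = z * statistics.stdev(data) / math.sqrt(len(data))
    return center - half_width, center + half_width

# Hypothetical level times in seconds (made-up data).
times = [135, 150, 162, 148, 171, 139, 155, 140]
low, high = mean_confidence_interval(times)
print(f"95% confident the true mean lies between {low:.1f} and {high:.1f} sec")
```

Feed it more data points with the same scatter and the interval tightens, which is the standby rule again.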

You Can Never Be Sure

This brings up another rule of statistics (and probability, actually):

100% Does Not Exist: You will never achieve a confidence interval of 100%. You can never guarantee through inferential statistics that a predicted data point will be of a certain specified value.

The only sure things in life are death, taxes, and the inability to find the last Yeti Hide you need when trying to complete a World of Warcraft quest. Accept these facts and move on.

Misappropriation

I mentioned earlier that statistics works as a skill of villainy. To illustrate why, I wrote this short, bullet-form love poem:

Sonnet 1325: Beautiful statistics, let me count the ways that I abuse and misuse you.

1.Misunderstanding

2.Not stating confidence intervals

3.Discarding valid conclusions because you don’t like them

4.Drawing conclusions based upon flawed or influenced data

5.Sportscaster errors – blending errors of probability and statistics

6.Drawing conclusions based upon unrelated factors

Misunderstanding

People misunderstand statistical statements all the time. I know, it’s hard to believe.

Not Stating Confidence Intervals or Margins of Error

Confidence intervals and margins of error are vital pieces of information. There is a huge difference between saying 43% of PC owners have purchased a downloadable game in the past 30 days (Margin of Error 40%) and the same statement with a MoE of 2%. When MoE is left out, always assume the worst. Remember, small sample = high MoE.

Discarding Valid Conclusions Because You Don’t Like Them

When used properly, statistics don’t lie. But people lie to themselves all the time. We see this a lot in politics, where statistical studies will be ignored simply because the conclusions don’t match those that were hoped for. Same thing sometimes happens with focus groups. Of course, we also see statistics misused terribly in politics, so it’s a wash, I guess.

Drawing Conclusions Based Upon Flawed Data

This one happens a lot, especially in market research. Your statistical conclusions are only as good as the data you make them from. If the data is flawed, then the conclusions are worthless. Flawed data can come in a variety of forms, with causes ranging from honest errors to severe manipulation. Asking loaded questions is one easy way to get flawed data that supports whatever conclusion you were hoping to make anyway. “Do you prefer Product X, or that crappy Product Y that only idiots use?” quickly leads to seemingly bullet-proof statements like “95% of consumers prefer Product X!”

Sportscaster Errors

Sportscasters are the shamans of our day. They take a little statistics, a little probability, a little gut feel and then mix them together to make something terrible. If you ever want to see a bunch of statistics thrown around with tenuous conclusions that typically have no basis, just watch a football game.

For instance, an announcer might say that “Team A hasn’t blocked a kick against Team B in the last 5 games.” The dangling conclusion is that Team A is less likely to block a kick than if they had done so in the last 5 games versus Team B. But you could say the same about the reverse–maybe they are more likely since they haven’t blocked one in a while!

The truth is, there isn’t enough information to say either one. And it’s probably more a matter of probability, anyway. Does the chance of blocking a kick really depend on whether one was blocked the game before? They are probably independent events, unless there are recognizable interrelated factors.
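If blocked kicks really are independent from game to game, a drought tells you nothing. The simulation sketch below assumes a made-up 10% chance of a block in any game (the constant `BLOCK_CHANCE` is purely hypothetical) and compares the overall block rate with the rate in games that follow five straight games without one.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable
BLOCK_CHANCE = 0.1  # assumed per-game chance of a blocked kick (made up)

games = [random.random() < BLOCK_CHANCE for _ in range(200_000)]

# Look only at games that follow five straight games without a block.
after_drought = [
    games[i] for i in range(5, len(games)) if not any(games[i - 5:i])
]

overall_rate = sum(games) / len(games)
drought_rate = sum(after_drought) / len(after_drought)
print(f"block rate overall:            {overall_rate:.3f}")
print(f"block rate after a 5-game gap: {drought_rate:.3f}")
```

The two rates should come out nearly identical. Real teams may not be this independent, but the announcer would need data to show it.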

This is not to say that all sports conclusions are flawed. Statistics is very important to baseball, for instance. Statistical analysis sometimes guides what pitch is thrown or what the batting lineup will be.

It all comes down to data: when you have a lot of data, you get better statistical conclusions. Baseball supplies a lot of data: almost 200 games per season! With football, there just aren’t enough games to go around, so Margins of Error are bigger. I’m not exactly saying statistics is never useful for football…it is just harder to mine useful, contextual data.

Drawing Conclusions Based Upon Unrelated Factors

People also read too much into statistical statements. By comparing two unrelated trends, it’s easy to infer deeper relationships that don’t actually exist. My all-time favorite example of this is the well-known Pirates vs. Global Warming graph featured in the Church of the Flying Spaghetti Monster’s Open Letter to the Kansas School Board:

http://www.venganza.org/about/open-letter/

Please, for the love of all that is statistical, go look at the graph contained in that article. PLEASE, I BEG YOU!

Please, Can We Just Bookend the Quiz and Be Done?

Okay, okay, I hear you.

Q1a ANSWER – Level Times

The answer to this one is easy: you haven’t been given enough info to calculate the average yet. Just because the values ranged from 1:24 to 2:32 doesn’t mean they average out at 2 minutes. (Those two numbers average to 1.97 minutes, but we don’t know the other 18 results!) You need to know all 20 results to calc the average, and you really ought to calc the standard deviation as well…see below.
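To make the point concrete, here is a sketch with two made-up sets of 20 lap times. Both share exactly the range from the question (84 to 152 seconds), yet their means are nowhere near each other.

```python
import statistics

# Two hypothetical sets of 20 lap times in seconds; both range from 84 to 152.
mostly_fast = [84] + [90] * 18 + [152]
mostly_slow = [84] + [145] * 18 + [152]

for label, laps in (("mostly fast", mostly_fast), ("mostly slow", mostly_slow)):
    print(f"{label}: min {min(laps)}, max {max(laps)}, "
          f"mean {statistics.fmean(laps):.1f} sec")
```

Same low, same high, very different levels. The range alone tells you almost nothing about the average.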

Q1b ANSWER – Level Times Part Deux

Okay, in this case you probably shouldn’t be satisfied because the standard deviation is pretty high…roughly 36% of the mean (45 sec / 125 sec). This sounds like a bit too much variation in your level. There is potentially a sizable exploit that skilled players are using to their advantage. Alternatively, you might be punishing less-skilled players too much. As the game designer, you ultimately have to be the judge as to whether these results (high variation) are intended.

Q2 ANSWER – Soccer Moms

Stats only gets you part of the way there; you still need game design smarts. In this case, the score grouping is *way too close*…to have a standard deviation that low (500 / 52000 = 1%) means you are getting hardly any score variation, which means in turn that differences in player skill aren’t really mattering in the end game result. Therefore, players will most likely be turned off because they won’t see much of a progression in their scores as they get better at the game.

Here’s a situation where you’d really love to see a much higher standard deviation, because that hopefully shows that increased skill leads to increased scores. In other words, your current game scores the same no matter who plays it.

Q3 ANSWER – Play Times

This one is sorta tricky and underhanded but illustrates an important point about data collection: you need to watch out for obviously bad data. That one value, 0.2 hrs, looks suspiciously like an error. Could be a typo, could be an equipment malfunction, who knows. In any case, you should either convince yourself without a doubt that the 0.2 hrs is a valid data point before doing any calculations with it, or just throw it out and perform your calcs on the remaining data points.
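Here is a sketch of how much that one value matters, using the numbers from the question. The cutoff used to flag suspects (anything under half the median) is an arbitrary rule of thumb picked for illustration, not a standard test; you still have to investigate before throwing data away.

```python
import statistics

hours = [4.6, 3.9, 5.6, 0.2, 5.5, 4.4, 4.2, 5.3]

median = statistics.median(hours)
# Flag anything less than half the median as suspicious (arbitrary rule of thumb).
suspects = [h for h in hours if h < 0.5 * median]
kept = [h for h in hours if h >= 0.5 * median]

for label, data in (("all data", hours), ("without suspects", kept)):
    print(f"{label}: mean = {statistics.fmean(data):.2f} hrs, "
          f"std dev = {statistics.stdev(data):.2f} hrs")
print("suspicious values:", suspects)
```

One bad point drags the mean down and inflates the standard deviation, which is why it pays to look at the raw data first.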

Insert Other Cool Stuff Here

In efforts to keep this article under 723 pages, I have to skip over many other intriguing topics. Suffice it to say that a good understanding of statistics will help not only your game design, but your consumer decisions, voting decisions, and financial decisions. I’m 23.4% sure that at least 40% of what I just said is true.

As a designer, statistics is most useful when crunching data from a set of recorded play sessions (your sample), and trying to form conclusions about a larger field of unrecorded play sessions (your population).

Learn By Doing

For example, in the game I just finished, we recorded data from play sessions and then set challenge levels in the game based upon the mean and standard deviation values from those recorded data. We set Medium difficulty to be equal to the mean values, Easy difficulty to be equal to the mean minus a certain amount of standard deviations, and then Hard difficulty equal to the mean plus a certain amount of standard deviations. Had we collected much more data, it would’ve actually been accurate!
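That approach can be sketched in a few lines. This is not the code from the actual game; the scores, the function name `challenge_targets`, and the choice of one standard deviation (`k=1.0`) are all assumptions made for illustration.

```python
import statistics

def challenge_targets(recorded_scores, k=1.0):
    """Score targets: Medium at the mean, Easy and Hard k standard deviations away."""
    mean = statistics.fmean(recorded_scores)
    spread = statistics.stdev(recorded_scores)
    return {"Easy": mean - k * spread, "Medium": mean, "Hard": mean + k * spread}

# Hypothetical scores from recorded play sessions (made-up data).
scores = [4200, 5100, 4800, 5600, 3900, 5000, 4700, 5300]
for level, target in challenge_targets(scores).items():
    print(f"{level}: {target:.0f} points")
```

With only eight sessions the targets will wobble a lot from one playtest to the next, which is exactly the accuracy caveat above.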

Just like probability theory, statistics becomes more and more useful the bigger and bigger the scope of your project. A lot of the time, you can fumble your way through without applying any formal theory in either case. But the bigger your game, the bigger your audience, and the bigger your budgets, then the more there is to risk from embedded flaws in an unbalanced, seat-of-the-pants designed game.

Stats, like probability, won’t do your game design work for you. It’ll just help you do it better!

The Long Road Ahead

In the rousing conclusion to this series, I’ll be taking bits from parts 1 and 2 and then putting them together in ways that actually have some relevance to games. Or I’ll croak trying! (source:GAMASUTRA)

