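Using Regression Analysis Models in Game Operations

Published: 2012-08-18 15:05:36 Tags: regression analysis models, regression analysis software, dependent variable, independent variable

Author: Ted Spence

Once your game has attracted a large number of players, you can start to reap the rewards. Your challenge now is to keep that success going.

You need to find ways to reach your users and work out which kinds of players could benefit from a promotional offer. So it is time to build a regression model for your data.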

regression-analysis(from shmula.com)

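A brief introduction to regression analysis

I won't beat around the bush: there is a lot to learn about regression analysis. What you need to know is that it is a mathematical technique developed by some of the smartest mathematicians in history, including Gauss, who used it to predict the positions of planets, so this is not a simple field. In this article I will cover only some basic uses.

To begin with, most companies can easily produce ratios such as:

"23% of the people who visit our site try the game."

"5.6% of players spend money in the game."

"Most of our revenue comes from 5% of paying players."

Most of the time, this kind of simple arithmetic is enough.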

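Lesson 1: Use the simplest tool that works

Why is this the first lesson? Because complex tools are easy to get wrong. Feynman (Gamerboom note: American physicist and Nobel laureate in physics) once said: "The first principle is that you must not fool yourself, and you are the easiest person to fool." Complex tools can create complex and subtle problems that are hard to anticipate and hard to detect.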

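When do we need regression analysis?

Most people think of A/B testing, and this is indeed the best case for modeling with ratios. You run two tests, A and B. A increases sales by 5% and B increases sales by 6%, so B is better than A.

But ratios become hard to compute when you have many interrelated variables. Suppose you want to explain why players stop playing your game. You believe you can estimate when a player will leave from a few potential factors, but you are not sure which factor matters most.

For example, suppose we are looking at each player's number of logins, play time, the number of friends who recently left, the experience points they earned, and the number of achievements they earned.

Modeling all of these variables with ratios might never end! Some of the variables are discrete, but most are continuous. You would have to split them into buckets (for example, achievements: 1-5, 6-10, 11-15, ...) and then grade each bucket separately.

You would have to set a ratio for every possible permutation of the variables and compare them in one huge matrix. Ugh, there has to be a better way!

Well, this is where regression models come in.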

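How regression analysis works

I don't claim to be a math teacher, so let me describe regression analysis in a way a casual user can follow. A regression model assumes that every independent variable has some degree of influence on the target (called the "dependent variable").

First you must come up with a theory of how you think the variables work, and this is important. Without a theory your work will be blind, and your results may mean nothing at all!

If your theory does not hold up, a regression model will show you that. A regression model can also produce negative results, which can keep you from wasting a lot of time studying data that is useless or misleading.

Back to our model: we assume that each of these variables affects whether a player quits the game. Using the most common kind of regression analysis, Ordinary Least Squares (OLS), we assume we can construct a basic algebraic equation that helps us decide whether a player will leave the game. With OLS, our theory looks like this in algebra:

(player left) = x + (y1 * logins) + (y2 * play time) + (y3 * friends who left) + (y4 * experience gained) + (y5 * achievements)

This is exactly the kind of algebra problem a computer can solve right away.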

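First, we need to ask the data team to give us some information. But before that, I must remind you that it is extremely important that the data sample we get is unbiased and representative.

A common beginner's mistake is to say, "I want to know what makes players leave the game, so let's run a report on all the players who left." That is bad, because it leads to selection bias.

The way to avoid selection bias is to assume that you do not know the outcome of the study beforehand. Assuming you know nothing, ask your data team:

"Could you run a report that gives me logins, play time, friends who left, experience gained, and achievements for all players for the month of July? The report should include only players who started playing before July 1, and it should exclude players who left in July. Oh, and add one more column: a 1 for players who left during the first week of August, or a 0 for those who did not."

The reason is that this query meets the following three requirements:

1. Every player in this data set has had the same measurements applied. Every player in the study contributed a full month of data.

2. The dependent variable, "left in August", is completely separate from the independent variables.

3. Ideally, we will get a large number of rows. The more rows we get, the better the regression software can help us understand the effect of the variables.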

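Now we can get started. I will assume the report you received looks like this:

The first number is the number of logins, the second is play time (in minutes), the third is the number of friends who left, the fourth is experience gained, the fifth is achievements, and the last is 1 if the player left or 0 if the player did not.

Next, we need software for regression analysis.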

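Regression analysis software

If you or your company are flush with money, buy SPSS, Stata, or Mathematica. What, you don't know how to use them? Get your company to send you back to university!

For the rest of us, there is a very practical package called "GRETL". It is a good fit for anyone who wants to learn. You can download it and analyze the test data I just described.

First save the report in CSV format, then launch GRETL. Select File | Open Data | Import | text/CSV. Specify the data delimiter, then choose the file.

GRETL suddenly asks: "Do you want to give the data a time-series or panel interpretation?" Let's choose no for now, because time series and panels belong in another lesson and are probably not introductory material. Besides, even when my problem may be complex, I try to model it as a simple one first and move to a more complex model only if the simple one fails.

You should now see the following screen, which lists seven variables, including an automatically generated constant (basically a column of numbers).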

gretl1(from gamasutra)

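Now let's start modeling! From the main menu, select Model | Ordinary Least Squares. We now have to tell Gretl our theory. For the dependent variable, select "Cancelled_"; for the independent variables, select all of the others, then click OK.

You should see the following screen, full of text and complicated numbers. How do we make sense of it?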

gretl2(from gamasutra)

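As a beginner, you should look for two things in this table. First, the little asterisks next to each row hint at which variables are most useful: the more asterisks, the more useful.

Second, look at the line "p-value was highest for playtime". This tells you which variable in the table you should drop. Here the math is telling you that play time does not matter: we cannot tell from play time whether a player intends to leave the game.

In short, any variable with a p-value close to 1 (or with no asterisks) should be removed from the model.

Why? I don't know; this is where your theory comes in. It may be that some people hesitate over whether to leave and so log in frequently, while others leave cleanly and even forget the site. You won't know until you start doing some hands-on research in the field. This is something to take to your game designer or community manager! Eventually you may discover something interesting, for example that there are two different kinds of play time and only one of them reliably signals that a player intends to quit. For now, though, let's drop this flawed variable and move on.

To run the model again without the unhelpful variables, select Test | Omit Variables from the main menu, then choose to omit "playtime" and "experience gained" and click OK. You will see the following screen: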

gretl3(from gamasutra)

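Now you have a great model in which every variable is genuinely useful. Every variable has a really low p-value. The algebraic formula you have built is:

(probability of leaving) = 1.31132 - (0.0470642 * logins) + (0.0567763 * friends who left) - (0.0795353 * achievements)

So how do we use this formula in practice? Let's look at the formula's output as a graph. From the main menu, select Graphs | Fitted, actual plot | Actual vs Fitted. You will see a chart like this: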

gretl4(from gamasutra)

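Your model gives players who actually left a score of 0.6 or higher, and players who stayed a score of 0.4 or lower. Based on this model, you might run promotional offers or give freebies to players who score above 0.6, and if a player has a record of heavy spending, you could offer something really good to encourage them to keep playing.

Summary

That is what you can do with regression analysis. I want to encourage everyone to keep learning, but frankly, some parts of regression analysis are hard to learn and hard to teach. (Compiled by Gamerboom/gamerboom.com. Reposting without the copyright notice is not permitted; to repost, please contact Gamerboom.)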

In-depth: Business analytics with regression

by Ted Spence

So you’ve got a nice group of players interested in your game. Maybe you’ve even got a possibility of profit coming up! Now the challenge is to keep your success rolling along.

You need to identify ways to reach out to customers, and figure out which players could benefit from a promotional offering. It’s time to develop a regression model for our data!

An introduction to regression for business analytics

I won’t beat around the bush: there’s a lot to know about regression. It was a mathematical technique cooked up by some of the smartest mathematicians ever – including Gauss, who used it to forecast the locations of planets – so this is not an easy field. But for today’s article, I’m going to just get you started.

First off, most companies have no problem generating ratios. It’s very easy to say:

“23 percent of the people who visit our site launch the game.”

“5.6 percent of people who play our game purchase something.”

“The majority of our revenue comes from 5 percent of the paying players.”

Know what? In most cases, that kind of simple math is enough to get the job done.

Lesson #1 of Business Analytics: Use the simplest tool that works.

Why is this the first lesson? Because complex tools are easy to screw up. Feynman once said, “The first principle is that you must not fool yourself, and you are the easiest person to fool.” Using complex tools can create complex and subtle problems that are difficult to anticipate and detect.

When do we want regression?

Most people get up to the point where they’re doing A/B testing – frankly, this is the best case scenario for “ratio” modeling. You do two tests, A and B. A results in a 5 percent increase in sales, and B results in a 6 percent increase in sales. Therefore B is better!

But ratios become very difficult when you have lots of interdependent variables. Let’s say you are trying to figure out why players cancel your game. You’ve got a hunch that you can predict when a player will cancel by looking at a few potential factors, but you don’t know for sure which one is the most relevant.

For example, let’s say we are looking at the number of times a player logged in, the duration of play, the number of guildmates who canceled recently, the amount of EXP they gained, and the number of achievements they gained.

Modeling all those variables using ratios would take forever! Some of those variables are discrete, but most of them are continuous. You’d have to slot them into buckets (1-5 achievements, 6-10, 11-15, etc) and grade each bucket separately!

You’d have to create ratios for every possible permutation of variables and compare them in a gigantic matrix. Blech! There’s got to be a better way.
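To get a rough feel for how quickly that matrix grows, here is a small Python sketch. The bucket counts and the player total are made-up assumptions for illustration, not figures from any real game.

```python
from math import prod

# Hypothetical number of buckets chosen for each variable (assumed values).
buckets_per_variable = {
    "logins": 10,
    "playtime": 10,
    "friends_cancelled": 5,
    "exp_gained": 10,
    "achievements": 6,
}

# Every combination of buckets needs its own cancellation ratio.
cells = prod(buckets_per_variable.values())
print(f"ratio cells to estimate: {cells}")

# With a fixed player base, the average cell holds very few players.
players = 50_000  # assumed player count
print(f"average players per cell: {players / cells:.2f}")
```

With these assumed numbers there are more than half as many cells as players, so most ratios would rest on one or two observations each.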

Well, that’s where regression modelling comes in.

How does regression work?

I won’t proclaim that I’m a math teacher, so let me describe this in a way that a casual user can appreciate it. A regression model assumes that all your independent variables have some influence on the target (called the dependent variable).

But – and this is important – in order to get started, you first need to come up with a theory of how you think the variables work. Without a theory, your work will be blind and your results may not show anything at all!

This is the great part: if you come up with a theory that doesn’t work, regression modeling will help you find that out. The fact that regression models can also give negative results can prevent you from spending tons of time researching data that isn’t useful (or is misleading).

Getting back to our model: Let’s presume that each one of these variables has some influence on whether a player decides to cancel. Using the most common kind of regression, Ordinary Least Squares (OLS), we will assume that we can construct a basic algebraic equation that helps us decide if a player will cancel or not. Using OLS, our theory looks like this in algebra:

(user cancelled) = x + (y1 * logins) + (y2 * playtime) + (y3 * friends_cancelled) + (y4 * exp_gained) + (y5 * achievements)

This is exactly the kind of algebra that a computer can solve in no time.
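For readers curious what "solving" means here, the following is a minimal Python sketch of an OLS fit using only the standard library. It is not how Gretl is implemented; it simply solves the normal equations for a tiny synthetic data set in which the true intercept (the article's x) and coefficients (y1, y2) are known in advance. The function names and the data are assumptions made for illustration. Note also that fitting OLS to a 0/1 target, as the article does, is often called a linear probability model.

```python
import random

def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Return [intercept, coef_1, ...] minimizing the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the constant
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic data generated from a known formula: y = 2 + 3*a - 1*b.
random.seed(0)
rows = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(50)]
targets = [2.0 + 3.0 * a - 1.0 * b for a, b in rows]

coefs = fit_ols(rows, targets)
print([round(c, 6) for c in coefs])
```

Because the synthetic targets contain no noise, the fit should recover the intercept and both coefficients almost exactly; real player data would not be this tidy.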

First, let’s go to your data team and ask them to provide a dump of information. But before we do, I need to remind you that it is vitally important that we get an unbiased sample of data.

A common rookie mistake is to say “Gee, I want to figure out what makes people cancel, so let’s run a report on all the people who canceled.” That’s bad because it leads to selection bias.

The way to avoid selection bias is to pretend you have no knowledge of the results of the study beforehand. Pretend you live behind the veil of ignorance. Ask your data guys, “Could you run a report that gives me logins, playtime, friends canceled, exp_gained, and achievements for all players for the month of July? The report should cover only users who began playing before July 1st, and should exclude anyone who canceled in July. Oh, and add as the last column a 1 if the user canceled during the first week of August or a 0 if they did not.”

The reason this query works is that it meets a few requirements:

Every user who is involved in this data set has had the same measurements applied to them. Every user who participated in this study supplied an entire month’s worth of data.

Our dependent variable, “canceled in August”, is entirely separate from the independent variables.

Ideally, we’ll get lots and lots of rows of results. The more rows we get, the better our regression software can help us understand our variables.
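The selection rules in that request can be sketched in a few lines of Python. The player records, field names, and the year 2012 below are hypothetical, chosen only to make the example concrete; the sketch also assumes that anyone who canceled before August is excluded, which is slightly broader than "canceled in July".

```python
from datetime import date

# Hypothetical player records; field names are assumptions for illustration.
players = [
    {"id": 1, "started": date(2012, 5, 10), "cancelled": None},
    {"id": 2, "started": date(2012, 6, 20), "cancelled": date(2012, 7, 15)},
    {"id": 3, "started": date(2012, 7, 3), "cancelled": None},
    {"id": 4, "started": date(2012, 3, 1), "cancelled": date(2012, 8, 4)},
    {"id": 5, "started": date(2012, 6, 30), "cancelled": date(2012, 8, 20)},
]

JULY_START = date(2012, 7, 1)
AUGUST_START = date(2012, 8, 1)
LABEL_WINDOW_END = date(2012, 8, 7)  # last day of the first week of August

def in_sample(player):
    """Keep players who started before July 1 and were still active all July."""
    started_before_july = player["started"] < JULY_START
    left_before_august = (
        player["cancelled"] is not None and player["cancelled"] < AUGUST_START
    )
    return started_before_july and not left_before_august

def label(player):
    """1 if the player cancelled in the first week of August, else 0."""
    cancelled = player["cancelled"]
    if cancelled is not None and AUGUST_START <= cancelled <= LABEL_WINDOW_END:
        return 1
    return 0

sample = [(p["id"], label(p)) for p in players if in_sample(p)]
print(sample)
```

The point of the design is that membership in the sample depends only on what was known by the end of July, while the label depends only on what happened afterwards.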

This is good enough to get started. I’ll pretend that you got a report that looks something like this fake data I cooked up. Next, we need to get the software involved.

Regression analysis software

Let’s say you’re rich and your company overflows with money. In that case, buy SPSS, Stata, or Mathematica. Heck, get your company to send you to graduate school!

But for the rest of us, there’s a fun and useful open source package called “GRETL”, which is a great place to start for someone who wants to learn. Go download Gretl from Sourceforge, then download my test dataset, and let’s get started.

First, save your report in CSV format and launch Gretl. Select File | Open Data | Import | text/CSV. Specify the data delimiter and choose the file.
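If your report comes out of a script rather than a spreadsheet, a sketch like the following could produce the CSV. The column names follow the variable names in the formula above, with a final "cancelled" column as an assumption, and the two data rows are invented placeholders.

```python
import csv
import io

# Column order mirrors the report described in the article; names are assumed.
COLUMNS = [
    "logins", "playtime", "friends_cancelled",
    "exp_gained", "achievements", "cancelled",
]

# Invented placeholder rows, one per player.
rows = [
    [31, 1240, 0, 5200, 7, 0],
    [4, 95, 3, 300, 1, 1],
]

def write_report(handle, rows):
    """Write a header line followed by one comma-separated line per player."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("report.csv", "w", newline="") instead.
buffer = io.StringIO()
write_report(buffer, rows)
print(buffer.getvalue())
```

A header row like this lets the import step pick up variable names instead of generic column labels.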

Whoa! All of a sudden, Gretl asks a question: “Do you want to give the data a time-series or panel interpretation?” Let’s tell it no for now; time series and panels are topics for a different lesson, maybe not an introduction. Then again, even if I have a potentially complex problem, I’ll try modeling it as a simple one first and only move to a more complex model once the simple one fails.

You should now see a screen that looks like this. It lists seven variables, including an auto-generated constant (a column of ones that Gretl uses as the intercept term).

So let’s get modeling! From the top menu, choose Model | Ordinary Least Squares. We now need to tell Gretl about our theory. For the dependent variable, select “Cancelled_”; and for the independent variables choose everything else, then click OK.

You’ll probably see a screen that looks a lot like this. That’s a lot of text and complicated numbers. How can we make sense of it all?

For a beginner, there are two things you should look for in this chart. First, the little asterisks next to each row of data are a visual hint as to which variables are most useful: the more stars, the lower the p-value and the more useful the variable.

Second, look at the bottom where it says “p-value was highest for playtime.” This is a suggestion to tell you which variables should be omitted from your model. In this case, the math says that playtime just doesn’t matter – we can’t tell if a customer is going to cancel by looking at their playtime.

In general, any variable with a p-value close to 1 (or lacking stars) should probably be dropped from your model.

Why is that the case? I don’t know in advance; this is where your theory comes in handy. It may be that some people log on a lot when they’re trying to decide whether or not to cancel, whereas others just fade away and forget about the site. You won’t know until you start gathering some raw-in-the-field research.

This is something to bring to your game designer or community manager! Eventually, you might discover something fun, like perhaps there are two different kinds of playtime and only one kind is a reliable indicator of canceling. But for the moment, let’s omit this flawed variable and move on.

To re-run the model without the bad variables, select from the main menu Test | Omit Variables; then select both “playtime” and “experience gained” to be omitted. Click OK. You’ll see another screen like this:

Now you’ve got an awesome model with some really useful variables. Every one of your variables has a really low p-value. The actual algebraic formula you created is this:

(likelihood of canceling) = 1.31132 – (0.0470642 * logins) + (0.0567763 * friends_cancelled) – (0.0795353 * achievements)
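As a worked example, the formula can be wrapped in a small Python function. The coefficients are the ones quoted above; the function name and the sample player (10 logins, 2 friends who canceled, 3 achievements) are assumptions for illustration.

```python
# Coefficients copied from the fitted formula in the article.
INTERCEPT = 1.31132
COEF_LOGINS = -0.0470642
COEF_FRIENDS_CANCELLED = 0.0567763
COEF_ACHIEVEMENTS = -0.0795353

def cancel_score(logins, friends_cancelled, achievements):
    """Evaluate the fitted linear formula for one player."""
    return (
        INTERCEPT
        + COEF_LOGINS * logins
        + COEF_FRIENDS_CANCELLED * friends_cancelled
        + COEF_ACHIEVEMENTS * achievements
    )

# Hypothetical player: 10 logins, 2 friends cancelled, 3 achievements.
example = cancel_score(logins=10, friends_cancelled=2, achievements=3)
print(round(example, 4))  # 0.7156
```

Because the formula is linear, the score is not guaranteed to stay between 0 and 1 (a player with 30 logins and nothing else already scores below zero), so it is probably safer to read it as a ranking score than as a literal probability.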

So how can we make this work in practice? Let’s graph the output of our formula and see how well it works. From the main menu, select Graphs | Fitted, actual plot | Actual vs Fitted. You’ll see this chart:

Your model basically scores people who actually cancel as 0.6 and higher; and people who don’t cancel as 0.4 and lower. Based on this model, you might want to start offering promotional discounts or freebies to people who score above 0.6 – maybe giving really good incentives if the customer has spent lots of money in the past!
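One way to turn that rule of thumb into code is sketched below. The 0.6 cutoff comes from the paragraph above; the function name, the offer labels, and the spending threshold are hypothetical and would need to be tuned for a real game.

```python
def choose_offer(score, lifetime_spend, risk_threshold=0.6, big_spender=100.0):
    """Pick a retention offer from a cancellation score and past spending.

    risk_threshold follows the article's 0.6 cutoff; big_spender is an
    assumed spending level, not a figure from the article.
    """
    if score <= risk_threshold:
        return "none"  # player does not look at risk
    if lifetime_spend >= big_spender:
        return "premium"  # at-risk player with a history of spending
    return "standard"  # at-risk player, ordinary incentive

# Hypothetical players: (score, lifetime spend).
for score, spend in [(0.72, 250.0), (0.65, 5.0), (0.30, 400.0)]:
    print(score, spend, choose_offer(score, spend))
```

Keeping the cutoff as a parameter makes it easy to revisit once you see how many players land between 0.4 and 0.6.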

Next steps

There’s so much you can do with regression – I want to encourage you to learn, but frankly some parts of it can get daunting and aren’t easily taught.

The Khan Academy has some good regression videos, but most of the Wikipedia articles on regression have tons of complex math in them and are a bit impenetrable to novices. Gretl has a good manual too, so good luck and happy regressing!(source:gamasutra)
