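An Analysis of Playfish's Social Gaming Architecture and the Secrets of Its Success

Published: 2011-09-10 09:05:02 Tags: Playfish social gaming architecture, platform, servers

Ten million players a day, and more than fifty million a month, interact socially with friends through Playfish games on social platforms such as Facebook and MySpace and on the iPhone (gamerboom note: these figures date from September 2010, and all data below is as of that time). Playfish was an early innovator in social gaming, the fastest-growing segment of the game industry. It was also an early adopter of the Amazon cloud, running its system entirely on hundreds of cloud servers. Playfish sits at the intersection of several hot trends (gamerboom note: which may be why EA paid $300 million for the company, believing Playfish could produce a $1 billion game): building games on social networks, building applications in the cloud, mobile gaming, using data-driven design to continuously evolve and improve systems, agile development and deployment, and selling virtual goods as a business model.

So how can a small company make all this happen? In an interview, Jodi Moran, Senior Director of Engineering, and Martin Frost, Chief Architect and the first engineer and operations person at Playfish, explained how the company achieved it. First, the company's numbers: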

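1. 50 million monthly active users

2. 10 million daily active users

3. Hundreds of servers

4. 10 games, with more being released all the time

5. More than 200 million game downloads and installs

6. A 100:1 ratio of servers to operations staff

7. About 200 employees and 250 contractors across 4 studios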

Playfish(from kotaku.com)

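Platform

1. Flash on the client side

2. Java on the server side

3. Some PHP for tools

4. Amazon: EC2, CloudFront, S3, Hadoop/Elastic Map Reduce, Hive, some SQS, some SimpleDB

5. HAProxy for load balancing

6. MySQL for sharded BLOB storage

7. Jetty application servers

8. YAMI4 for service transport, providing point-to-point connectivity and low latency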

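Key Forces

The key forces that shape Playfish's architecture are as follows:

1. Games are free to play. Playfish games cost nothing to play, but they still cost money to run. As a result, problems cannot be solved simply by throwing hardware at them. Costs must be kept under control so the games can run efficiently. Under other game models, monthly subscription fees subsidize investment in hardware and product. MMOs (gamerboom note: massively multiplayer online games), for example, run hundreds or thousands of users per server. Playfish targets orders of magnitude more than that. MMOs offer a server-intensive game experience, but the only way they can do so is by charging subscription fees of $10 to $50 a month, which is a completely different space from free-to-play.

2. Games are social. Playfish focuses on social games, which means the company cares about how you interact with your friends. In fact, you can only play with the real friends in your social graph. The games are not really played to win. Results are ranked by points and level, but people play more to express themselves and show off their design skills. It is a different style of play. In Restaurant City, for example, the goal is not to be the richest or the best but to have the most creative and expressive restaurant. Newer games no longer even use a global leaderboard. Older games did, with everyone competing against everyone else for a higher position. Those games were built around more traditional challenges, pushing players to beat anyone with a higher score. Newer games still have a leaderboard, but it includes only you and your friends, which makes it more intimate. People do not care whether they do better than someone in another country; they care whether they do better than their friends.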

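3. Games are asynchronous. Asynchronous games are games that players can play at different times. Your friends are rarely all online at the same time as you, so social games are played asynchronously and players do not all have to be online together. In an MMO such as World of Warcraft, players are in the same space at the same time and act simultaneously, which makes the game very latency sensitive and difficult to scale. Playfish games are different: they are single-player or multiplayer games played asynchronously. Players do not have to be in the same game space at the same time; each plays in a local client. When you play with friends, you are not blocked on them and can make progress on your own. This model opens up many specialized design possibilities. Familiar games such as Scrabble and Texas Hold'em, in which players take turns, are examples of asynchronous play. More innovative are games like Playfish's Biggest Brain, in which friends keep trying to prove who is smartest; a leaderboard shows your friends' pictures and scores, so you can always see who is in the lead. It is asynchronous because each player plays independently, yet the leaderboard lets players check one another's progress.

4. Games are casual. Asynchronous games can be hardcore strategy games, such as the 1950s board game Diplomacy. Social games tend to be casual: easy to play, needing only a few mouse clicks, taking little time, and playable anywhere. Playfish extends this model with high-production-value 3D graphics, easy controls, customizable characters, and multiplayer challenges, which greatly increases user engagement.

5. Scale is limited by the size of individual social networks. In other application areas, social means hard to scale, but for Playfish social is not just a scaling problem; it is also a scaling solution. Playfish games are not global in the sense of millions of people playing the same game at the same time. Playfish games are social: people play with friends more for pleasurable interaction than for competition. As a consequence, individual games are inherently scalable because each involves so few players. Take a global leaderboard with 50 million users: that is a large volume of writes against a single data structure. A leaderboard for a social network of only a few players is much simpler to implement.

6. Large numbers of users. The sheer number of active players puts pressure on the architecture to scale as a total system. Each individual game may be relatively easy to scale, but with so many popular games, the complete system is not.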

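7. Rapid growth. The games are viral by design because they exploit the social graph. They are also fun, easy to play, and help maintain relationships between friends. All of this drives social gaming's rapid expansion, which means Playfish has to handle not just a large number of users but a continual stream of new ones.

8. Rapid change. The game market moves quickly and competition is fierce. It is also a very new field in which everyone is still finding their way. Playfish cannot wait on long procurement processes, long design cycles, or long release cycles. It must stay agile.

9. Hit-driven nature of games. It is hard to predict which games will succeed, so a game must be able to scale immediately once it takes off. When a game becomes a hit, there is no time to bring on new resources. Allocating resources to every game for its maximum projected growth would not work either, because the company has too many games.

10. Experimental game releases. A game can be seen as an experiment aimed at a particular target audience. Some experiments will be more successful than others. This goes back to the agile nature of the social game industry.

11. Multiple games. Playfish is not a single-product company. Other companies only need to support a single game. Playfish must support and develop multiple games simultaneously, with new games regularly joining the schedule. This puts great pressure on development and operations.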

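12. Smart client. Playfish game clients are written in Flash, which can cache data, support local operation, and take a great deal of load off the servers.

13. Games are applications, not web sites. Because the smart client removes many reads, most database activity is writes. Typically about 60% of database operations are writes, whereas most web sites are read heavy. Because games are delivered over the web it is easy to think of them as web sites, but they are really applications delivered over the web.

14. Load time is the dominant latency. Playfish games are global in the sense that their users are all over the world. Because games are played asynchronously in a smart client, in-game latency is not critical. Game loading therefore becomes the most noticeable latency users experience, and it must be reduced as much as possible.

15. Real-time feedback from users. Social games are distributed digitally over social networks, completely bypassing traditional distribution channels such as stores. For the first time, no middleman controls game distribution. Any user can sit at home and play whenever they like. Game makers can now connect with their users in a way that was never possible before. This direct connection with the user base helps make games better: users can give game designers real-time feedback and influence the design of the games they play.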

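16. Social games are a service, not a boxed product. Traditional games are bought in stores, so distributors control the channel. They are aimed mainly at hardcore gamers in their twenties. Social games are opening up entirely new markets, attracting people who have never played games before. A game is a service that creates a long-term relationship between players and the creative team behind it. Releases can happen weekly, the game is continually refined while it runs, and the development team keeps learning what players really want. That was not possible with traditional games and their multi-year development cycles.

17. Disconnected people are looking for ways to connect. In the virtual world of Pet Society, players decorate their own houses and their friends' houses. Around Christmas, the company began selling virtual Christmas decorations for $2 each, and the sale brought in $40 million. The company is a firm believer in behavioral analysis and asks users why they do what they do. The answer came down to connecting with people they had lost touch with. Previously they would have a Christmas tree at home, and people would see the tree and appreciate their creativity. Now that people are disconnected, only three or four friends would see the real tree at home, whereas all their friends can see the virtual tree on Facebook. In terms of reaching other people, that makes money spent in the virtual world worthwhile. What drives purchases is this desire for self-expression; virtual goods are the means, and social networks are the new territory. Virtual gifts are bought in order to interact with friends, just as in the real world.

18. Background simulation. Games such as Restaurant City are beginning to feature a background simulation component that keeps running even when the player is offline. This makes the game evolve in a way that encourages players to check back regularly. In Restaurant City, for example, players need to make sure the restaurant's employees are fed. This simulation component is an interesting new load that must be integrated into the game infrastructure.

19. The need to scale in many directions at once. Playfish needs to scale to support more users, more games, more data per user, more accesses per user, more development staff, and more operations staff. The hardest thing to scale is the people who support everything.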
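The point in item 5 about friend-scoped leaderboards can be illustrated with a minimal sketch. The data, the names `scores` and `friends`, and the function `friend_leaderboard` are all hypothetical and are not taken from Playfish's code; the sketch only shows that ranking a player against a handful of friends touches a few records rather than a single structure shared by 50 million users.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory data; a real game would load these per request.
scores: Dict[str, int] = {"ana": 1200, "ben": 950, "cho": 1430, "dev": 700, "eli": 1010}
friends: Dict[str, Set[str]] = {"ana": {"ben", "cho"}, "ben": {"ana"}}


def friend_leaderboard(player: str) -> List[Tuple[str, int]]:
    """Rank only the player and their friends, highest score first."""
    group = {player} | friends.get(player, set())
    ranked = [(name, scores[name]) for name in group if name in scores]
    # Sort by score descending, then by name so ties are deterministic.
    return sorted(ranked, key=lambda entry: (-entry[1], entry[0]))
```

Each call reads only the records of one small friend group, so the work per request does not grow with the total number of players.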

Restaurant City(from hynavian.com)

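Key Scaling Strategies

In response to these forces, Playfish has adopted a few key scaling strategies:

1. Fun. To drive growth, the key design metric is making a game so fun that it generates strong emotions: players feel they must play and must invite their friends to have even more fun. Emotions are strongest when shared with friends in a social network. Traditional games offer immersive experiences with great sound and graphics. Playfish tries to design games around social interaction, where inviting friends brings a benefit in the game. You can play Restaurant City on your own, but it is more fun to invite a friend to be a cook in your restaurant. Each friend you add makes the game incrementally more fun. Users' desire to involve friends in order to have more fun is what drives distribution and growth.

2. Micro-transaction-based revenue model. Micro-transactions are typically high-volume, low-value transactions. Playfish earns revenue when players buy in-game virtual items and services. It also makes money from ads and product placement. This strategy supports the free-to-play model while generating revenue from natural game play.

3. Cloud. Playfish has been entirely cloud based from the start, launching its first game on the EC2 beta in 2007. The cloud is an almost perfect answer to many of the forces covered in the previous section. There is no need to rehearse every benefit of the cloud here, but it is easy to see how elasticity solves many of their variable-demand problems. If a game becomes popular they can spin up more resources to handle the load, with no procurement process. If demand falls they can simply give the instances back. The API/IaaS (gamerboom note: Infrastructure as a Service) nature of the cloud satisfies their desire for agility, the need to keep teams small, and the requirement to release early and often. You might think the cloud is too expensive, but Playfish does not see it that way.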

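4. SOA (gamerboom note: Service Oriented Architecture). The company is very big on SOA. It is the organizing principle of the architecture and of how teams are structured. Each game is treated as a separate service, and games are released independently of one another. Internally, software is organized into components that offer an API, and these components are managed and scaled separately.

5. Sharding. Playfish is a write-heavy service. To deal with writes, Playfish adopted a sharded architecture, because sharding is the only real way to scale writes.

6. BLOB. Multiple records are not stored per user. Instead, each user's data is stored in a BLOB in MySQL so that it can be accessed quickly by key.

7. Asynchronicity. Writes on the server are asynchronous with respect to game play. The user does not have to wait for a write to complete on the server before continuing to play. They try to hide latency from the user as much as possible. In an MMO, every move has to be communicated to all the users, and latency is critical. That is not the case with Playfish games.

8. Smart client. By using a smart Flash client, Playfish can take advantage of the processing power of client machines. As they add users, they also add compute capacity. They have to strike the right balance between server and client, and the right mix varies by game. The smart client saves on reads and also allows the game to be played independently on the client, without talking to the servers.

9. Data-driven game improvement. Playfish collects an enormous amount of game-play data, which it uses to continually improve existing games and to help decide what to build next. It uses Amazon's Elastic Map Reduce as its analytics platform.

10. Agility. The thread running through all of Playfish's thinking is a relentless need for agility: the ability to respond quickly, easily, and efficiently to every situation. Agility shows in their choice of the cloud, their organization around services, fast release cycles, small teams, and continual improvement of game design through data mining and user feedback.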

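Basic Game Architecture

1. Games run in Flash clients.

2. Clients send requests, in the form of a service-level API, to HAProxy servers, which load balance the requests across Jetty application servers running in the Amazon cloud.

3. All back-end applications are written in Java and organized using a service oriented architecture.

4. The Jetty servers are all stateless, which simplifies deployments and upgrades and improves availability.

5. MySQL is used as the data tier. Data is sharded across MySQL servers and stored as BLOBs.

6. Playfish adopted Amazon's cloud early, so it could not take advantage of later Amazon offerings such as load balancing. If they were starting from scratch, they would probably go in that direction.

7. Changes are pushed to the server asynchronously. Rather than the user clicking a button and the action being sent to the server to find out what happened, the system is designed so that the client is empowered to decide what the action is, and the server then checks whether the action is valid. Processing happens on the client. If latency between the user and the service is high, the user does not see it, thanks to asynchronous saving and the smart client. The important thing is that the user still has a fun experience when there are glitches in the network or latency is higher.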

Who Has the Biggest Brain(from smartappli)

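8. Playfish's first few games were simple single-player games with features such as high scores and challenges, for example Who Has the Biggest Brain. They were not very complicated and put little pressure on the server side. The project took only 5 weeks from start to finish, including learning and coding to the Facebook APIs, learning and coding to AWS, coding the game servers, and setting up production infrastructure. The first three games followed the same pattern: high scores and challenges. That gave them breathing room to start building out their infrastructure. The first game to change things was Pet Society, released in 2008. It was the first game to use virtual items. Data storage went from a few attributes per user (gamerboom note: such as avatar customization and high score) to every item a user might own. This put a great deal of strain on the system. Because growth was so fast, there were some big service problems in the early days. Once they added sharding, the system became much more stable. Before that, however, they tried various other techniques. They tried adding more read replicas, but that did not work. They patched the system as best they could.

In the end they decided to shard. It took 2 weeks from start to rollout, and all of that game's performance problems went away. Over time, users acquired more and more virtual items, so the number of rows exploded. Each item was stored in its own row. Even with users split across shards, the shards holding older users kept growing although the number of users stayed the same. This was the driver for moving to BLOBs, which eliminated the performance problems caused by so many rows. Over time the games became more complicated. Pet Society had no simulation elements, but Restaurant City, launched at the end of 2008, introduced the first offline simulation element: your restaurant continues to run and earn money while you are away from the game. The simulation logic runs in the client, and the server's role is limited to checking for fraud.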
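The move from one row per item to one BLOB per user, described in item 8, can be sketched as follows. This is an illustration under stated assumptions, not Playfish's schema: the standard-library `sqlite3` module stands in for MySQL, JSON stands in for whatever serialization format was actually used, and the table and function names are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for a MySQL shard
db.execute("CREATE TABLE item_rows (user_id INTEGER, item TEXT)")
db.execute("CREATE TABLE user_blobs (user_id INTEGER PRIMARY KEY, data BLOB)")


def save_as_rows(user_id, items):
    """Original layout: one row per virtual item, so rows grow with inventory."""
    db.execute("DELETE FROM item_rows WHERE user_id = ?", (user_id,))
    db.executemany("INSERT INTO item_rows VALUES (?, ?)", [(user_id, i) for i in items])


def save_as_blob(user_id, items):
    """BLOB layout: the whole user state is serialized into a single record."""
    blob = json.dumps({"items": items}).encode("utf-8")
    db.execute("INSERT OR REPLACE INTO user_blobs VALUES (?, ?)", (user_id, blob))


def load_blob(user_id):
    """Key-value style read: one primary-key lookup returns everything."""
    row = db.execute("SELECT data FROM user_blobs WHERE user_id = ?", (user_id,)).fetchone()
    return json.loads(row[0])["items"] if row else []


def row_counts():
    rows = db.execute("SELECT COUNT(*) FROM item_rows").fetchone()[0]
    blobs = db.execute("SELECT COUNT(*) FROM user_blobs").fetchone()[0]
    return rows, blobs
```

With two users holding 1,000 items each, the row layout stores 2,000 rows while the BLOB layout stores 2 records, which is the growth pattern the text describes for long-standing players.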

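Service Oriented Architecture

1. Playfish is a strong advocate of SOA. A SOA encapsulates data and function together into components that can be deployed independently across a distributed system. The distributed components talk to one another through an API.

2. Services keep the parts of the system independent, with dependencies well known and coupling as loose as possible.

3. SOA supports complexity management. The system can be composed of separate, understandable components.

4. Components are deployed and upgraded independently, which gives flexibility and makes it easier to scale the development and operations teams.

5. Independent services can be optimized independently.

6. When a service fails, it is easier to degrade gracefully.

7. Each game is a service, with the UI and back end packaged together. They are not released separately.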

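The Cloud

1. Playfish has been entirely cloud based from the start, launching its first game on a beta version of EC2 in 2007.

2. The cloud lets Playfish innovate and try new features and new games with very low friction, which is key in a fast-moving market. Moving from a system with many read replicas to a sharded system took 2 weeks, which could not have been done without the flexibility of the cloud.

3. The cloud lets them concentrate on what makes them special rather than on building and managing servers. Because of the cloud, operations staff do not have to focus on machine maintenance and can focus on higher-value work, such as developing automation across all their servers and games.

4. Capacity is now treated as a commodity when designing applications.

5. The ratio of servers to operations staff is 100:1. Such a high ratio is possible only because of the cloud infrastructure.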

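6. Servers can fail, and this has to be planned for up front.

7. You cannot simply add memory to a server, so you need to plan to scale out from early on.

8. The key feature of the cloud is elasticity. You can scale to match your needs. If a large amount of traffic suddenly arrives, you are not caught out, and you do not have to wait for servers to be procured.

9. You never know how fast a game will take off. Sometimes you can anticipate it, as with a sports game that is likely to become popular quickly, but other games gain a large user base out of nowhere. With the cloud, that is no longer a problem.

10. At the start, you have no idea how far a game will need to scale in the cloud.

11. They do not use every Amazon service, because they had already built their own. They understand their own systems, and moving away from them could introduce unnecessary risk. ELB and RDS, for example, were not available at the time, so they built their own equivalents. Switching to those services now would make little sense.

12. Playfish is fully committed to the cloud and takes advantage of everything it offers. Getting more capacity is easy. They have no in-house servers at all; every development machine is in the cloud. The cloud makes launching a new environment trivial. Testing sharding was easy because everything could be copied into a new configuration. That would be much harder when running your own data center.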

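13. The cloud is not as expensive as it looks

(1) You get a great deal that would otherwise take a lot of work. Take high availability as an example: change an API call and you have two data centers. The cloud looks expensive, but in practice it is not.

(2) The main cost to consider is opportunity cost, and here the cloud has a big advantage. When they first sharded Pet Society, for example, it took only 2 weeks from start to rollout, and users saw the benefit almost immediately. Had it taken two months, a lot of users would have been unhappy.

14. Playfish runs in multiple availability zones within the same region. The servers are relatively close together, which reduces latency. They are not spread around the world as MMO systems are. Asynchronous writes take latency handling a step further: a game action may take 3 seconds to execute on the back-end servers, but because it runs asynchronously the user does not notice.

15. CloudFront is used to reduce load-time latency. Playfish is global in the sense that its users are all over the world. Game load time, including the Flash code and game assets, is the most noticeable latency, and CloudFront reduces it.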

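Database Systems

1. MySQL is used as a sharded key-value database storing BLOBs.

2. Users are sharded across multiple database clusters, each with its own master and read replicas.

(1) Adding more replicas helps very little because the workload is write heavy. Almost all the traffic is writes, and writes are harder to scale. More caching and more read replicas do not help.

(2) In an earlier architecture they used 1 master and 12 read replicas, but it did not work well.

(3) After sharding they moved to 2 masters and 2 read replicas, which helped both reads and writes.

3. Sharding means smaller indexes, which fit in memory more easily. Keeping the index in memory means a user lookup does not have to read from disk; it is served from RAM.

4. With sharding they can control the number of users per shard and so make sure the index never outgrows memory.

5. They have also worked to reduce the size of each user record so that more users fit in memory. This is another reason they moved to storing BLOBs rather than data rows. Each user now has a single database record.

6. Work has shifted from the database servers to the application servers, which are very easy to scale horizontally in the cloud.

7. The read-caching techniques most web sites use to scale do not work for Playfish.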

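8. Sharding is used to gain write capacity

(1) Writes make up 60% of the workload. They still use MySQL as the data store because they are happy with how it performs.

(2) Each shard has 1 master and at least 1 read replica. Most shards have only 1 read replica, but this depends on the service's access pattern. Reads are split across the read replicas. Some services are more read heavy and so have more read replicas. Read replicas are also used to keep a remote copy of the data as a backup.

9. Playfish is driven by pure necessity. They built their own key-value store because they had to. Why not use NoSQL? They are considering it, but in the meantime they have a solution that they know works. Their interest in NoSQL products lies mainly on the operations side, in managing multiple databases.

10. At scale you have to shard, and at that point many SQL features go away. You have to do a lot of the work yourself. You cannot, for example, simply add an index once the data is sharded.

11. Moving to NoSQL means giving up flexibility in access patterns. Relational databases are attractive because you can access the data in any way you like.

12. Backups go to S3.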
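The list above describes a sharded key-value store with one master and at least one read replica per shard. A minimal sketch of that routing idea follows. Everything in it is an assumption made for illustration: plain dictionaries stand in for MySQL servers, replication is applied instantly, JSON stands in for the BLOB format, and modulo-of-CRC32 routing is one simple choice rather than the scheme Playfish used.

```python
import json
import zlib
from typing import Dict, List, Optional


class Shard:
    """One shard: a master that takes writes and a replica that serves reads."""

    def __init__(self) -> None:
        self.master: Dict[str, bytes] = {}
        self.replica: Dict[str, bytes] = {}

    def write(self, key: str, blob: bytes) -> None:
        self.master[key] = blob
        # Replication is applied immediately here; real replication lags.
        self.replica[key] = blob

    def read(self, key: str) -> Optional[bytes]:
        return self.replica.get(key)


class ShardedStore:
    def __init__(self, shard_count: int) -> None:
        self.shards: List[Shard] = [Shard() for _ in range(shard_count)]

    def shard_for(self, user_id: str) -> int:
        # crc32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(user_id.encode("utf-8")) % len(self.shards)

    def put(self, user_id: str, state: dict) -> None:
        blob = json.dumps(state).encode("utf-8")
        self.shards[self.shard_for(user_id)].write(user_id, blob)

    def get(self, user_id: str) -> Optional[dict]:
        blob = self.shards[self.shard_for(user_id)].read(user_id)
        return json.loads(blob) if blob is not None else None
```

Because every user maps to exactly one shard, each master sees only its share of the writes, and each shard's index covers only its own users. Modulo routing makes adding shards later awkward, since most keys would move; a lookup table or consistent hashing would be alternatives.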

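Flash: Client

1. Client CPU and resources scale with the number of users, so it makes sense to use the client as much as possible. Push as much processing as you can onto the client.

2. Changes are written to the server asynchronously, so the user does not feel network latency.

3. Changes are checked on the server side to detect cheating.

4. Flash uses a service-level API to talk to the Java application servers.

5. Keeping processing close to the user gives the user a better experience.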
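The client-side pattern above (apply the change locally, send it later, let the server validate) can be sketched in a few lines. This is an illustration only: the real client was written in Flash, and the `PRICES` catalogue, the `Client` and `Server` classes, and the single `buy` action are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

PRICES = {"sofa": 50, "lamp": 20}  # hypothetical item catalogue


class Server:
    def __init__(self, coins: int) -> None:
        self.coins = coins
        self.items: List[str] = []

    def apply_batch(self, actions: List[Tuple[str, str]]) -> List[bool]:
        """Validate each queued action; reject anything the rules forbid."""
        results = []
        for kind, item in actions:
            ok = kind == "buy" and item in PRICES and PRICES[item] <= self.coins
            if ok:
                self.coins -= PRICES[item]
                self.items.append(item)
            results.append(ok)
        return results


class Client:
    def __init__(self, coins: int) -> None:
        self.coins = coins
        self.items: List[str] = []
        self.pending: Deque[Tuple[str, str]] = deque()

    def buy(self, item: str) -> None:
        # Update local state immediately so the player never waits on the network.
        self.coins -= PRICES[item]
        self.items.append(item)
        self.pending.append(("buy", item))

    def flush(self, server: Server) -> List[bool]:
        # In a real client this would run in the background, off the game loop.
        batch = list(self.pending)
        self.pending.clear()
        return server.apply_batch(batch)
```

A client that reports purchases the server-side balance cannot cover gets `False` back for those actions, which is where cheat handling would hook in.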

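YAMI4: Messaging

1. After a long evaluation process, Playfish chose YAMI4 as its messaging system. It provides point-to-point connectivity, low latency, no single point of failure, asynchronous messaging for event-driven processing, and parallel use of multiple back-end services.

2. Once a service has been learned through a discovery phase, messages travel directly between services. This is a brokerless model: messages are not sent to a central service and then redistributed. Messaging in this model is very efficient, and latency is lower because there is no intermediate hop. No separate component is needed to handle the traffic. This approach reduces both points of failure and latency.

3. They considered Thrift, but YAMI4 won because it operates asynchronously. Thrift uses an RPC model, which looks like a local function call and makes it harder to handle problems such as errors and timeouts. YAMI4 is a messaging model, not an RPC model.

4. YAMI4 does not handle service discovery. They built their own discovery layer, after which services connect to one another directly.

5. As a messaging system, it does not invoke object methods through messages, and no objects travel over the network. Objects live inside services. Each service is responsible for triggering, delivering, and dispatching its own operations.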
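The brokerless pattern described above, discover a peer once and then send messages to it directly without waiting for a reply, can be sketched with the standard-library `asyncio` module. This sketch does not use or imitate the YAMI4 API; the `Registry` and `Sender` classes and the `scores` service name are hypothetical, and in-process queues stand in for network connections.

```python
import asyncio
from typing import Dict


class Registry:
    """Hypothetical discovery service: consulted once per destination."""

    def __init__(self) -> None:
        self.inboxes: Dict[str, asyncio.Queue] = {}
        self.lookups = 0

    def register(self, name: str) -> asyncio.Queue:
        self.inboxes[name] = asyncio.Queue()
        return self.inboxes[name]

    def lookup(self, name: str) -> asyncio.Queue:
        self.lookups += 1
        return self.inboxes[name]


class Sender:
    def __init__(self, registry: Registry) -> None:
        self.registry = registry
        self.known: Dict[str, asyncio.Queue] = {}

    async def send(self, dest: str, message: dict) -> None:
        # Discover the peer on first use, then talk to it directly.
        if dest not in self.known:
            self.known[dest] = self.registry.lookup(dest)
        await self.known[dest].put(message)  # returns without waiting for a reply


async def demo() -> tuple:
    registry = Registry()
    inbox = registry.register("scores")
    sender = Sender(registry)
    for points in (10, 20, 30):
        await sender.send("scores", {"op": "add", "points": points})
    total = 0
    while not inbox.empty():
        total += (await inbox.get())["points"]
    return total, registry.lookups
```

Running `asyncio.run(demo())` sends three messages while consulting the registry only once, so the registry is not on the message path and is not a per-message point of failure.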

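Handling Multiple Social Networks

1. One of their main challenges is supporting many different types of games.

2. The principle of loose coupling is applied heavily:

(1) Teams own their services, so team structure matches the service architecture.

(2) Services maintain well-defined interfaces, so each team can iterate on and deploy its own service without affecting other teams.

(3) When an interface does need to change, they try to maintain backward compatibility so that other teams do not have to change everything.

3. To make this easier, all services follow a common set of standards:

(1) A common service transport (YAMI4)

(2) Common operational standards, such as how services are configured and how they provide monitoring information

4. Combining common standards with loose coupling keeps both development and operations teams agile and efficient.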
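Item 2(3), keeping interfaces backward compatible, is easiest to see with a small message handler. The handler name `handle_gift_v2` and the message fields are hypothetical; the sketch shows one common way to evolve a message format, by giving new fields defaults and ignoring unknown ones, and makes no claim about how Playfish's services actually did it.

```python
from typing import Any, Dict


def handle_gift_v2(message: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical service handler that accepts both old and new callers."""
    sender = message["sender"]  # required since version 1
    item = message["item"]      # required since version 1
    # Added in version 2: optional, with a default that matches v1 behaviour.
    quantity = int(message.get("quantity", 1))
    # Unknown fields are ignored, so newer callers do not break this service.
    return {"sender": sender, "item": item, "quantity": quantity}
```

A team that still sends version 1 messages keeps working unchanged, and a team that has already moved to a later format is not rejected for sending extra fields.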

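Development and Operations

1. Each service is independent of the others.

2. Resources are separated by service. A problem in one service does not affect the others.

3. Teams are divided along service lines, though there is some overlap.

4. The operations team works closely with the development team but is separate from it, and there is little handoff between the two. Neither team simply finishes its own work and passes it down the line; operations is also involved in design, for example.

5. Developers cannot deploy code. They check code in, and operations deploys it.

6. Each team has considerable flexibility in how it deploys. Many games are on a weekly release cycle. This means everything in the game is iterative: the code is kept in good enough shape that the game can be updated at any time without waiting on a feature change. A weekly release cycle is very valuable for games with virtual goods, because users want to see what is new each week. Other teams, meanwhile, can work on longer-term features. It depends on the situation, and this flexibility is one of the advantages of SOA: when every team can work independently, there is no need for a single fixed release cycle.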

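Java: Server

1. Java supports reusable components.

2. A large number of open-source libraries are available.

3. Java is flexible: it can be used to implement web applications, request processing, batch processing, and event-driven systems.

4. Java offers many tuning options, including a range of garbage collection choices and performance optimizations.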

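Game Design

1. Playfish combines data-driven design with traditional good game design to attract more players. Feedback data tells designers what players like and what they try most often, which helps both in designing new games and in improving existing ones.

2. Playfish values behavioral analysis. They constantly observe what players do in the game and ask them about it. Both the client and the server are heavily instrumented to observe user behavior. The observations are then processed into useful information about what players are doing. Playfish processes this data with EMR/Hadoop/Hive, which offers a SQL-like interface. EMR is a very good tool in the cloud. Large volumes of data are stored in granular form, and because everything is linked together it is easy to find the data you want, which also helps when processing game data.

3. Asking users what they want often produces surprising answers. Users do not always know what they want, and what they ask for often differs from what they actually use. Sometimes an expensive feature was added at users' request and then went unused, while features users never mentioned turned out to be the most popular. The people who post most on forums are often not the ones who look at much of the content, which produces confusing and contradictory signals: some players genuinely love a game yet dislike any change to it. Data helps reveal what players actually like.

4. Game design cannot rely on data alone, or the game loses its soul. Data has to be kept in balance. You cannot simply do whatever the data suggests. Rely first on intuition to decide whether to add or change something in the game, then use data to refine it.

5. Sometimes a great deal of effort goes into a feature, and players then complain that it is too hard to complete. Playfish has instrumentation that shows which features players abandon before completing a task, so they can investigate why and decide whether to improve the feature or remove it entirely. Stickiness is an important measure of whether users like a game. It shows whether players come back, and from that how the game is doing.

6. Agility + data + design = quickly trying out new game features to see whether players like them, and so making more fun games.

7. The development organization is divided into small teams so that each can exercise its creativity freely. Every employee is free to suggest ways to make the games more fun.

8. Release a game into a niche market (gamerboom note: a market segment defined around a company's strengths) and observe the response. Some games succeed this way, and some fail.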
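The drop-off analysis in item 5 amounts to a funnel computed from instrumentation events. The sketch below is a small in-memory version, not the EMR/Hive pipeline the text mentions; the `Event` record shape, the step names, and the use of DAU divided by MAU as a stickiness measure are assumptions (the last is a common industry convention, not something the interview defines).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (user_id, step): hypothetical instrumentation record


def funnel(events: Iterable[Event], steps: List[str]) -> List[Tuple[str, int, float]]:
    """For each step, count distinct users who reached it and the share lost since the previous step."""
    reached: Dict[str, Set[str]] = defaultdict(set)
    for user, step in events:
        reached[step].add(user)
    report = []
    previous = None
    for step in steps:
        count = len(reached[step])
        # No drop is reported for the first step or after an empty step.
        drop = 0.0 if not previous else 1 - count / previous
        report.append((step, count, round(drop, 2)))
        previous = count
    return report


def stickiness(daily_active: int, monthly_active: int) -> float:
    """DAU / MAU, a rough measure of how often players come back."""
    return daily_active / monthly_active
```

A step with an unusually high drop marks a feature players abandon. Applied to the figures quoted at the top of the article, `stickiness(10_000_000, 50_000_000)` gives 0.2.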

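Lessons Learned

1. Build an architecture around the specific properties of your application. Playfish built a custom architecture to fit its own needs. Do not aim for a fully general architecture; exploit the properties of your problem space to build something that fits it.

2. It is fine to get things wrong at first. Be ready to throw things away and learn from them. Had they tried to build everything up front three years ago, they would still be building a system and would have made no progress. They threw some things away, but what they learned by going live within 5 weeks and observing was the key to their success.

3. Do not hesitate to stick with what you know and trust. Choose what you are familiar with; there is no need to chase the newest or most fashionable technology. Your experience tells you what works and what is valuable, and it saves a lot of unnecessary trouble.

4. Start simple, then scale over time. Use time in production to build what you actually need.

5. Sharding and BLOBs. Use sharding to scale writes. Use BLOBs to cut the number of records per user. Both speed up access to objects in RAM and allow more objects to fit in RAM.

6. Scaling brings some hard problems. Do not design features in too early, and do not create dependencies that will break. It does not matter, for example, if a feature was not built to scale up front; what matters is not discovering after you have sharded that the feature cannot work.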

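7. Scaling data is harder than scaling processing. Where will the data live? How many readers and writers will it have? Stateless application servers make processing easy to scale, but data is much harder.

8. Do not build anything too complicated. Simple systems really are much easier to scale. Keep the design as simple as possible for as long as possible, so you can see where the weaknesses are and work out fixes. Many people believe they should build something complex from day one. If your game does not need multiple layers of caching, for example, do not add them, or you will have to maintain those layers as well.

9. Use SOA to manage complexity. Split the code into separate components, each maintained by a different team, so that the complexity of the whole system stays manageable and the whole is easier to scale.

10. Take calculated risks. Playfish went with the cloud but kept a fallback plan in case it failed. It was a risk, but one with many benefits. If Amazon changed its service in some unacceptable way, they could move elsewhere. That is the advantage of Infrastructure-as-a-Service. Platform-as-a-Service carries more risk because you build deeper dependencies on it. In the same way, Playfish weighed keeping its own key-value database against the risk of adopting a NoSQL product.

11. Operations and development staff should both understand how the system behaves in production. Is it easy to manage? Does it support the user experience? How is it configured and monitored? Address these questions as fully as possible. Note also that developers do not have the right to deploy game code.

12. Use data to find out what users really like. Users often cannot tell what they actually want. Instrument your code, analyze the data for meaningful patterns, and use what you learn to improve the game.

13. Scaling people is the hardest thing of all. Look for ways to make deployment and operations work simpler. Adding servers is easy by comparison.

gamerboom note: This article was published on September 21, 2010, and all times, events, and figures are as of that date. (Compiled by gamerboom/gamerboom.com; to republish, please contact gamerboom.)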

Playfish’s Social Gaming Architecture – 50 Million Monthly Users And Growing

Ten million players a day and over fifty million players a month interact socially with friends using Playfish games on social platforms like Facebook, MySpace, and the iPhone. Playfish was an early innovator in the fastest growing segment of the game industry: social gaming, which is the love child between casual gaming and social networking. Playfish was also an early adopter of the Amazon cloud, running their system entirely on 100s of cloud servers. Playfish finds itself at the nexus of some hot trends (which may be why EA bought them for $300 million and they think a $1 billion game is possible): building games on social networks, building applications in the cloud, mobile gaming, leveraging data driven design to continuously evolve and improve systems, agile development and deployment, and selling virtual goods as a business model.

How can a small company make all this happen? To explain the magic I interviewed Playfish’s Jodi Moran, Senior Director of Engineering, and Martin Frost, Chief Architect, first Engineer and Operations guy at Playfish. Lots of good stuff, so let’s move on to the nitty gritty.

Stats

1. 50 Million Monthly Active Users

2. 10 Million Daily Active Users

3. 100s of Server Machines

4. 10 Games, more being released all the time

5. 200 million games have been downloaded, installed, and played

6. 100:1 ratio of servers to operations people

7. About 200 employees and 250 contractors running out of 4 studios

Platform

1. Flash on the Client Side

2. Java on the Server Side

3. Some PHP for tools

4. Amazon: EC2, CloudFront, S3, Hadoop/Elastic Map Reduce, Hive, some SQS, some SimpleDB

5. HAProxy for load balancing

6. MySQL for sharded, blob storage

7. Jetty Application Servers

8. YAMI4 for service transport. Gives point-to-point connectivity, low latency.

Key Forces

What are the key forces that shape Playfish’s architecture?

1. Games are free-to-play (F2P). Playfish games are free-to-play: they don’t cost money to play, but they still cost money to run. The consequence of this model is that it’s not possible to simply throw hardware at problems. Costs must be controlled, so they try to run lean and efficient. With other game models monthly subscriptions subsidize hardware and product investment. MMOs (massively multiplayer online games), for example, run hundreds or thousands of users per server. Playfish targets orders of magnitude above that. MMOs provide a server intensive game experience, but the only way they can do that is by charging $10-$50 a month subscription fees, which is a completely different space.

2. Games are social. Playfish focuses on social games, which means it’s about you engaging and interacting with your friends. In fact, you can only play with your real friends, your social graph.

Games are not really played to win. Results are ordered by points and level, but people play more to express themselves and show off their design skills. It’s a different type of gaming. In Restaurant City, for example, the goal is not to be the richest or the best, but to have the most creative and expressive restaurant. Newer games don’t even use global leader boards anymore. Older games used a global leader board where everyone competed against everyone else. Games were structured more traditionally with challenges and a “look my score it’s bigger than yours” attitude. Now they have a leader board, but it’s just between you and your friends. It’s more intimate now. People don’t care about if they are better than someone in a different country, they care if they are better than their friends.

3. Games are asynchronous. Asynchronous games are games where players can play at different times. Since your friends are rarely all on-line at the same time, social games are played asynchronously. You don’t all have to play at the same time. In an MMO like World of Warcraft, for example, players exist in the same space at the same time and they can act simultaneously, which makes them very latency sensitive and difficult to scale. Playfish’s games are different: they are single or multiplayer games, played asynchronously. Players don’t exist in the same game space at the same time, they play in their local client. While you play with your friends, you are not blocked on your friends, you can make progress on your own. This model leads to all kinds of specialized design possibilities. Examples of asynchronous game play are familiar games like Scrabble or Texas Hold’em, where players take turns. More innovative are games like Playfish’s Biggest Brain, a game where friends keep trying to prove who is smarter. It uses a leader board to display pictures of your friends and their scores, so you can always see who is in the lead. It’s asynchronous because each player plays independently, yet the leader board allows all the players to check each other’s progress.

4. Games are casual. Asynchronous games can be hardcore strategy games, something like the Diplomacy board game from the 1950s. Social games tend to be casual games that are really easy to play, can be played with a few mouse clicks, in short bursts of time, from any location. Playfish extends this model with high production value 3D graphics, easy controls, customized characters, and multiplayer challenges, which leads to very high user engagement.

5. Scale is limited by the size of individual social networks. Contrary to other application areas where social means hard to scale, for Playfish social isn’t just a scaling problem, it’s also a scaling solution. Playfish games are not global in the sense that a million people don’t try to play the same game at the same time. Playfish games are social. They are played with your friends more for pleasurable interaction than down and dirty competition. The consequence of this is that individual games are inherently scalable because they involve so few players. Take, for example, a global leader board with 50 million users. That’s a lot of writes on a single data structure and that may require sharding to make it scale. A leader board for a social network of a few players is much more straightforward to implement.

6. Large numbers of users. The sheer number of users actively playing games puts pressure on their architecture to scale as a total system. Each individual game may be relatively easy to scale, but with so many active games, the complete system is not so easy to scale.

7. Rapid growth. By design these games are viral because they exploit the social graph. They are also fun, easy to play, and help maintain relationships between friends. All these factors contribute to social gaming’s rapid expansion, which means Playfish don’t have to just deal with a lot of users, they have to deal with a continual stream of new users.

8. Rapid change. The game market moves rapidly and there’s lots of competition. This also is a very new field, everyone is still learning what works and doesn’t work, there’s a lot of churn.

Playfish can’t wait on long procurement processes, long design cycles, or long release cycles. They must stay agile.

9. Hit driven nature of games. It’s difficult to predict which games will be successful, so when a game does take off it must be able to expand immediately. There’s no time to bring on new resources when a game hits it big. To allocate resources for the maximum projected growth for each of their games simply wouldn’t work. They have too many games, which would require an absurd amount of servers that would most likely go unused.

10. Experimental game release. Games can be seen as experiments targeted at a niche. Some experiments will be more successful than others. This gets back to the very agile nature of the social game industry.

11. Multiple games. Playfish is not a single product company. Other companies support a single game. Playfish must support and develop multiple games simultaneously, with new games always pushing on the schedule. This puts a lot of pressure on development and operations.

12. Smart client. Playfish games are written in Flash, which is a very capable rich client. It can intelligently cache data, support local operation, and remove a lot of load from the servers.

13. Games are applications, not web sites. Since the smart client removes many of the reads, most of the database activity is write heavy. 60% of the load on a typical database master is writes, whereas most web sites tend to be read heavy. Since games are played over the web it’s easy to think of them as web sites, but they are really applications delivered over the web.

14. Heavy load time latency. Playfish games are global in the sense that they have users all over the world. Because the game is played asynchronously in a smart client, game latency is not critical. That makes game loading (the Flash game plus its assets) the most noticeable latency users experience, and they must reduce it as much as possible.

15. Real-time feedback from users. Social games are digitally distributed over social networks via the web, completely bypassing the traditional distribution channels through stores or telcos. For the first time there’s no middleman controlling game distribution. Any user can just sit back at home and play when and with whom they want. Game makers can now connect with their audience in a way that was never possible before. It’s possible to connect directly with a user base to help evolve and make games better. Users can give game designers real-time feedback and affect the design of games they play.

16. Social games are a service, not a box. Traditional games are bought in a box, on a shelf, and distributors control the channel. They sell into the same 25 year old hard core gamer demographic.

Social games are expanding out into new markets, attracting people who have never played games before. Games that are a service create a very long relationship between players and the creative team behind the game. It’s possible to do weekly releases, nurture the game along, and continually learn what players want. This wasn’t possible before when you had multi-year development cycles.

17. People are disconnected and are seeking ways to connect. Pet Society is a virtual world where players decorate their house and their friends’ houses. Around Christmas time they started selling virtual Christmas trees and ornaments for $2 each. They sold $40 million worth of virtual currency. Big believers in behavioral analysis, they asked their users why they would do that. It came down to connecting with people they were disconnected from. Previously they would have a tree in their home and people would see the tree and their creativity. Now people are all disconnected, so 3-4 friends would see the real Christmas tree whereas all their friends would see the virtual Christmas tree on Facebook. There’s a bigger bang for your virtual buck in terms of reaching people.

What drives purchases is the desire to socially express themselves, virtual goods are the means, and social networks are the new territory. Virtual gifts are bought for the perceived value in the interaction with your friends, just like in the real world.

18. Background simulation. Games like Restaurant City are starting to feature a background simulation component that runs even when the user is not online. This causes the game to evolve in a way that encourages players to check back regularly. In Restaurant City, for example, players need to make sure restaurant employees are being fed. This simulation component is an interesting new load that must be integrated into the game infrastructure.

19. Need to scale in several different dimensions at once. Playfish needs to scale to support more users, more games, more data per user, more accesses per user, more development staff, and more operations staff. The hardest thing to scale is scaling the people that support everything.

Key Scaling Strategies

In response to these forces, Playfish has followed a few key scaling strategies:

1. Fun. To drive growth the key design metric is to create a game so fun that it generates such strong emotions that people feel they simply must play and must invite their friends to join to have even more fun. Emotions are maximized when shared within a social network of friends. Traditional games are immersive experiences with great sound and graphics. Playfish tries to design games for social interaction, where you get a benefit to the game by inviting friends. You could play Restaurant City on your own, but it’s more fun to invite your friends to be a cook in your restaurant.

You get incrementally more fun by adding friends into a game. The desire for users to involve friends in order to have more fun is what drives greater distribution and growth.

2. Micro-transaction based revenue model. Micro-transactions are typically high volume, low-value transactions. To pay for the service Playfish makes money by players purchasing in-game virtual items and services. Playfish also makes money from ads and product placements. This strategy supports the free-to-play model while providing revenue based on natural game play.

3. Cloud. Playfish has been 100% cloud based from the very start, launching their first game as a beta on EC2 in 2007. The cloud is almost the perfect answer to a large number of forces we covered in the previous section. I won’t bore you by talking about all the wonders of the cloud, but it’s easy to see how elasticity helps solve many of their variable demand problems. If a game becomes popular they can spin up more resources to handle the load. No procurement processes necessary. And if demand falls they can simply give the instances back. The API/IaaS (Infrastructure as a Service) nature of the cloud easily satisfies their desire for agility, the need to keep teams small, and requirement to release early and often. You may think with pay-to-play model the cloud is too expensive, we’ll talk more about this in the cloud section, but they don’t see it that way. They are far more concerned about the opportunity cost of not being able to develop new games and improve existing games in a rapidly evolving market.

4. SOA (Service Oriented Architecture). They are very big on SOA. It’s the organizing principle of their architecture and how they structure their teams. Each game is considered a separate service and they are released independently of each other. Internally, software is organized into components that offer an API, and these components are separately managed and scalable.

5. Shard. Playfish is a write heavy service. To deal with writes Playfish went to a sharded architecture because it’s the only real way to scale writes.

6. BLOB. Multiple records are not stored for a user. Instead, the user data is stored in a BLOB (Binary Large Object) in MySQL, so it can be quickly accessed via a key-value approach.

7. Asynchronicity. Writing on the server is asynchronous from game play. The user does not have to wait for writes to complete on the server to continue playing the game. They try to hide latency from the user as much as possible. In a MMO each move has to be communicated to all the users and latency is key. Not so with Playfish games.

8. Smart Client. By using a smart Flash client Playfish is able to take advantage of the higher processing power on client machines. As they add more users they also add more compute capacity. They have to get the right balance of what is on the server and client side. The appropriate mix varies by game. The smart client caches to save on reads, but it also allows the game to be played independently on the client, without talking to the servers.

9. Data driven game improvement. Playfish collects an enormous amount of data on game play that they use to continually improve existing games and help decide what games to invent next. They are using Amazon’s Elastic Map Reduce as their analytics platform.

10. Agility. A common thread through all of Playfish’s thinking is the relentless need for agility, to be able to respond quickly, easily, and efficiently to every situation. Agility is revealed in their choice of the cloud, organizing around services, fast release cycles, keeping teams small and empowered, and continually improving game design through data mining and customer feedback.
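The claim in item 5 that sharding is the only real way to scale writes follows from simple arithmetic, sketched below. The function `per_node_load` and its simplifying assumptions (even spread across shards, reads served only by replicas when any exist, every replica re-applying every write on its shard) are illustrative, and the 10,000 operations per second used in the usage note is a made-up figure; only the 60% write share comes from the article.

```python
def per_node_load(reads: float, writes: float, shards: int, replicas_per_shard: int) -> float:
    """Operations per second handled by the busiest node in one shard.

    Simplifying assumptions: load is spread evenly over shards, reads go only to
    replicas when any exist, and every replica re-applies every write on its shard.
    """
    shard_writes = writes / shards
    shard_reads = reads / shards
    if replicas_per_shard == 0:
        return shard_writes + shard_reads
    replica_load = shard_writes + shard_reads / replicas_per_shard
    return max(shard_writes, replica_load)
```

Take a hypothetical 10,000 operations per second at 60% writes, so 6,000 writes and 4,000 reads. With 1 master and 12 replicas, each replica still handles about 6,333 operations per second, because replicas divide only the reads. With 2 shards of 1 master and 1 replica each, the figure drops to 5,000, and it keeps falling as shards are added, whereas adding replicas can never push it below the 6,000 writes.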

Basic Game Architecture

1. Games run in Flash clients.

2. The clients send requests, in the form of a service level API, to HAProxy servers which load balance requests across Jetty application servers that run in the Amazon cloud.

3. All back-end applications are written in Java and are organized using a services oriented architecture.

4. The Jetty servers are all stateless, which simplifies deployments, upgrades, and improves availability.

5. MySQL is used as the data tier. Data is sharded across MySQL servers and stored in a BLOB format.

6. Playfish was an early adopter of Amazon’s cloud so they were unable to make use of later Amazon developments like load balancing. If they had to start from scratch they would probably go that direction.

7. Changes are pushed to the server asynchronously. Rather than a user clicking a button and that action sent to the server to see what happened, the system is designed so the client is empowered to decide what the action is and the server checks if the action is valid. Processing is on the client side. If there is a high latency between the user and the service, then the user won’t see it because of asynchronous saving and the smart client. At the end of the day it’s what the user sees that matters. The important thing is that when there are glitches in the network or higher latency, that the user has a fun game experience.

8. Playfish’s first few games were simple single-player games, like Who Has the Biggest Brain, with features like high scores and challenges. They were not very complicated and not very heavy on the server side. They only had 5 weeks from start to finish on the project. That included learning and coding to the Facebook APIs, learning and coding to AWS, coding game servers, and setting up production infrastructure. The first three games continued that pattern: high scores and challenges. That gave them some breathing room to start building out their infrastructure.

The first game that changed things was Pet Society in 2008. It was the first game that made significant use of virtual items. Data storage went from storing a few attributes per user, like avatar customization and high score, to storing potentially thousands of items per user for all the virtual items. This put a lot of strain on the system. There were some big service problems in the early days as they grew very, very fast. Then they put in sharding and the system became much more stable. But first they tried various other techniques. They tried adding more read replicas, 12 at one point, but that didn’t really work. They tried to patch the system as best they could. Then they bit the bullet and put in sharding. It took 2 weeks from start to rollout. All of the performance problems went away on that game.

Over time users acquired lots and lots of virtual items, so the volume of rows exploded. Each item was stored in its own row. Even after splitting users into shards, they found that the shards for older users would keep growing and growing, even if the number of users stayed the same. This was one of the original drivers for going to BLOBs. BLOBs got rid of the multiple rows that caused such performance problems.

Over time games started getting more complicated. Pet Society didn’t have any simulation elements. Then they launched Restaurant City at the end of 2008, which had the first offline simulation element. Your restaurant continues to run and earn money while you are away. This introduced challenges, but adding the extra processing in the cloud was relatively straightforward. The simulation logic was implemented in the client and the server would do things like check for fraud.
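The offline simulation split described above, where the client computes the result and the server only checks that it is plausible, could be sketched roughly as follows. This is a minimal Python illustration, not Playfish's code: the earning rate, the 24-hour cap, and the function names are all assumptions made for the example.

```python
# Hypothetical sketch: client simulates offline earnings, server checks for fraud.
MAX_COINS_PER_HOUR = 120      # assumed upper bound on restaurant income
MAX_OFFLINE_HOURS = 24        # assumed cap on how long earnings accumulate


def client_simulate(coins_per_hour, seconds_away):
    """Run the simulation on the client: earnings for the time spent away."""
    hours = min(seconds_away / 3600.0, MAX_OFFLINE_HOURS)
    return int(coins_per_hour * hours)


def server_validate(claimed_coins, last_seen, now):
    """The server does not re-run the simulation; it only rejects impossible claims."""
    hours = min((now - last_seen) / 3600.0, MAX_OFFLINE_HOURS)
    upper_bound = int(MAX_COINS_PER_HOUR * hours)
    return 0 <= claimed_coins <= upper_bound
```

The server-side check is deliberately cheap: it needs only the last-seen timestamp and a bound, so the expensive simulation work stays on the client.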

Service Oriented Architecture

1. Playfish are really big advocates of SOA. An SOA encapsulates data and function together into components that can be deployed independently throughout a distributed system. The distributed components talk through APIs called service contracts.

2. Services make sure the dependencies between all the parts of the system are well known and as loosely coupled as possible.

3. Supports complexity management. The system can be composed into separate understandable components.

4. Components are deployed and upgraded independently, which gives flexibility and agility and makes it easier to scale development and operations teams.

5. Independent services can be optimized independently of other services.

6. When a service fails it’s easier to degrade gracefully.

7. Each game is considered to be a service. The UI and the backend are a package. They don’t do separate releases of the UI and the backend.
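The point about degrading gracefully when a service fails can be illustrated with a small sketch. It assumes a game screen that needs one essential service (the profile) and one optional service (the leaderboard); the exception type and all function names are hypothetical, chosen only for this example.

```python
# Hypothetical sketch: a game degrades gracefully when an optional service fails.
class ServiceUnavailable(Exception):
    """Raised by a service client when the remote service cannot be reached."""


def call_with_fallback(service_call, fallback):
    """Return the service result, or the fallback value if the service is down."""
    try:
        return service_call()
    except ServiceUnavailable:
        return fallback


def broken_leaderboard():
    # Stand-in for a remote call that fails.
    raise ServiceUnavailable("leaderboard service is down")


def load_game_screen(fetch_profile, fetch_leaderboard):
    # The profile is essential, so a failure here is allowed to propagate.
    profile = fetch_profile()
    # The leaderboard is optional: show an empty one rather than fail the game.
    leaderboard = call_with_fallback(fetch_leaderboard, fallback=[])
    return {"profile": profile, "leaderboard": leaderboard}
```

Because each dependency sits behind its own service boundary, the decision about what is essential and what can be dropped is made per call rather than for the whole system.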

The Cloud

1. Playfish has been 100% cloud based from the very start, launching their first game on a beta version of EC2 in 2007.

2. The cloud allows Playfish to innovate and try new features and new games with very low friction, which is key in a fast-moving market. Moving from their many-read-replica system to a sharded system took 2 weeks, which couldn’t have been done without the flexibility of the cloud.

3. The cloud allows them to concentrate on what makes them special, not on building and managing servers. Because of the cloud, operations doesn’t have to focus on machine maintenance and can focus on higher-value services, like developing automation across all their different servers and games.

4. Capacity is now seen as a commodity when designing applications.

5. The ratio of servers to operations people is 100:1. Such a high ratio is possible because of the cloud infrastructure.

6. Servers fail, so this must be planned for from the start.

7. It’s not possible to keep adding memory to servers so you may have to scale out earlier than you would like.

8. A key feature of the cloud is flexibility. You can be as agile as you want. You don’t need to be surprised when you suddenly get a lot of traffic. You don’t have to wait for procurement of servers.

9. You never know how quickly a game will take off. Sometimes you do know (sports games take off quickly), but other games may suddenly explode. In the cloud that doesn’t have to be a problem.

10. From the beginning there was never an expectation that they could scale-up in the cloud. Everything is designed to scale by adding more machines.

11. They can’t use all the Amazon services because some, ELB and RDS for example, weren’t available when they started, so they had to build their own. Switching away from their own systems, which they understand, to those services now would be unnecessarily risky and wouldn’t make sense.

12. Playfish is cloud to the core. They take advantage of everything they can in the cloud. More capacity is acquired with ease. They have no internal servers at all. All development machines are in the cloud. The cloud makes it trivial to launch new environments. Testing sharding, for example, is easy: simply copy everything over with a new configuration. This is much harder when running in a datacenter.

13. The cloud is not more costly than bare metal when you consider everything.

(1) You get a lot of things that would be a lot of work to build yourself. Take the advanced availability features, for example. Change an API call and you get double datacenters. You can’t do that in a bare metal situation. Just consider the staffing costs to set up and maintain it all. The cloud looks really expensive, but when you get really big, capacity breaks start kicking in.

(2) The major costs to consider are opportunity costs. This is the single biggest advantage. For example, when they first implemented sharding in Pet Society it took 2 weeks from start to deploy. Users were immediately happy. The speed of their implementation relied on being able to fire up a whole load of servers in production, then test and migrate data. If you had a two-month lead time you would have had a lot of unhappy users for two months.

14. Playfish runs in multiple availability zones within the same region. Servers are relatively close, which reduces latency. They aren’t spread out like in MMO systems. Latency is dealt with at a higher level using asynchronous writes, caching in the client, and caching in a CDN. It can take 3 seconds to perform a game action back on the server, but because it’s async, users don’t notice. The CDN helps reduce what they do notice, which is asset and game loading.

15. CloudFront is used to reduce load latency. Playfish is global in the sense that they have users all over the world. The loading time of the game, which includes the Flash code plus game assets, is their most noticeable latency. CloudFront reduces this latency as it spreads the content out.

Database System

1. MySQL is used as a sharded key-value database to store BLOBs.

2. Users are sharded across multiple database clusters, each with their own master and read replica.

(1) There’s little benefit for them to have more replicas because they are write heavy. Nearly all the traffic is writes. Writes are harder to scale: they can’t be cached, and more read replicas don’t help.

(2) In an earlier architecture they had one master with 12 read replicas, which didn’t perform well.

(3) With sharding they went from one master and 12 read replicas to two masters and two read replicas, which helped with both reads and writes.

3. Sharding meant indexes got smaller, which means they could fit in memory. Keeping the index in cache ensures a user lookup doesn’t have to hit the disk; lookups can be served from RAM.

4. By sharding they can control how many users are in each shard, so they can be sure they won’t blow their in-memory index cache and start hitting the disk.

5. They have tried to optimize the size of a user record so that more users will fit in memory. This is why they went to storing BLOBs instead of using data normalized into rows. Now there is one database record per user.

6. Work is taken out of the database server and moved to the application servers, which are very easily scaled horizontally in the cloud.

7. The scaling techniques most websites use for read caching, like memcache, aren’t that useful to Playfish. With a Flash client, most of what would be cached in memcache is cached in the client. Send it once to the server and it’s stored.

8. Sharding is used to get more write performance.

(1) Writes are 60% of the workload. Usually it’s 10:1 the other way. They still use MySQL for data storage because they are very comfortable with its performance characteristics under load, etc.

(2) For each shard there’s a master and at least one read replica. For most there’s just one read replica, but it depends on the access pattern of the service. Reads are split to the read replicas. The few places that do have more reads have more read replicas. Read replicas are also used to keep data remotely as a backup.

9. Playfish is driven by pure necessity. They built their own key-value store because it had to be done. Why not use NoSQL? They are looking into the options, but at the same time they have a solution that works and whose behavior they know. They are interested in a NoSQL solution for the operations side, for managing multiple databases. It wasn’t easy to go into the mode of running NoSQL, but it was a necessity driven by their requirements.

10. In a scale-out situation you have to go to something like sharding, and at that point many SQL features go away. You have to do a lot more work yourself. You can’t just add an index once you have blobbed and sharded.

11. When going NoSQL you are giving up flexibility of access patterns. Relational databases are good because you can access the data in any way you want. For example, since they can’t use SQL to sum up fields anymore, they either aggregate on the fly or use a batch process to aggregate.

12. Backup is to S3.
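The combination described in this section (users sharded across databases, one BLOB record per user, and aggregation moved into the application) can be sketched in a few lines. This is only an illustration under stated assumptions: in-memory `sqlite3` databases stand in for the MySQL shards, JSON stands in for whatever BLOB encoding Playfish actually uses, modulo placement stands in for their real shard mapping, and all function names are hypothetical.

```python
import json
import sqlite3

# Hypothetical sketch: sqlite3 in-memory databases stand in for MySQL shards.
NUM_SHARDS = 2


def make_shards(n=NUM_SHARDS):
    shards = []
    for _ in range(n):
        db = sqlite3.connect(":memory:")
        # One row per user: the primary key is the only index.
        db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, data BLOB)")
        shards.append(db)
    return shards


def shard_for(shards, user_id):
    # Simple modulo placement; a real system might use a lookup table instead.
    return shards[user_id % len(shards)]


def save_user(shards, user_id, state):
    # The whole user state, including every virtual item, is a single BLOB.
    blob = json.dumps(state).encode("utf-8")
    db = shard_for(shards, user_id)
    db.execute("REPLACE INTO users (user_id, data) VALUES (?, ?)", (user_id, blob))
    db.commit()


def load_user(shards, user_id):
    row = shard_for(shards, user_id).execute(
        "SELECT data FROM users WHERE user_id = ?", (user_id,)).fetchone()
    return json.loads(row[0].decode("utf-8")) if row else None


def total_coins(shards):
    # SQL SUM() cannot see inside the BLOB, so aggregate in the application.
    total = 0
    for db in shards:
        for (blob,) in db.execute("SELECT data FROM users"):
            total += json.loads(blob.decode("utf-8")).get("coins", 0)
    return total
```

Saving a user always rewrites exactly one row, no matter how many items the user owns, which is the property that keeps the per-shard index small. The price is visible in `total_coins`: a question that used to be one SQL statement becomes a scan in application code.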

Flash – The Client

1. Client-side CPU and resources scale with the number of users, so it’s sensible to make as much use of the client as possible. Push as much processing as possible to the client.

2. Changes are written asynchronously back to the server, which helps hide network latency from the user.

3. Changes are checked on the server side to detect cheating.

4. Flash talks to the Java application servers using a service level API.

5. Bringing processing closer to client gives the user a better experience. A website brings servers closer to users. Playfish brings the processing even closer, on the client.
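The client behavior listed above (apply the action locally, write it back asynchronously, and let the server check it for cheating) might look roughly like this. The real client is Flash and the real server is Java; the Python below is only a sketch of the pattern, and the item catalogue, class names, and validation rule are assumptions.

```python
from collections import deque

# Hypothetical sketch: the client applies actions locally and saves them later.
ITEM_PRICES = {"hat": 10, "sofa": 50}   # assumed catalogue shared by both sides


class Server:
    def __init__(self, coins):
        self.coins = coins
        self.rejected = []

    def apply(self, action):
        """Validate a client-decided action; reject anything that looks like cheating."""
        price = ITEM_PRICES.get(action["item"])
        if price is None or price > self.coins:
            self.rejected.append(action)
            return False
        self.coins -= price
        return True


class Client:
    def __init__(self, coins):
        self.coins = coins
        self.pending = deque()      # actions not yet saved to the server

    def buy(self, item):
        # The UI updates immediately; no round trip to the server.
        self.coins -= ITEM_PRICES[item]
        self.pending.append({"item": item})

    def flush(self, server):
        # Called in the background, e.g. on a timer; latency here is invisible.
        results = []
        while self.pending:
            results.append(server.apply(self.pending.popleft()))
        return results
```

A client that claims more coins than the server has on record simply has its purchases rejected on the next flush, while an honest client never waits on the network.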

YAMI4 – Messaging

1. YAMI4 is the messaging system Playfish decided to use after a long evaluation process. It offers point-to-point connectivity, low latency, no single-point-of-failure, and asynchronous messaging for event-driven processing and parallelism in multiple backend services.

2. After services go through a discovery phase to learn where each endpoint is located, messages are transported directly between services. It’s a brokerless model. Messages aren’t funneled into a centralized service and then redistributed. Messaging is very efficient with this model, and latency is reduced because there are no intermediary hops. They specifically did not want a separate component to handle traffic and then transfer messages around. This approach reduces failure points and latency.

3. They considered Thrift, but YAMI4 won because of its asynchronous operations. Thrift uses an RPC model that looks like a local function call, which makes it harder to deal with errors, timeouts, etc. YAMI4 is a messaging model, not an RPC model.

4. YAMI4 doesn’t handle the service discovery aspect, so they built their own layer on top that does a discovery phase and then talks directly to the other service. It’s more like how the Internet works.

5. As a messaging system, messages do not invoke methods on objects. No objects flow over the network either. Objects live in the services. Each service is responsible for activating, passivating, and dispatching operations.
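The brokerless model with a separate discovery phase can be pictured with an in-process sketch. To be clear, this is not the YAMI4 API and does no networking: the `Directory` and `Service` classes are hypothetical stand-ins that only show the shape of the idea, namely that the directory is consulted once per peer and never carries messages.

```python
# Hypothetical sketch of brokerless messaging; this is NOT the YAMI4 API.
class Directory:
    """Discovery only: maps service names to endpoints, never carries messages."""

    def __init__(self):
        self._endpoints = {}
        self.lookups = 0

    def register(self, name, endpoint):
        self._endpoints[name] = endpoint

    def lookup(self, name):
        self.lookups += 1
        return self._endpoints[name]


class Service:
    def __init__(self, name, directory):
        self.name = name
        self.inbox = []             # messages are plain data, not objects or calls
        self._directory = directory
        self._peers = {}            # endpoints cached after the discovery phase
        directory.register(name, self)

    def send(self, target, message):
        if target not in self._peers:
            self._peers[target] = self._directory.lookup(target)
        # Direct delivery to the peer: no intermediary hop.
        self._peers[target].inbox.append(dict(message, sender=self.name))
```

After the first message to a given peer, the directory could disappear entirely and the two services would keep talking, which is why the model has no single point of failure on the message path.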

Dealing With Multiple Social Networks

1. One of their main challenges is being able to support so many games, and increasingly different types of games.

2. The principle of loose coupling is used as much as possible:

(1) Team structure is matched to architecture by services being owned by teams.

(2) Services keep well-defined interfaces, so that each team can iterate on and deploy their own service without affecting other teams.

(3) When interfaces need to change, interfaces are versioned. They try to maintain backward compatibility so other teams do not have to roll out changes.

3. To facilitate all of this, a common set of standards is applied to all services:

(1) Common service transport (YAMI4)

(2) Common operational standards such as how services are configured and how they provide monitoring information.

4. A combination of common standards and loose coupling allows both development and operations teams to be agile and efficient.
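Versioned interfaces with backward compatibility, as described in the list above, can be as simple as a handler that keeps answering old requests in the old shape. The request format, field names, and version numbers below are assumptions for illustration only.

```python
# Hypothetical sketch: one service accepts both old and new request versions.
def handle_get_profile(request, store):
    """Version 1 returns the name only; version 2 adds the avatar."""
    version = request.get("version", 1)   # callers that send no version are v1
    user = store[request["user_id"]]
    if version == 1:
        return {"name": user["name"]}
    if version == 2:
        return {"name": user["name"], "avatar": user.get("avatar", "default")}
    raise ValueError("unsupported version: %r" % version)
```

A team that still sends version 1 requests does not have to roll out anything when version 2 ships, which is what lets each team deploy on its own schedule.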

Development and Operations

1. Services are released independently from each other.

2. Resources are separated by service. A problem with one service will not impact an unrelated service.

3. Teams are organized by service, although there is some overlap.

4. Operations teams are separate from the development teams, though there is a close relationship. There are no big handoff phases. They don’t just throw a release over the wall. Operations is included in design.

5. Developers don’t release code. They check it in and operations picks it up.

6. Teams have a lot of flexibility in how they deploy. Most games have a weekly release cycle. Everything is iterative, which means the code is good enough to go live, but features are not necessarily completely finished. The weekly release cycle is especially valuable for games with virtual items because users like to know there’s new exciting stuff every week. Other teams can work in longer feature chunks. It depends. That’s one of the advantages of working with SOA. Teams can work independently. There doesn’t need to be one release cycle.

Java – The Server

1. Java supports creating clean reusable components.

2. There is a wealth of open-source libraries.

3. Java is flexible: it can be used to implement web applications, request processing, batch processing, and event-driven systems.

4. Java has many tuning options. They’ve done a lot of work tuning garbage collection to hone performance.

Game Design

1. To make games users will enjoy, Playfish combines data-driven design with good old human-inspired game design. Data is used to inform a lot of decisions about making games that users will enjoy. They can see from the data what users do the most. That tells them how to make an individual game better, but also what to do when designing new games.

2. Playfish loves behavioral analysis. They look at what people are doing inside games and then ask them why. In support of this, the client and server are heavily instrumented to generate events on user actions. These events are then processed to create aggregated information about what the users have done in the games. They are now using EMR/Hadoop/Hive to be able to access this huge collection of data with a SQL-like interface. Using EMR in the cloud is working stunningly well, and they are very happy with it. Huge amounts of data can be stored in a granular form and it’s still possible to quickly find out what you want because everything is designed to work in parallel, which makes it an excellent fit for the event-type data produced by games.

3. It’s quite surprising what users will want. They may not even know it themselves. What people think they want is different from what people actually use. Sometimes costly features are put in that users say they really want, but they end up not using them. Then there are features that nobody mentions but people use all the time. People who post to forums are often people who dislike stuff, so it’s easy to get a very unbalanced view: because they are really passionate about the game, they hate every change. Data helps find out what people really love.

4. Game design can’t be just data driven or games will be soulless. You have to get the data-driven balance right. You can’t just do what the data says to do. Go with your inspiration about what to add or change in a game, then use the data to help polish it up.

5. Some features take a lot of work to create, but users don’t end up using them because the features are too complicated. They set up funnels to see if people abandon features before completing them, then they can figure out why. Maybe a feature needs to be tweaked or abandoned. Engagement is tracked as a measure of the love users have for a game. They then invest in content that keeps people coming back so the ecosystem will grow.

6. Agility + Data + Design = new inspired features can be tried out quickly to see if they work or not. The result is a more fun game.

7. Teams are organized into small close-knit groups that have full creative freedom. All team members get a say on what makes a game more fun.

8. Experiments with games are run to go after a niche to see if the niche responds. Some will be successful, some less so.
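The funnels mentioned above, used to see where users abandon a feature, reduce to a simple count over instrumented events. The sketch below is a small in-memory stand-in for what would be a Hive query over a much larger event log; the event names and the `(user_id, event_name)` tuple format are assumptions, and for brevity it ignores the order in which a user's events occurred.

```python
# Hypothetical sketch: count how many users reach each step of a feature funnel.
def funnel_counts(events, steps):
    """events: iterable of (user_id, event_name); steps: ordered event names.

    A user counts for step N only if they also produced every earlier step.
    """
    seen = {}
    for user_id, name in events:
        seen.setdefault(user_id, set()).add(name)
    counts = []
    for i in range(len(steps)):
        required = set(steps[: i + 1])
        counts.append(sum(1 for names in seen.values() if required <= names))
    return counts
```

A sharp drop between two adjacent counts points at the step where the feature is too complicated or otherwise losing people, which is the signal for tweaking or abandoning it.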

Lessons Learned

1. Build an architecture that takes advantage of the special nature of your application. Playfish has tailored an architecture to their very specific needs. Don’t try to build a general architecture that will scale everything. Take advantage of the nature of your space to make life as easy as possible.

2. Don’t worry about getting it right the first time. Get something out the door and start learning. If they had tried 3 years ago to get where they are now, they would still be building the system. They got something out the door that couldn’t scale at all, but they got it out in 5 weeks. This is the key.

3. Don’t be afraid to stick with things you know and understand. Pick stuff you are familiar with. You don’t need to go with the newest and coolest. There’s a lot of value in knowing a product well and being able to predict how it will operate on your use cases. You will get into a lot less trouble that way.

4. Keep things simple first, then scale when you need to. Use that lead time to build out what you need.

5. Shard and BLOB. Use sharding to scale out writes. Use BLOBs to reduce the number of records per user to one. This speeds up object access and allows more objects in RAM.

6. Always have a rough idea of how you are going to scale. Don’t design features too early, but don’t create disruptive dependencies. For example, it’s fine to launch without scaling in place, but it’s not cool to have a feature that isn’t going to work after you shard.

7. Scaling data is harder than scaling processing. Where are you storing data? How many people need to read and write it? Stateless app servers make it easy to add more processing; scaling the data is a lot harder.

8. Don’t overcomplicate things. Scaling a simple system is easier. Keep it as simple as possible for as long as possible. This allows you to learn where your pain points are and then you can address those specifically. A lot of people think they have to build the huge thing from day one. For example, if you don’t need multiple levels of caching then don’t put them in, because you have to support them.

9. Use SOA to manage complexity. As new games are added, splitting code up into different components that are managed by different teams helps keep the overall complexity of the system down, which helps make everything easier to scale.

10. Take calculated risks. Playfish went to the cloud but had a backup plan in case it failed. It was a risk, but it also had a lot of benefits. If Amazon pulled their service, they could move if they really had to. That’s the benefit of using Infrastructure-as-a-Service. With Platform-as-a-Service there’s a much greater risk because you build much deeper dependencies. For example, the risk of Playfish moving to a NoSQL product when they have their own working key-value database is also too great.

11. Operations and development should understand how a system will work in production. Is it easy to manage? Is it easy to support customers? How will it be configured and monitored? Develop what you need to make all this happen. Developers can’t just throw code over the wall. There should be no wall.

12. Use data to help find out what people really love. Users don’t always know what they really like best. Instrument your code and analyze the data to find meaningful patterns and use that information to continually improve the system.

13. The hardest thing to scale is people. You have to make it really easy for people to do their work. It’s a lot easier to get a lot more servers than it is to get a lot more people.

I’d like to thank Playfish for taking the time for this informative and insightful interview. If you would like your architecture featured on HighScalability.com, please contact me to get started.

Playfish is looking for people. For career opportunities please see their Jobs Page. From my dealings with a few of the folks at Playfish, they seem to have their stuff together. Worth a look, as they have many open positions and they are looking for server-side engineers. They want to build new systems to understand users better using EMR/Hadoop/Hive and scale the overall system to add more users and games. (source: highscalability)
