
Jason Pearlman on the Evolution of Draw Something's Backend Technology

Published: 2012-04-20 09:14:46

Author: Jason Pearlman

I have worked at OMGPOP for almost four years now and have watched it transform from a dating site into a games company before finally finding its footing in mobile games. Along the way we have faced all kinds of challenges and tried many different technologies and business plans.

I have always seen us as the little guy that can gain an edge through fast prototyping, agile development, and the latest technology. Being in games also means testing out as many different ideas as possible to see what sticks. In my years here I have built avatar systems, a text adventure game engine, a full-featured MySQL sharding library, a multiplayer real-time platformer on top of our JavaScript game engine, an AIM bot system, and a whole set of chat games powered by a bot framework we created.

Yet the backend behind all of these games is supported by a tiny systems team of just three people: Christopher Holt, Manchul Park, and myself. We built everything from scratch and believed we had our approach to building and scaling game backends down pretty well. That was until Draw Something came along.

draw_something (from idownloadblog.com)

Countdown

The prototype of Draw Something actually launched about four years ago on our website OMGPOP.com (gamerboom note: then called iminlikewithyou.com) as a real-time drawing game named Draw My Thing. It was fun and had a fairly large player base relative to the size of our site at the time. We also built a Facebook version, which attracted a loyal following.

Last year we decided to build a mobile version of the game. At that point OMGPOP was still searching for the direction that suited it best, and we were doing everything we could to land a real hit. Like most developers, that meant making as many games as possible, as quickly as possible, and Draw Something was no exception. We knew the game had potential, but no one could have predicted how far it would go. From a technical standpoint we did not treat it any differently, so our backend team simply kept doing what it always does: building things to be fast, efficient, and able to scale.

We have learned to keep things simple. The original Draw Something backend was designed as a simple key/value store with versioning, and the service was built into our existing Ruby API (using the merb framework and the thin web server). Our initial idea was to reuse everything we had built before through the existing API and write only some new key/value code for Draw Something. Because we design for scale, we chose Amazon S3 from the start as the store for all of this key/value data. The thinking behind that choice was to give up some latency in exchange for effectively unlimited scalability and storage.
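
To make the idea concrete, a versioned key/value wrapper around S3 might look something like the sketch below. This is only an illustration of the pattern, not OMGPOP's actual code; the class name, bucket layout, and use of the aws-sdk-s3 gem are assumptions.

    require 'aws-sdk-s3'
    require 'json'

    # Each write becomes a new object at "<key>/<version>", and a small pointer
    # object at "<key>/current" records the latest version so reads can find it.
    class VersionedKVStore
      def initialize(bucket)
        @s3     = Aws::S3::Client.new(region: 'us-east-1') # credentials come from the environment
        @bucket = bucket
      end

      def put(key, value)
        version = Time.now.utc.strftime('%Y%m%d%H%M%S%6N')  # good enough for a sketch
        @s3.put_object(bucket: @bucket, key: "#{key}/#{version}", body: JSON.dump(value))
        @s3.put_object(bucket: @bucket, key: "#{key}/current", body: version)
        version
      end

      def get(key, version = nil)
        version ||= @s3.get_object(bucket: @bucket, key: "#{key}/current").body.read
        JSON.parse(@s3.get_object(bucket: @bucket, key: "#{key}/#{version}").body.read)
      end
    end

Every read and write pays an S3 round trip, which is exactly the latency-for-scalability trade-off described above.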

The rest of our stack is fairly standard. Anyone building a scalable system tries to make every layer scale independently of the others. The NGINX web server sits at the front and points to the HAProxy software load balancer, which in turn hits our Ruby API running on the thin web server. The main data store behind all of this is MySQL, sharded only when absolutely necessary. We use memcached heavily, and redis for asynchronous queueing through the excellent Ruby library resque.
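
As an example of the asynchronous queueing piece, a resque job is just a class with a queue name and a perform method. The job below is hypothetical, but the enqueue/worker mechanics are standard resque.

    require 'resque'

    # Work is pushed onto a redis-backed queue from the web request and handled
    # later by separate worker processes, keeping API responses fast.
    class NotifyPartnerJob
      @queue = :notifications

      def self.perform(user_id, drawing_id)
        # look up the partner and send the push notification here
        puts "notifying user #{user_id} about drawing #{drawing_id}"
      end
    end

    # Called from the API; a worker started with `rake resque:work QUEUE=notifications`
    # picks the job up asynchronously.
    Resque.enqueue(NotifyPartnerJob, 42, 1337)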

Lift Off

A few days after Draw Something launched, we started to notice something strange: the game was growing on its own, and growing fast. It reached 30,000 downloads on release day, and about ten days later downloads took off, accelerating exponentially and soon passing the million mark.

Celebrities started tweeting about the game, from John Mayer (gamerboom note: the American singer-songwriter) to Pauly D, sending even more traffic our way. And existing players were not leaving; they were hooked, so total usage climbed even higher than the number of daily downloads.

Most engineers build their software to scale, but in any complex system, even one you try to benchmark and test, it is hard to know exactly which parts will fail, in what way, what changes will be needed, and when.

Ground Control

The first problem we ran into was that our usual API is very fast, which meant that running the thin web server the way we always had (single threaded, one request at a time) had been fine; but with a public cloud service whose response times are unpredictable, requests can start backing up.

We watched things begin to back up and knew this was not sustainable, so in the meantime we kept bringing up more servers to buy ourselves time. Fortunately, we had designed the Draw Something API so that it could easily be broken out from our main API and framework. We are always curious about new technology, and we had been looking at Ruby 1.9, its fibers, and EventMachine + synchrony for a while. Combined with the need for a fix as soon as possible, this led us to Goliath, a non-blocking Ruby application server from PostRank. Over the next 24 hours I ported the key/value code and other supporting libraries, wrote a few tests, and brought the service live. The result was just what we hoped: we went from 115 application instances on six servers down to only 15.
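
For a sense of what a Goliath endpoint looks like, the minimal service below is in the spirit of the ported key/value API but is purely illustrative; the class name and request handling are assumptions. Goliath runs on EventMachine, and with em-synchrony each request runs in its own fiber, so slow upstream I/O no longer ties up the whole process the way it does under a single-threaded thin worker.

    require 'goliath'
    require 'json'

    # Save as key_value_api.rb and run with: ruby key_value_api.rb -sv -p 9000
    class KeyValueApi < Goliath::API
      use Goliath::Rack::Params   # parses query and body parameters into env.params

      def response(env)
        key = env.params['key']
        return [400, {}, 'missing key'] unless key

        # value = store.get(key)  # the real service would do non-blocking I/O here
        [200, { 'Content-Type' => 'application/json' }, JSON.dump('key' => key)]
      end
    end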

But the smooth sailing was short-lived, and problems soon began to surface. By this point we were working around the clock, and at about 1 a.m. one night we found the main issue: our cloud data store was returning errors on 90 percent of our requests. Shortly afterwards we received an email from the vendor telling us that we were running too hot and causing problems, so they would have to start rate limiting us.

At that point the service was essentially a black box to us, and we needed more control. We were now receiving around 30 drawings per second, which felt like a huge number at the time, and we needed a completely new backend that could scale to handle our traffic. We had been using Membase on some smaller systems and felt it would work well for this game, so that is what we decided to do.

We brought up a small Membase (now Couchbase) cluster, rewrote the entire application, and deployed it live at 3 a.m. that same night. Our cloud data store problems eased immediately, even though we still relied on it to lazily migrate data into the new Couchbase cluster. With these improvements in place, the game could keep growing.
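
The lazy migration works roughly like the sketch below (hypothetical wrapper objects; any Couchbase and cloud-store clients with get/set semantics would do): reads try the new cluster first, fall back to the old store on a miss, and copy the value forward so the next read is served locally.

    # Reads migrate records on demand; all new writes go straight to Couchbase.
    class LazyMigratingStore
      def initialize(couchbase, legacy)
        @couchbase = couchbase   # new primary store
        @legacy    = legacy      # old cloud data store, now effectively read-only
      end

      def get(key)
        value = @couchbase.get(key)
        return value if value

        value = @legacy.get(key)              # miss: fall back to the old store
        @couchbase.set(key, value) if value   # ...and migrate the record forward
        value
      end

      def set(key, value)
        @couchbase.set(key, value)
      end
    end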

The following week was even harder. Other random data store problems kept popping up, and at the same time we had to scale other parts of the infrastructure. During this period we tried to do our due diligence and talk to anyone who could help us handle this kind of explosive growth.

I must have spoken with more than ten smart, talented people, including Tom Pinckney and his great team at Hunch, Frank Speiser and his team at SocialFlow, Fredrik Nylander of Tumblr, Artur Bergman of Fastly, and Michael Abbot, formerly of Twitter. The funny thing was that although I asked everyone the same question, how they would handle this kind of challenge, each gave a different yet equally valid answer. It made us realize that our own answers were just as valid as those of the teams we respect so much, so we kept to the path we had started on and trusted our own judgment on which technology to pick and how to implement it.

Even with the problems we were having with Couchbase, we decided it was too risky to move off our current infrastructure onto something completely different. By then Draw Something was being played by three to four million people a day. We contacted Couchbase and got some advice, which essentially was to expand our clusters onto much beefier servers with SSD drives and plenty of RAM. That is what we did over the next few days: we built multiple clusters and sharded between them for even more scalability, while continuing to improve and scale the rest of the backend as traffic kept climbing. We were now averaging hundreds of drawings per second.
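
Sharding between clusters can be as simple as hashing each key to pick a cluster, as in the sketch below (illustrative only; the team's actual routing logic is not described). A stable hash sends the same key to the same cluster every time, so no lookup table is needed.

    require 'zlib'

    class ShardedStore
      def initialize(clusters)
        @clusters = clusters   # one store object per Couchbase cluster
      end

      def shard_for(key)
        @clusters[Zlib.crc32(key.to_s) % @clusters.size]
      end

      def get(key)
        shard_for(key).get(key)
      end

      def set(key, value)
        shard_for(key).set(key, value)
      end
    end

The trade-off is that adding a cluster changes which shard most keys map to, so growing the pool later means either migrating data or moving to a consistent-hashing scheme.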

As the game took off, our player numbers, by then in the millions, were doubling every day. It is hard to wrap your head around the fact that if usage doubles every day, your server count probably has to double every day too. Thankfully our systems were highly automated, and we kept bringing up more and more servers; in the end we were able to overshoot and catch up with growth by placing a single order of around 100 servers. But with that problem solved, bottlenecks showed up elsewhere.

That kept us on our toes, working day and night; at one point we were up for more than 60 hours straight without leaving our computers. We had to scale out the web servers using DNS load balancing, bring up multiple HAProxies, move tables off MySQL into their own databases, transparently shard tables, and more, all on demand, live, and usually in the middle of the night.

We were very lucky that most of our layers could scale with little or no major modification. Our detailed custom server monitoring tools helped all along the way, letting us keep a close eye on load and memory and even giving us real-time usage stats for the game, which helped with capacity planning. We eventually ended up with easy-to-launch application "clusters" containing NGINX, HAProxy, and Goliath servers, each independent of everything else; every cluster we launched increased our capacity by a constant amount. By this point we were receiving thousands of drawings per second, and traffic that had looked huge a week earlier was just a small bump on the current graphs.

draw something elvi (from gamasutra)

Looking Ahead

Everyone at OMGPOP was hugely supportive of our work and understood exactly how important what we were doing was for the company; we had the company's full backing.

It is rare to see growth of this magnitude in such a short time, and just as rare to get a look under the hood at what it takes to keep a game running at that scale. To date, Draw Something has been downloaded more than 50 million times in 50 days, and at its peak the servers were receiving about 3,000 drawings every second. Alongside the game's success, we are proud to say that despite a few rough patches we kept Draw Something up and running; had the game gone down, that enormous growth would have come to a dead stop.

This week we added some new features to Draw Something, such as comments and the ability to save drawings, features our players have been asking for.

Now that we are part of Zynga (technically Zynga Mobile New York), we can focus our efforts on making Draw Something as good as possible, while still keeping the culture that makes OMGPOP special. We are even planning to move the game over to Zynga's zCloud infrastructure (gamerboom note: built and tuned specifically to handle social game workloads).

Looking back on the past few weeks, it is hard to believe how far we have come: from a small company where a handful of people had to do everything, to being part of Zynga, with access to far more people and technology.

In the end, we finally reached our goal and found our hit game. Despite the setbacks and the long nights, we ended up with a backend that genuinely works and lets the game keep growing.

This article was translated by gamerboom.com. Reproduction without the copyright notice is prohibited; for reprint permission, please contact gamerboom.

Scale Something: How Draw Something rode its rocket ship of growth

by Jason Pearlman

I’ve worked at OMGPOP for almost four years now and have seen it transform from a dating company to a games company and then find its footing in mobile games. We’ve done tons of stuff and tried many different technologies and business plans.

I’ve always seen us as the little guy that can use fast prototyping, agile development, and the latest tech in order to gain an advantage. Also, being in the game world means you get to test out a lot of different ideas to see what sticks. In my time here, I’ve made avatar systems, a text adventure game engine, a full-featured mysql sharding library, a multiplayer real-time platformer built on our javascript game engine, an AIM bot system, a whole slew of chat room games powered by a bot framework we created, and tons more.

On the backend of all these games is a tiny systems team of three people — myself, Christopher Holt and Manchul Park. We built everything from scratch and thought we had our approach to building and scaling backend systems down pretty well. That was until Draw Something came along.

Countdown

The story of Draw Something’s development actually starts around four years ago when the first version of the game was created on our website OMGPOP.com (then iminlikewithyou.com). It was called “Draw My Thing” and was a real-time drawing game. It was fun and had a somewhat big player base relative to how big our site was at that time. We also made a Facebook version of it, and the game developed a pretty large following.

Last year, we decided to make a mobile version of the game. At that point, OMGPOP was still trying to find its way in the world. We did everything we could to land that hit game. For us, like many developers, it meant working on as many games as possible, and fast. Draw Something was no different. We knew the game had potential, but no one could’ve predicted how big of a hit it would become. From a technical standpoint, we treated Draw Something like a lot of our previous games. The backend team has always built things to be efficient, fast, and to scale.

We’ve learned to keep things simple. The original backend for Draw Something was designed as a simple key/value store with versioning. The service was built into our existing ruby API (using the merb framework and thin web server). Our initial idea was why not use our existing API for all the stuff we’ve done before, like users, signup/login, virtual currency, inventory; and write some new key/value stuff for Draw Something? Since we design for scale, we initially chose Amazon S3 as our data store for all this key/value data. The idea behind this was why not sacrifice some latency but gain unlimited scalability and storage.

The rest of our stack is pretty standard. Anyone who wants to build scalable systems will attempt to make every layer of the system scale independently from the rest. As the web frontend we use NGINX web server, which points to HAProxy software load balancer, which then hits our ruby API running on a thin web server. The main datastore behind this is MySQL – sharded when absolutely necessary. We use memcached heavily and redis for our asynchronous queueing, using the awesome ruby library called resque.

Lift Off

A few days after Draw Something launched, we started to notice something…strange. The game was growing — on its own. And it was growing fast. On release day, it reached 30,000 downloads. About 10 days later, the rocket ship lifted off — downloads accelerated exponentially, soon topping a million.

Celebrities started tweeting about the game – from John Mayer to Pauly D — sending us even more traffic. And people playing the game weren’t leaving — they were hooked, so total usage climbed even higher than the number of people downloading the game every day.

Most engineers develop their software to scale, but know that in any complex system, even if you try to benchmark and test it, it’s hard to tell exactly where things will fall over, in what way, what system changes need to be made, and when.

Ground Control

The first issue we ran into was the fact that our usual API is really fast, which means that using thin web server the way we always have (single threaded, one request at a time) was fine, but for the public cloud, unpredictable response times can back up everything.

So we watched and saw things starting to back up, and knew this was not sustainable. In the meantime, we just continued to bring up more and more servers to buy us some time. Fortunately we had anticipated this and designed the DrawSomething API in such a way that we can easily break it out from our main api and framework. Being always interested in the latest tech out there, we were looking at Ruby 1.9, fibers, and in particular Event Machine + synchrony for a while. Combined with the need for a solution ASAP, this led us to Goliath, a non-blocking ruby app server written by the guys at PostRank. Over the next 24 hours I ported over the key/value code and other supporting libraries, wrote a few tests and we launched the service live. The result was great. We went from 115 app instances on over six servers to just 15 app instances.

The smooth sailing was short lived. We quickly started seeing spikes and other really strange performance issues. At this point, we were pretty much working around the clock. Things got really bad around 1 a.m. one night, which is when we realized the main issue — our cloud data store was throwing errors on 90 percent of our requests. Shortly after, we received an email from our vendor telling us we were “too hot” and causing issues, so they would have to start rate limiting us.

At this point, the service was pretty much a black box to us, and we needed to gain more control. We were now receiving around 30 drawings per second, a huge number (at least to us at the time). So there we were, 1 a.m. and needing a completely new backend that can scale and handle our current traffic. We had been using Membase for a while for some small systems, and decided that that would make the most sense as it seemed to have worked well for us.

We brought up a small cluster of Membase (a.k.a. Couchbase), rewrote the entire app, and deployed it live at 3 a.m. that same night. Instantly, our cloud datastore issues slowed down, although we still relied on it to do a lazy migration of data to our new Couchbase cluster. With these improvements the game continued to grow, onward and upward.

The next week was even more of a blur. Other random datastore problems started to pop up, along with having to scale other parts of the infrastructure. During this time, we were trying to do some diligence and speak to anyone we could about how they would handle our exploding growth.

I must’ve spoken to 10-plus smart, awesome people, including Tom Pinckney and his great team from Hunch, Frank Speiser and his team from SocialFlow, Fredrik Nylander from Tumblr, Artur Bergman from Fastly, Michael Abbot, formerly of Twitter, and many others. The funny part was that every person I spoke to gave me a different, yet equally valid, answer on how they would handle this challenge. All of this was more moral support than anything, and it made us realize our own answers were just as valid as those of these other teams, for whom we have great respect. So we continued along the path we had started on, and went with our gut on what tech to pick and how to implement it.

Even with the issues we were having with Couchbase, we decided it was too much of a risk to move off our current infrastructure and go with something completely different. At this point, Draw Something was being played by 3-4 million players each day. We contacted Couchbase, got some advice, which really was to expand our clusters, eventually to really beefy machines with SSD hard drives and tons of ram. We did this, made multiple clusters, and sharded between them for even more scalability over the next few days. We were also continuing to improve and scale all of our backend services, as traffic continued to skyrocket. We were now averaging hundreds of drawings per second.

At one point our growth was so huge that our players — millions of them — were doubling every day. It’s actually hard to wrap your head around the fact that if your usage doubles every day, that probably means your servers have to double every day too. Thankfully our systems were pretty automated, and we were bringing up tons of servers constantly. Eventually we were able to overshoot and catch up with growth by placing one order of around 100 servers. Even with this problem solved, we noticed bottlenecks elsewhere.

This had us on our toes and working 24 hours a day. I think at one point we were up for around 60-plus hours straight, never leaving the computer. We had to scale out web servers using DNS load balancing, we had to get multiple HAProxies, break tables off MySQL to their own databases, transparently shard tables, and more. This was all being done on demand, live, and usually in the middle of the night.

We were very lucky that most of our layers were scalable with little or no major modification needed. Helping us along the way were our very detailed custom server monitoring tools, which allowed us to keep a very close eye on load and memory and even provided real-time usage stats on the game, which helped with capacity planning. We eventually ended up with easy-to-launch “clusters” of our app that included NGINX, HAProxy, and Goliath servers, all of which were independent of everything else and, when launched, increased our capacity by a constant. At this point our drawings per second were in the thousands, and traffic that looked huge a week ago was just a small bump on the current graphs.

Looking Ahead

Everyone at OMGPOP was very supportive of our work and fully realized how important what we were doing was for our company. We would walk in to applause, bottles of whiskey on our desk, and positive (but tense) faces.

It’s rare to see growth of this magnitude in such a short period of time. It’s also rare to look under the hood to see what it takes to grow a game at scale. To date, Draw Something has been downloaded more than 50 million times within 50 days. At its peak, about 3,000 drawings are created every second. Along with the game’s success, we’re quite proud to say that although there were a few rough patches, we were able to keep Draw Something up and running. If the game had gone down, our huge growth would’ve come to a dead stop.

This week, we’re thrilled to release some new features in Draw Something like comments and being able to save drawings. Our players have been dying for them.

Now that we’re part of Zynga (technically Zynga Mobile New York), we’re able to re-focus efforts on making Draw Something as good as possible, while still maintaining the culture that makes OMGPOP such a special place. We’re even making plans to move the game over to Zynga’s zCloud infrastructure that’s tuned and built specially to handle workloads for social games.

Looking back at the roller coaster ride of the last few weeks, it’s crazy to think how far we’ve come in such a short period of time. Coming from a small company where a handful of people needed to do everything to Zynga where we have access to the people and the technology needed to grow — from engineers to the zCloud — it’s amazing.

In the end, we finally found our hit game. Despite the late hours, near misses and near meltdowns, we landed on a backend approach that works. Here at OMGPOP we call that drawsome.(source:GAMASUTRA)

