Scale Something: How Draw Something rode its rocket ship of growth
by Jason Pearlman
I’ve worked at OMGPOP for almost four years now and have seen it transform from a dating company into a games company, and then find its footing in mobile games. We’ve done tons of stuff and tried many different technologies and business plans.
On the backend of all these games is a tiny systems team of three people — myself, Christopher Holt and Manchul Park. We built everything from scratch and thought we had our approach to building and scaling backend systems down pretty well. That was until Draw Something came along.
The story of Draw Something’s development actually starts around four years ago when the first version of the game was created on our website OMGPOP.com (then iminlikewithyou.com). It was called “Draw My Thing” and was a real-time drawing game. It was fun and had a somewhat big player base relative to how big our site was at that time. We also made a Facebook version of it, and the game developed a pretty large following.
Last year, we decided to make a mobile version of the game. At that point, OMGPOP was still trying to find its way in the world. We did everything we could to land that hit game. For us, like many developers, it meant working on as many games as possible, and fast. Draw Something was no different. We knew the game had potential, but no one could’ve predicted how big of a hit it would become. From a technical standpoint, we treated Draw Something like a lot of our previous games. The backend team has always built things to be efficient, fast, and to scale.
We’ve learned to keep things simple. The original backend for Draw Something was designed as a simple key/value store with versioning. The service was built into our existing ruby API (using the merb framework and the thin web server). Our initial idea was to reuse our existing API for everything we’d built before, like users, signup/login, virtual currency, and inventory, and to write some new key/value code for Draw Something. Since we design for scale, we initially chose Amazon S3 as the data store for all this key/value data, the idea being to sacrifice some latency in exchange for effectively unlimited scalability and storage.
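The versioned key/value design can be sketched in a few lines of ruby. This is purely illustrative, not OMGPOP's actual code: in production the backing store was Amazon S3, so a plain Hash stands in for the remote store here, and the class and key names are made up.

```ruby
# Illustrative sketch of a key/value store with versioning. A Hash stands
# in for the real remote backend (Amazon S3 in production).
class VersionedKVStore
  def initialize(backend = {})
    @backend = backend # maps "key:version" => value
    @latest  = {}      # maps key => latest version number
  end

  # Writing always creates a new version and returns its number,
  # so an older game state is never overwritten in place.
  def put(key, value)
    version = (@latest[key] || 0) + 1
    @backend["#{key}:#{version}"] = value
    @latest[key] = version
    version
  end

  # Reads default to the latest version but can fetch any older one.
  def get(key, version: @latest[key])
    @backend["#{key}:#{version}"]
  end
end

store = VersionedKVStore.new
store.put("game:42", "round 1 drawing")
store.put("game:42", "round 2 drawing")
store.get("game:42")              # latest version
store.get("game:42", version: 1)  # a previous version is still there
```

The point of the versioning is that a write never destroys the previous state, so a client can always fall back to an older version of a drawing if something goes wrong.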
The rest of our stack is pretty standard. Anyone who wants to build scalable systems will attempt to make every layer of the system scale independently from the rest. At the web frontend we use the NGINX web server, which points to the HAProxy software load balancer, which in turn hits our ruby API running on the thin web server. The main datastore behind this is MySQL, sharded only when absolutely necessary. We use memcached heavily, and redis for our asynchronous queueing via the excellent ruby library resque.
A few days after Draw Something launched, we started to notice something…strange. The game was growing — on its own. And it was growing fast. On release day, it reached 30,000 downloads. About 10 days later, the rocket ship lifted off — downloads accelerated exponentially, soon topping a million.
Celebrities started tweeting about the game – from John Mayer to Pauly D — sending us even more traffic. And people playing the game weren’t leaving — they were hooked, so total usage climbed even higher than the number of people downloading the game every day.
Most engineers develop their software to scale, but know that in any complex system, even if you try to benchmark and test it, it’s hard to tell exactly where things will fall over, in what way, what system changes need to be made, and when.
The first issue we ran into stemmed from the fact that our usual API is really fast. Using the thin web server the way we always had, single threaded and one request at a time, had always been fine, but with calls out to the public cloud, unpredictable response times can back everything up.
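A quick back-of-the-envelope model shows why this bites. A single-threaded server handles one request at a time per process, so total capacity is just process count times the reciprocal of per-request latency. The numbers below are hypothetical, not measurements from our systems:

```ruby
# Throughput of a pool of single-threaded, one-request-at-a-time servers:
# each process can serve at most 1/latency requests per second.
def max_requests_per_sec(processes:, latency_sec:)
  processes * (1.0 / latency_sec)
end

fast = max_requests_per_sec(processes: 100, latency_sec: 0.005) # 5 ms local call
slow = max_requests_per_sec(processes: 100, latency_sec: 0.5)   # 500 ms cloud call
# A 100x jump in latency cuts capacity 100x; the excess requests queue
# up behind blocked workers, and everything backs up.
```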
So we watched and saw things starting to back up, and knew this was not sustainable. In the meantime, we just continued to bring up more and more servers to buy ourselves some time. Fortunately, we had anticipated this and designed the DrawSomething API in such a way that we could easily break it out from our main API and framework. Always interested in the latest tech out there, we had been looking at Ruby 1.9, fibers, and in particular EventMachine + synchrony for a while. Combined with the need for a solution ASAP, this led us to Goliath, a non-blocking ruby app server written by the team at PostRank. Over the next 24 hours I ported over the key/value code and other supporting libraries, wrote a few tests, and we launched the service live. The result was great: we went from 115 app instances across more than six servers to just 15 app instances.
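The fiber idea at the heart of EventMachine + synchrony (and Goliath) can be shown with plain Ruby 1.9 fibers. This toy sketch is not Goliath code; it only illustrates how one process can park a request mid-flight while it "waits" on I/O and make progress on others in the meantime, instead of blocking on each request in turn:

```ruby
# Three pretend requests, each running in its own Fiber. Fiber.yield
# marks the point where a non-blocking I/O call would be issued; the
# fiber is resumed later, as if the I/O had completed.
requests = %w[req1 req2 req3].map do |name|
  Fiber.new do
    Fiber.yield "#{name}: waiting on datastore" # request parks itself here
    "#{name}: done"                             # resumed when I/O "completes"
  end
end

log = []
requests.each { |f| log << f.resume } # all three start; none blocks the others
requests.each { |f| log << f.resume } # each finishes in turn
```

With blocking I/O, req2 could not even start until req1 finished; with fibers, all three are in flight inside a single thread.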
The smooth sailing was short-lived. We quickly started seeing spikes and other really strange performance issues. At this point, we were pretty much working around the clock. Things got really bad around 1 a.m. one night, which is when we realized the main issue: our cloud data store was throwing errors on 90 percent of our requests. Shortly after, we received an email from our vendor telling us we were “too hot” and causing issues, so they would have to start rate limiting us.
At this point, the service was pretty much a black box to us, and we needed to gain more control. We were now receiving around 30 drawings per second, a huge number (at least to us at the time). So there we were at 1 a.m., needing a completely new backend that could scale and handle our current traffic. We had been using Membase for a while for some small systems, and decided it would make the most sense, since it had worked well for us.
We brought up a small Membase (a.k.a. Couchbase) cluster, rewrote the entire app, and deployed it live at 3 a.m. that same night. Instantly, our cloud datastore issues eased, although we still relied on the old store for a lazy migration of data into our new Couchbase cluster. With these improvements the game continued to grow, onward and upward.
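Lazy migration is a standard pattern, and a minimal sketch of it looks like the following. The class and key names are made up, and plain Hashes stand in for the two real datastores:

```ruby
# Lazy-migration wrapper: reads try the new cluster first and fall back
# to the old store, copying the value forward on a miss, so data moves
# over gradually as it is touched.
class LazyMigratingStore
  def initialize(new_store, old_store)
    @new_store = new_store
    @old_store = old_store
  end

  def get(key)
    value = @new_store[key]
    return value if value
    value = @old_store[key]
    @new_store[key] = value if value # backfill the new store on first read
    value
  end

  def put(key, value)
    @new_store[key] = value # all new writes go only to the new cluster
  end
end

old_store = { "drawing:1" => "old-blob" }
store = LazyMigratingStore.new({}, old_store)
store.get("drawing:1") # served from the old store, then copied forward
```

Because writes go only to the new cluster and reads backfill it on demand, traffic drains off the old store over time until it can be retired.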
The next week was even more of a blur. Other random datastore problems started to pop up, along with having to scale other parts of the infrastructure. During this time, we were trying to do our due diligence and speak to anyone we could about how they would handle our exploding growth.
I must’ve spoken to 10-plus smart, awesome people, including Tom Pinckney and his great team from Hunch, Frank Speiser and his team from SocialFlow, Fredrik Nylander from Tumblr, Artur Bergman from Fastly, Michael Abbot formerly of Twitter, and many others. The funny part was that every person I spoke to gave me a different, yet equally valid, answer on how they would handle this challenge. All of this was more moral support than anything, and it made us realize that our own answers were just as valid as those of these other teams, for whom we have great respect. So we continued along the path we had started on, and went with our gut on what tech to pick and how to implement it.
Even with the issues we were having with Couchbase, we decided it was too much of a risk to move off our current infrastructure and go with something completely different. At this point, Draw Something was being played by 3-4 million players each day. We contacted Couchbase and got some advice, which essentially was to expand our clusters, eventually onto really beefy machines with SSDs and tons of RAM. Over the next few days we did this, made multiple clusters, and sharded between them for even more scalability. We were also continuing to improve and scale all of our backend services as traffic continued to skyrocket. We were now averaging hundreds of drawings per second.
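Sharding between clusters can be as simple as hashing each key to a fixed cluster, so that every app server agrees on placement without any lookup service. This is a hypothetical sketch, with invented cluster names, not OMGPOP's actual scheme:

```ruby
require "zlib"

# Hash-based sharding across several datastore clusters. Because the
# mapping is a pure function of the key, every client computes the same
# placement with no coordination.
CLUSTERS = %w[couchbase-a couchbase-b couchbase-c].freeze

def cluster_for(key)
  CLUSTERS[Zlib.crc32(key) % CLUSTERS.size]
end
```

One caveat with plain modulo sharding: changing the cluster count remaps most keys, which is a reason to size the shard set generously up front, or to use consistent hashing instead.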
At one point our growth was so huge that our players — millions of them — were doubling every day. It’s actually hard to wrap your head around the fact that if your usage doubles every day, that probably means your servers have to double every day too. Thankfully our systems were pretty automated, and we were bringing up tons of servers constantly. Eventually we were able to overshoot and catch up with growth by placing one order of around 100 servers. Even with this problem solved, we noticed bottlenecks elsewhere.
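The doubling arithmetic is worth making explicit. With hypothetical fleet sizes (the story only gives the roughly 100-server order), a few days of doubling turns a handful of servers into a hundred:

```ruby
# Exponential growth in required capacity: if load doubles every day and
# servers scale linearly with load, server count must double daily too.
def servers_needed(initial_servers, days_of_doubling)
  initial_servers * 2**days_of_doubling
end

servers_needed(6, 0) # => 6
servers_needed(6, 4) # => 96
```

Placing one big order of capacity at once, as described above, overshoots the curve instead of chasing it night after night.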
This had us on our toes and working 24 hours a day. I think at one point we were up for around 60-plus hours straight, never leaving the computer. We had to scale out web servers using DNS load balancing, bring up multiple HAProxies, break tables out of MySQL into their own databases, transparently shard tables, and more. This was all being done on demand, live, and usually in the middle of the night.
We were very lucky that most of our layers were scalable with little or no major modification needed. Helping us along the way were our very detailed custom server monitoring tools, which allowed us to keep a very close eye on load and memory, and even provided real-time usage stats on the game that helped with capacity planning. We eventually ended up with easy-to-launch “clusters” of our app, each consisting of NGINX, HAProxy, and Goliath servers, all independent of everything else; launching one increased our capacity by a constant amount. At this point our drawings per second were in the thousands, and traffic that had looked huge a week ago was just a small bump on the current graphs.
Everyone at OMGPOP was very supportive of our work and fully realized how important what we were doing was for our company. We would walk in to applause, bottles of whiskey on our desk, and positive (but tense) faces.
It’s rare to see growth of this magnitude in such a short period of time. It’s also rare to look under the hood to see what it takes to grow a game at scale. To date, Draw Something has been downloaded more than 50 million times within 50 days. At its peak, about 3,000 drawings are created every second. Along with the game’s success, we’re quite proud to say that although there were a few rough patches, we were able to keep Draw Something up and running. If the game had gone down, our huge growth would’ve come to a dead stop.
This week, we’re thrilled to release some new features in Draw Something like comments and being able to save drawings. Our players have been dying for them.
Now that we’re part of Zynga (technically Zynga Mobile New York), we’re able to re-focus efforts on making Draw Something as good as possible, while still maintaining the culture that makes OMGPOP such a special place. We’re even making plans to move the game over to Zynga’s zCloud infrastructure that’s tuned and built specially to handle workloads for social games.
Looking back at the roller coaster ride of the last few weeks, it’s crazy to think how far we’ve come in such a short period of time. Coming from a small company where a handful of people needed to do everything to Zynga where we have access to the people and the technology needed to grow — from engineers to the zCloud — it’s amazing.
In the end, we finally found our hit game. Despite the late hours, near misses and near meltdowns, we landed on a backend approach that works. Here at OMGPOP we call that drawsome. (source: Gamasutra)