
Data-Oriented Design Now And In The Future

Posted: 2013-10-24 17:34:30

Author: Noel

Last year I wrote about the basics of Data-Oriented Design (see the September 2009 issue of Game Developer). In the time since that article, Data-Oriented Design has gained a lot of traction in game development, and many teams are thinking in terms of data for some of their more performance-critical systems. As a quick recap, the main goal of Data-Oriented Design is achieving high performance on modern hardware platforms. Specifically, that means making good use of memory accesses and multiple cores, and removing any unnecessary code. A side effect of Data-Oriented Design is that code becomes more modular and easier to test.

Data-Oriented Design concentrates on the input data available and the output data that needs to be generated. Code is not the thing to focus on (as in traditional Computer Science); it is something written to transform the input data into the output data in an efficient way. On modern hardware, that often means applying the same code to large, contiguous blocks of homogeneous memory.

Applying Data-Oriented Design

data (from socialbeta)

It's pretty easy to apply these ideas to a self-contained system that already works over mostly homogeneous data. Most particle systems in games are probably designed that way, because one of their main goals is to be very efficient and handle a large number of particles at high framerates. Sound processing is another system that is naturally implemented thinking about data first and foremost.
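
To make that concrete, here is a minimal sketch of a data-first particle update (the ParticleSystem layout and field names are illustrative, not from any particular engine): every attribute sits in its own contiguous array, and one loop applies the same code to each element.

#include <cstddef>

// Structure-of-arrays layout: each attribute lives in its own
// contiguous block, so the update streams through memory linearly.
struct ParticleSystem
{
    static const size_t kMaxParticles = 4096;
    float posX[kMaxParticles];
    float posY[kMaxParticles];
    float velX[kMaxParticles];
    float velY[kMaxParticles];
    size_t count;
};

// The same few instructions run over every live particle: no
// branches, no pointer chasing, friendly to cache and prefetcher.
void UpdateParticles(ParticleSystem& ps, float dt)
{
    for (size_t i = 0; i < ps.count; ++i)
    {
        ps.posX[i] += ps.velX[i] * dt;
        ps.posY[i] += ps.velY[i] * dt;
    }
}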

So what's stopping us from applying it to all the performance-sensitive systems in a game code base? Mostly just the way we think about the code. We need to be ready to really look at the data, and be willing to split the code into different phases. Let's take a high-level example and see how the code would have to be restructured when optimizing for data access.

Listing 1 shows pseudocode for what could be a typical update function for a generic game AI. To make things worse, that function might even be virtual, and different entity types might implement it in different ways. Let's ignore that for now and concentrate on what it does. In particular, the pseudocode highlights that, as part of the entity update, it performs many conditional ray-cast queries and updates some state based on the results of those queries. In other words, we're confronted with the typical tree-traversal code structure so common in Object-Oriented Programming.

// Pseudocode: a monolithic update that interleaves heavyweight
// ray casts with the rest of the entity's logic.
void AIEntity::Update(float dt)
{
    DoSomeProcessing();
    if (someCondition && Raycast(world))
        DoSomething();
    if (someOtherCondition && BunchOfRayCasts(world))
        DoSomethingElse();
    UpdateSomeOtherStuff();
}

Listing 1

Ray casts against the world are a very common operation for game entities. They're how entities "see" what's around them, and they're what allows entities to react correctly to their surroundings. Unfortunately, a ray cast is a very heavyweight operation, potentially touching many different areas of memory: a spatial data structure, other entity representations, polygons in a collision mesh, and so on.

Additionally, the entity update function would be very hard to parallelize across multiple cores. It's unclear how much data is read or written in that function, and some of that data (like the world data structure) might be particularly hard and expensive to protect from updates from multiple threads.

If we reorganize things a bit, we can significantly improve both performance and parallelization.

Break Up And Batch

Without seeing any of the details of what's going on inside the entity update, we can see the ray casts sticking out like a sore thumb in the middle. A ray-cast operation is fairly independent of anything else related to the entity, it's heavyweight, and there could be many of them, so it's a perfect candidate to break out into a separate step.

Listing 2 shows what the broken-up entity update code looks like. The update is now split into two passes: the first pass does the part of the update that can be done independently of any ray casts, and decides which ray casts, if any, need to be performed this frame.

// Pass 1: everything that doesn't depend on ray-cast results.
// Instead of performing ray casts, it queues up query requests.
void AIEntity::InitialUpdate(float dt, RayCastQueries& queries)
{
    DoSomeProcessing();
    if (someCondition)
        AddRayCastQuery(queries);
    if (someOtherCondition)
        AddBunchOfRayCasts(queries);
}

// Pass 2: finishes the update using the batched ray-cast results.
void AIEntity::FinalUpdate(const RayCastResults& results)
{
    UpdateSomeOtherStuff(results);
}

Listing 2

The game code in charge of the update processes all AI entities in batches (Listing 3). Instead of calling InitialUpdate(), resolving the ray casts, and calling FinalUpdate() for each entity in turn, it iterates over all the AI entities calling InitialUpdate() and adds every ray-cast query request to the output data. Once it has collected all the queries, it processes them in one go and stores their results. Finally, it makes one last pass and calls FinalUpdate() on each entity with the ray-cast results.

RayCastQueries queries;
for (int i = 0; i < entityCount; ++i)
    entities[i].InitialUpdate(dt, queries);

// Other updates that might need ray casts can run here
// and add their queries to the same batch.

RayCastResults results;
for (int i = 0; i < queries.count; ++i)
    PerformRayCast(queries[i], results);

for (int i = 0; i < entityCount; ++i)
    entities[i].FinalUpdate(results);

Listing 3

By removing the ray-cast calls from within the entity update function, we've shortened the call tree significantly. The functions are more self-contained, easier to understand, and probably much more efficient because of better cache utilization. You can also see how much easier it would now be to parallelize things, by sending all the ray casts to one core while another core is busy updating something unrelated (or by spreading the ray casts across multiple cores, depending on your level of granularity).
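
As a rough sketch of that split, continuing in the same pseudocode register as the listings (std::thread is used directly for brevity; a real engine would likely push this through its own job system, and UpdateSomethingUnrelated() is a hypothetical placeholder for any work that shares no data with the ray casts):

#include <thread>

void RunRaycastPhase(RayCastQueries& queries, RayCastResults& results)
{
    // One core works through the batched ray casts...
    std::thread raycastWorker([&queries, &results]
    {
        for (int i = 0; i < queries.count; ++i)
            PerformRayCast(queries[i], results);
    });

    // ...while this core updates something that doesn't touch
    // the ray-cast data (hypothetical placeholder).
    UpdateSomethingUnrelated();

    raycastWorker.join();
    // Only after the join is it safe to start the FinalUpdate() pass.
}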

Note that after calling InitialUpdate() on all entities, we could also process other game objects that need ray-cast queries and collect those as well. That way we can batch all the ray casts and compute them all at once. For years, graphics hardware manufacturers have drilled into us that we should batch our render calls and avoid drawing individual polygons. This is the same idea: by batching all ray casts into a single call, we have the potential to achieve much higher performance.

Splitting Things Up

Have we really gained much by reorganizing the code this way? We're now doing two full passes over the AI entities, so wouldn't that be worse from a memory point of view? Ultimately you need to measure it and compare the two. On modern hardware platforms, I would expect performance to be better because, even though we're traversing the entities twice, we're using the code cache much better and we're accessing the entities sequentially (which also lets us prefetch the next one).
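
A crude way to make that comparison is a wall-clock harness like the sketch below (real profiling would also look at cache-miss counters with the platform's tools; MeasureMs and the update callbacks passed to it are illustrative):

#include <chrono>

// Averages one update strategy over many frames so per-frame noise
// washes out; run it once per variant and compare the numbers.
template <typename UpdateFn>
double MeasureMs(UpdateFn update, int frames)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i)
        update();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / frames;
}

Calling it with each variant, e.g. MeasureMs(RunTwoPassUpdate, 1000) versus MeasureMs(RunSinglePassUpdate, 1000) (both hypothetical wrappers around the two code paths), yields comparable per-frame numbers.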

If this is the only change we make to the entity update, and the rest of the code is the usual deep tree-traversal code, we might not have gained much, because we're still blowing the cache limits with every update. We might need to apply the same design principles to the rest of the update function before we see performance improvements. But at the very least, even with this small change, we've made the code easier to parallelize.

What we have gained is the ability to modify our data to fit the way we use it, and that's the key to big performance gains. For example, after seeing how the entity is updated in two separate passes, you might notice that only some of the data stored in the entity object is touched by the first update, while the second pass accesses more specific data.

At that point we can split the entity class into two different sets of data. One of the hardest things at this stage is naming these sets of data in some meaningful way. They no longer represent real objects or real-world concepts, but different aspects of a concept, broken down purely by how the data is processed. So what was an AIEntity before can now become an EntityInfo (containing things like position, orientation, and some high-level data) and an AIState (with the current goals, orders, paths to follow, enemies targeted, and so on).

The overall update now deals with EntityInfo structures in the first pass and AIState structures in the second pass, making it much more cache-friendly and efficient.

Realistically, both passes will have to access some common data (for example, the entity's current state: fleeing, engaged, exploring, idle, and so on). If it's only a small amount of data, the best solution might be to simply duplicate it in both structures (going against all "common wisdom" in Computer Science). If the common data is larger, or is read-write, it might make more sense to give it a separate data structure of its own.
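
A sketch of what that split might look like (all field names here are hypothetical; note the small mode field deliberately duplicated in both structures, as just discussed):

enum class AIMode { Idle, Exploring, Engaged, Fleeing };

// Touched by the first pass: transform plus high-level data.
struct EntityInfo
{
    float  position[3];
    float  orientation[4];
    AIMode mode;    // small and read-mostly, so duplicated on purpose
};

// Touched by the second pass: goals, orders, paths, targets.
struct AIState
{
    int    currentGoal;
    int    targetEnemy;
    int    pathNodeCount;
    int    pathNodes[64];
    AIMode mode;    // duplicate of EntityInfo::mode
};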

At this point a different kind of complexity is introduced: keeping track of all the relationships between the different structures. This can be particularly challenging while debugging, because data belonging to the same logical entity is no longer stored in a single structure, making it harder to explore in a debugger. Even so, making good use of indices and handles keeps the problem manageable (see Managing Data Relationships in the September 2008 issue of Game Developer).
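
One common shape for such a handle is an index plus a generation counter, so a stale reference can be detected when a slot is reused (a generic sketch, not tied to any particular library):

#include <cstdint>

struct Handle
{
    uint32_t index;        // slot in the entity arrays
    uint32_t generation;   // must match the slot's current generation
};

// Each slot bumps its generation when the entity in it is destroyed,
// invalidating any handles that still point at the old entity.
struct Slot { uint32_t generation; };

inline bool IsValid(Handle h, const Slot* slots, uint32_t slotCount)
{
    return h.index < slotCount &&
           slots[h.index].generation == h.generation;
}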

Conditional Execution

So far things have been pretty simple, because we've been assuming every AI entity needs both updates and some ray casts. That's not very realistic, because entities are probably very bursty: sometimes they need a lot of ray casts, and sometimes they're idle or following orders and don't need any for a while. We can deal with this by adding conditional execution to the second update function.

The easiest way to conditionally execute the update is to add an extra output parameter to InitialUpdate() indicating whether the entity needs a second update. The same information could also be derived by the calling code from whether any ray-cast queries were added. Then, in the second pass, we only update those entities that appear in the list of entities requiring a second update.
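
In the same pseudocode register as the listings, the calling code might gather the indices of entities that asked for a second update and walk only that list afterwards (the extra output parameter and kMaxEntities are assumptions of this sketch):

// Pass 1: remember who actually asked for ray casts.
int secondPassIndices[kMaxEntities];
int secondPassCount = 0;
for (int i = 0; i < entityCount; ++i)
{
    bool needsSecondUpdate = false;
    entities[i].InitialUpdate(dt, queries, needsSecondUpdate);
    if (needsSecondUpdate)
        secondPassIndices[secondPassCount++] = i;
}

// ...perform the batched ray casts here...

// Second pass touches only the entities that requested it. The
// indexed access is the cache-unfriendly skipping discussed next.
for (int i = 0; i < secondPassCount; ++i)
    entities[secondPassIndices[i]].FinalUpdate(results);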

The biggest drawback of this approach is that the second update goes from traversing memory linearly to skipping over entities, potentially hurting cache performance. What we thought would be a performance optimization could end up making things slower, so unless we're gaining a significant improvement, it's often better to simply do the work for all entities whether they need it or not. However, if on average fewer than 10 or 20 percent of the entities need a ray cast, it may be worth skipping the second update on all the other entities, even at the price of the conditional-execution penalty.

If the number of entities updated in the second pass is fairly small, another approach is to copy all the necessary data from the first pass into a new temporary buffer. The second pass can then process that data sequentially without any performance penalty, which can completely offset the cost of copying the data.
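
A sketch of that variant (SecondPassData and GatherSecondPassData() are hypothetical names; the point is that the second pass streams through a tightly packed scratch buffer):

// Pass 1 copies just what the second pass needs, packed tightly.
SecondPassData scratch[kMaxEntities];
int scratchCount = 0;
for (int i = 0; i < entityCount; ++i)
{
    bool needsSecondUpdate = false;
    entities[i].InitialUpdate(dt, queries, needsSecondUpdate);
    if (needsSecondUpdate)
        scratch[scratchCount++] = entities[i].GatherSecondPassData();
}

// ...batched ray casts...

// Sequential walk over contiguous copies: no skipping.
for (int i = 0; i < scratchCount; ++i)
    FinalUpdate(scratch[i], results);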

Finally, another alternative, especially if the conditional execution stays fairly similar from frame to frame, is to relocate the entities that need ray casting so they sit together. That way the copying is minimal (swapping an entity to a new location in the array whenever it needs a ray cast), and we still get the benefit of a sequential second update. For this to work, all your entities need to be fully relocatable, which means working with handles or some other indirection, or updating all the references to the entities that swapped places.
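
A sketch of that relocation scheme: reserve the front of the array for entities that currently need ray casts, and swap entities across the boundary as their needs change (Swap() and FixupHandles() are hypothetical helpers; the latter repoints the handles of the two entities that moved):

// entities[0, activeCount) need ray casts; the rest don't.
void MarkNeedsRaycasts(AIEntity* entities, int& activeCount, int i)
{
    if (i >= activeCount)   // currently in the "no ray casts" group
    {
        Swap(entities[i], entities[activeCount]);
        FixupHandles(i, activeCount);
        ++activeCount;
    }
}

The inverse operation mirrors this: swap the entity back past the boundary and decrement activeCount.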

Different Modes

What if an entity can be in several totally different modes of execution? Even if it's the same entity type, traversing the entities linearly and calling the update function could end up running completely different code for each one, resulting in poor code-cache performance.

There are several approaches we can take in a situation like that:

If the different execution modes are also tied to different parts of the entity data, we could treat them as completely different entities and break their data components apart. That way we can iterate over each type separately and get all the performance benefits.

If the data is mostly the same and it's just the code that changes, we could keep all the entities in the same memory block, but rearrange them so that entities in the same mode are next to each other. Again, if you can relocate your data, this is very easy and efficient, since it only requires swapping a few entities whenever a state changes (see the sketch after this list).

Leave it alone! Ultimately, Data-Oriented Design is about thinking about your data and how it affects your program. It doesn't mean you always have to optimize every aspect of it, especially if the gains aren't significant enough to warrant the added complexity.
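
Here is the sketch referenced in the second option above: one array kept grouped by mode, with per-mode ranges tracked on the side (modeBegin and UpdateForMode() are hypothetical bookkeeping; modeBegin holds modeCount + 1 entries so each group's end is the next group's start):

// entities[] stays grouped by mode: [mode 0 ... | mode 1 ... | ...].
// On a mode change, swap the entity to the boundary of its new group
// instead of re-sorting the whole array.
for (int m = 0; m < modeCount; ++m)
{
    for (int i = modeBegin[m]; i < modeBegin[m + 1]; ++i)
        UpdateForMode(entities[i], m);   // the same code stays hot in
                                         // the cache for the whole run
}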

The Future

Is thinking about a program in terms of its data, and doing these kinds of optimizations, a good use of our time? Will all of this go away in the near future as hardware improves? As far as we can tell right now, the answer is a definite no. Efficient memory access with a single CPU is already a very complicated problem, and it gets much worse as we add more cores. Also, the number of transistors in CPUs (a rough measure of compute power) keeps increasing much faster than memory access times improve. That tells us that, barring new technological breakthroughs, we'll be dealing with this problem for a long time. It's a problem we need to deal with right now, and one we should build our technology around.

There are some things I'd like to see in the future to make Data-Oriented Design easier. We can all dream up a new language that would magically allow great memory access and easy parallelization, but replacing C/C++ and all existing libraries is always going to be a really hard sell. Historically, the best advances in game technology have been incremental, not throwing away existing languages, tools, and libraries (which is why we're still stuck with C++ today).

Here are two things that could be done right now and would work with our existing codebases. I know many developers are working on similar systems in their own projects, but it would be great to have a common implementation released publicly so we can all build on top of it.

Language

Even though a functional language might be ideal, either created from scratch or reusing an existing one, we could temporarily extend C to fit our needs. I would like to see a set of C extensions in which functions have clearly defined inputs and outputs, and code inside a function is not allowed to access any global state or call any code outside that function (other than local helper functions defined in the same scope). This could be done as a preprocessor or a modified C compiler, so it would remain compatible with existing libraries and code.
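
No such extension actually exists; purely to make the idea concrete, annotated functions might look something like this (the kernel/in/out keywords are invented for illustration):

// Hypothetical syntax: a 'kernel' declares every block of data it
// reads and writes. The preprocessor or compiler would reject access
// to globals and calls to anything but local helpers in this scope.
kernel void UpdatePositions(in  const float* velocities,
                            in  float        dt,
                            in  int          count,
                            out float*       positions)
{
    for (int i = 0; i < count; ++i)
        positions[i] += velocities[i] * dt;
}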

Dependencies between functions would be expressed by tying the outputs of some functions to the inputs of others. This could be done in code, or through GUI tools that help developers manage data relationships visually. That way, we could construct a dependency diagram of all the functions involved in every frame.

Scheduler

Once we have the dependencies for every function, we can build a directed acyclic graph (DAG) from them, giving us a global view of how data is processed every frame. At that point, instead of running the functions manually, we can leave that job in the hands of a scheduler.

The scheduler has full information about all the functions as well as the number of available cores (and, if we want to use it, information from the previous frame's execution). It can determine the critical path through the DAG and optimize the scheduling of tasks so that the critical path is always being worked on. If temporary memory buffers are a limiting factor on our platform, the scheduler can take that into account and trade some execution time for a reduced memory footprint.
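
The heart of such a scheduler can be quite small. Below is a sketch of the dependency-driven part (sequential for clarity; a real scheduler would hand ready tasks to worker threads and weigh the critical path, and the Task layout is an assumption of this sketch):

#include <functional>
#include <queue>
#include <vector>

struct Task
{
    std::function<void()> run;
    std::vector<int>      dependents;  // tasks consuming our outputs
    int                   unmetDeps;   // inputs not yet produced
};

// Kahn-style traversal of the DAG: a task becomes ready once every
// task feeding its inputs has finished.
void RunFrame(std::vector<Task>& tasks)
{
    std::queue<int> ready;
    for (int i = 0; i < (int)tasks.size(); ++i)
        if (tasks[i].unmetDeps == 0)
            ready.push(i);

    while (!ready.empty())
    {
        int t = ready.front();
        ready.pop();
        tasks[t].run();   // dispatched to a worker in a real scheduler
        for (int d : tasks[t].dependents)
            if (--tasks[d].unmetDeps == 0)
                ready.push(d);
    }
}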

Just like the language, the scheduler would be a very generic component and could be made public. Developers could use it as a starting point, build on top of it, and add their own rules for their specific games and platforms.

Even if we're not ready to create those reusable components, every developer involved in creating high-performance games should be thinking about the data in their games right now. Data is only going to get more important as the next generation of consoles and computers rolls in. (source: gamesfromwithin)

The original article was published on January 4, 2011; all events and data referenced are as of that date.

This article was translated by gamerboom.com. Reproduction that does not retain this copyright notice is prohibited; for reprint permission, contact gamerboom.


