
What multi-threaded programming is and what it does for games

Published: 2014-03-18 15:14:17

Author: Joseph Simons

This article covers some of what I have learned about multi-threaded development in video games. My goal is to explain how multi-threading works at both a low and a high level, and to look at a few common scenarios that can benefit from these techniques. I may even touch on a few situations you can get yourself into, and some techniques for debugging them. I hope you enjoy reading it and get something out of it.

What is multi-threading?

Let's start with what a thread is and why having more of them (up to a point) is a good thing.

Every program that runs on a modern computer (your PC, game console, or smartphone) is a process. Processes can have child processes as well, but for simplicity I won't cover those here. A process can have a number of threads associated with it, and these threads are what execute the actual code of the running program. Threads run code independently but share memory, which lets them operate on the same data while performing different calculations. The fact that they share memory is the powerful, double-edged sword of their functionality.
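
To make the shared-memory point concrete, here is a minimal C++ sketch (my own illustration, not from the article) in which two threads write into different halves of the same array owned by one process:

#include <thread>
#include <vector>

int main()
{
    std::vector<int> data(1000);
    // Both threads share the process's memory, so each can write
    // directly into its own half of the same vector.
    std::thread first([&data] {
        for (int i = 0; i < 500; ++i) data[i] = 1;
    });
    std::thread second([&data] {
        for (int i = 500; i < 1000; ++i) data[i] = 2;
    });
    first.join();
    second.join();
    return 0;
}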

A modern CPU is typically made up of several cores. Each core can run one thread at a time (hyper-threading can sort of run two interleaved, but I'll ignore that for simplicity). A computer with a single core simulates multitasking by having the operating system switch between threads at will, so while only one thread is actually executing at any moment, enough of them get a turn that everything appears to run simultaneously. This means that if your program wants to make full use of the computing resources your CPU has, you don't want more threads than it has cores, since you would just be switching between executing your own threads. This switching between executing threads is called a context switch, and while it isn't exorbitantly expensive, you still want to avoid it as much as you can to achieve optimal speed.
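
One practical way to respect the "no more threads than cores" guideline is to ask the runtime how many hardware threads are available and size a worker pool from that. A hedged C++ sketch (the structure here is my own assumption):

#include <thread>
#include <vector>

int main()
{
    // Hardware threads the system reports; may be 0 if it cannot be
    // determined, so fall back to a safe default of one worker.
    unsigned count = std::thread::hardware_concurrency();
    if (count == 0) count = 1;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < count; ++i)
        workers.emplace_back([] { /* per-thread work goes here */ });
    for (auto& w : workers) w.join();
    return 0;
}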

The main parts of a thread are the program counter and other registers, the stack, and any accompanying thread-local data storage. The program counter keeps track of which part of the code is currently executing. The registers hold the current values the executing code is working with. The stack holds values that don't fit in the registers. The local data is program specific, but generally tracks data that is local to that thread. This local storage is implemented with a lookup table, managed by the OS, that lets each thread look up its assigned index to find where its data is kept. In short, it gives you variables that you can access as if they were globals, but that are in fact unique to each thread.
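
In C/C++ this per-thread lookup is exposed through thread-local variables. A small sketch of what "looks global but is per-thread" means (an assumed example, not from the article):

#include <iostream>
#include <thread>

// Declared like a global, but every thread gets its own copy,
// backed by the OS-managed thread-local storage described above.
thread_local int callCount = 0;

void work()
{
    ++callCount;                    // touches only this thread's copy
    std::cout << callCount << '\n'; // each thread prints 1
}

int main()
{
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    work();                         // the main thread has its own separate copy too
    return 0;
}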

Taking all of this into account, multi-threaded programming is simply using multiple threads in your program. Typically you do this so that your program executes faster than it would on a single thread. Another common use is to separate your UI from the rest of the program so that it always feels responsive. The difficulty lies in designing your systems in such a way that they can actually make use of multiple threads in these ways.

Why do we need it?

In short, we need it to make the most of the available resources. But the real question is, "Why do we need it now?", and that comes down to trends in computing.

Since the dawn of the CPU, hardware manufacturers have made computers faster year after year by raising clock rates. But a little less than a decade ago, consumer products topped out at around 4GHz. Much of this was due to electromigration: CPUs kept getting smaller, and as frequencies rose they gave off more and more heat. Consumer processors are only now approaching 5GHz again, which left years in which chip designers had to think laterally. One of the key innovations was putting more CPU cores onto a single chip. So you could no longer expect your game to run faster simply because of new hardware; you had to design the game so that it could run multiple threads across multiple cores.

One of the first places where multi-threaded computing forced games to adapt was the release of the Microsoft Xbox 360 and the PlayStation 3, for two similar but distinct reasons. The 360 had three cores with Hyper-Threading Technology (HTT), meaning it could conceivably run six threads in parallel. The PS3 had a single HTT core plus seven Synergistic Processing Elements (SPEs) in a new architecture called the Cell Broadband Engine. The 360 had unified memory, so every thread could access all of the memory. That made multi-threaded applications easy to implement, because you didn't have to worry about who could access what. On the PS3, each SPE had only a limited amount of memory it could use, which meant that as a developer you had to design your algorithms carefully around that.

Also, from a pure performance standpoint, the CPUs in these consoles were slower than their traditional desktop counterparts. While their clock speeds were similar, they were simpler chips that lacked many of the fancier techniques such as out-of-order execution. This forced developers to adopt multi-threaded strategies in order to make their games stand out.

In the desktop world, multi-core CPUs quickly became the standard, but developers were slow to adapt their programs to use the extra cores. Much of this comes down to two factors. First, the desktop world has a much broader range of hardware to support, so developers tend to design for the lowest common denominator, which meant supporting older systems that likely had only a single core. Second, multi-threaded programming is harder and requires a different approach. Add to that the fact that many game teams reuse technology, so taking advantage of multiple cores would mean rewriting a great deal of code. Graphics also tended to be the big bottleneck in games, and submitting everything for the GPU to draw was restricted to a serial sequence of API calls (typically DirectX or OpenGL). Only the latest versions of those APIs (released just a few years ago) have made it possible to really use all the cores available in a modern CPU. Now, with the latest generation of consoles upon us, developers have no excuse not to build top-tier games on heavily multi-threaded engines.

The next step is to look at how threads communicate with each other. This matters because it explains why so many of the pitfalls of multi-threaded programming exist. For simplicity I will stick to x86-64 based architectures, but in my experience this applies to pretty much all computing devices.

Reading from memory

Let's see what happens when a thread reads from memory. First, the processor requests a load from a specified memory address into a local register. Since we are only dealing with a single process, I will skip the details of virtual memory addresses and the TLB; just treat it as a physical memory address.

The request then hits the various data caches to see whether the address already exists in them. It is quite common for a modern multi-core chip to have three levels of cache. With each level of cache you get more storage available, but also longer access latency, and these later caches may even be shared by multiple cores. We'll take the worst case, where the address is not in any of the caches, which is known as a cache miss. If it had been, that would be a cache hit, and the data at that address would reach the core much faster.

[Image: 4960X-Core (from altdevblogaday) — here the L1 and L2 caches are specific to each core.]

So after the address has made it all the way to the computer's main memory, the data at that location begins its long trip back to the processor. Note that in the real world, with virtual memory addresses, the data could actually live on the hard drive, meaning we would have to wait for that slow device to locate what we need before we can access it. On the way back to our waiting core, each cache level is updated to store the data, so that any future access will be much, much faster.

Because every trip outside the caches is slow, a single read pulls in data around the requested address. The amount of data equals one cache line, typically 64 or 128 bytes. The cache works by dividing memory into aligned cache lines and storing those, but the details of how that happens are beyond the scope of this article. Note that with any shared cache, data pulled in by one thread becomes available to any other thread that needs to read it, so it can actually help to have multiple threads reading the same data. That said, since cache layout is specific to the processor type (and the core executing your thread can change unless you set the thread affinity correctly), this kind of optimization can be hard to implement in practice.
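
The flip side of cache-line granularity is false sharing: two threads writing to unrelated variables that happen to sit on the same cache line will keep invalidating each other's copy. A common mitigation is to align each hot variable to its own line; a hedged C++ sketch (the struct and names are my own):

#include <atomic>
#include <thread>

// Without the alignas(64), these two counters would likely share one
// 64-byte cache line, so the two threads would fight over it even
// though they never touch the same variable.
struct Counters
{
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

int main()
{
    Counters c;
    std::thread t1([&c] { for (int i = 0; i < 1000000; ++i) ++c.a; });
    std::thread t2([&c] { for (int i = 0; i < 1000000; ++i) ++c.b; });
    t1.join();
    t2.join();
    return 0;
}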

[Image: memory hierarchy (from altdevblogaday) — I stands for Instruction (i.e. code), D for Data.]

Writing to memory

A program is pretty useless if it never writes anything back to memory, so let me briefly cover what happens when a thread does just that.

It starts by specifying what data to write and which memory address to write it to, just like the read earlier. The core executes this write instruction (typically referred to as a store), which is put on the memory bus heading to the caches. The caches snoop the bus to see which address is being written and update their entries accordingly. Only when the cache line holding the updated data needs to be replaced does the data finally make it to main memory. Fortunately, the thread that executed the write doesn't have to wait for it to complete before moving on to the next instruction.

One thing to keep in mind is that on modern out-of-order CPUs (and with compiler optimizations), the order you write code in is not necessarily the order it executes in. There are guarantees that make the code behave functionally as you wrote it, but those guarantees do not extend to other threads. This reordering compensates for the delays caused by data traveling to and from the cache. You can insert memory barriers (also known as memory fences) to signal to the compiler and the processor not to reorder across them. This becomes important once multiple threads are in the mix, because we will need to coordinate their actions and enforce the order in which certain steps happen. Also note that you should not confuse memory barriers with the keyword volatile, which, as far as multi-threading in C/C++ goes, does not help you accomplish this.
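
In C++ the portable way to get this kind of fence is through std::atomic and its memory orderings rather than volatile. A minimal sketch of the classic flag-and-payload handshake (my own illustration, not code from the article):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // barrier: payload is published before the flag
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // pairs with the release above
        ;                                          // spin until the flag is set
    assert(payload == 42);                         // guaranteed to observe the payload
}

int main()
{
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
    return 0;
}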

When multiple threads write to the same data location, whoever executes last is typically the winner (the actual behavior may depend on the underlying hardware). But since threads are scheduled by the operating system, it is nearly impossible to guarantee which one that will be. So you have to be extremely careful whenever multiple threads write to the same area.
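
A classic way to see the danger is the lost-update problem: two threads incrementing the same plain counter will almost always end up below the expected total, because their read-modify-write steps interleave. A sketch of the broken version (deliberately racy, my own illustration):

#include <iostream>
#include <thread>

int counter = 0; // plain int: this is a data race, shown only to illustrate the problem

void bump()
{
    for (int i = 0; i < 100000; ++i)
        ++counter; // read, add, write back; another thread's write can be overwritten
}

int main()
{
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    std::cout << counter << '\n'; // almost always prints less than 200000
    return 0;
}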

Fortunately, we have a few tools that help us deal with this, which I'll cover next.

Atomic operations

Finally, we come to a vital piece of inter-thread communication: atomic operations. As I just said, when multiple threads operate on the same data, guaranteeing the order of operations is nearly impossible. Even if one thread is executing ahead of another, it can be preempted and fall behind, and this can happen at any time, completely outside your control. On top of that, a thread can stall waiting on a read or write (say one thread hits the cache and another misses). So without some synchronization ability you can never rely on one thread being first, or on multiple threads being organized in any particular way. Atomic operations fill this important role. They are implemented directly on the CPU as operations that cannot be interrupted (performing multiple steps in a single instruction, within specific constraints), so they behave serially regardless of interference from other threads or the operating system.

The fundamental atomic operation is the compare-and-swap. As the name implies, it compares the data before swapping it with different data. That way you know you are operating on data that has not changed out from under you (another thread could have come along and beaten you to the punch).

Let's look at a simple example of an increment using some pseudocode. Assume we have a compare-and-swap function called CAS that returns a boolean indicating whether it succeeded:

// our function signature
bool CAS(void* AddressToWrite, int CompareValue, int SwapValue);

static int x = 0;
local int y = x;
while (CAS(&x, y, y + 1) == false)
{
    y = x; // fetch the new value
}

Assuming this is in a function that can be called by multiple threads, we have to guard against the possibility that some other thread changes the value of x between the time we read it into our local variable and the time we increment it. If that happens, we read the new value and try again, keeping in mind that we may fall behind yet again because of some other thread. Eventually, though, we should succeed, unless we are in a situation where other threads are continuously hammering this section of code, which in a normal program is next to impossible.
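
For reference, real C++ code would normally express this loop with std::atomic and compare_exchange_weak rather than a hand-rolled CAS; a sketch under that assumption:

#include <atomic>

std::atomic<int> x{0};

void increment()
{
    int y = x.load();
    // compare_exchange_weak writes y + 1 only if x still equals y;
    // on failure it reloads the current value of x into y, so we just retry.
    while (!x.compare_exchange_weak(y, y + 1))
    {
        // y already holds the latest value of x here
    }
    // For a plain increment you would simply use x.fetch_add(1) in practice.
}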

Still using our compare-and-swap function, we can implement a simple mutex lock. We keep a variable that acts as the lock value. To acquire the lock we check whether that value is 0 and set it to 1, and we release the lock in a similar but opposite manner. Some pseudocode for both is below.

Lock:

static int lock = 0;
while (CAS(&lock, 0, 1) == false);

Unlock:

// technically this should always succeed assuming
// we successfully locked it in the first place
while (CAS(&lock, 1, 0) == false);
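
In modern C++ the same spinlock idea is usually written with std::atomic_flag, whose test_and_set is itself an atomic read-modify-write. A hedged sketch (the class name is my own):

#include <atomic>

class SpinLock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // test_and_set atomically sets the flag and returns its previous value;
        // keep spinning until we are the thread that flipped it from clear to set.
        while (flag.test_and_set(std::memory_order_acquire))
            ;
    }
    void unlock()
    {
        flag.clear(std::memory_order_release); // release so the protected writes are visible
    }
};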

This article was translated and compiled by gamerboom.com; please retain this notice when reposting, and contact gamerboom for reprint permission. (Source: altdevblogaday)


