12th November, 2009: Re-wrote the article to explain even better the nitty-gritty details. Thanks everyone for the feedback.
10th June, 2011: After feedback from a reader, I updated the "full queue" image. A typo in the corresponding text was also corrected.

Introduction

I attended Herb Sutter's Effective Concurrency seminar (Europe, 2009), a very inspiring seminar on programming for multi-core targets. One of the topics we worked through was the dangers of lock-free programming and the hazards of volatile in C++.

I got curious, since I've seen a couple of simple single producer -> single consumer Circular Queue implementations that claim to be lock-free using nothing more than that: the important state variables declared volatile. So how come they work (some have been in use for years), or do they?

Volatile, as you might know, is not intended for multi-threading but for accessing hardware devices (among other things). It turns out that volatile plays a minor role here, and that it is mostly the compiler and the computer platform that decide whether or not this kind of Circular Queue is safe to use.

I have implemented a thread-safe circular queue in C++, and in this article I explain when it works and when it will not. This is an area outside the scope of the C++ standard (until C++0x), and the implementation is suitable for (at least) x86 and x64 platforms.

For those of you who want to read up on the inner workings of the Circular Queue, I have provided a short description of it and my implementation. For those of you who are already familiar with it and just want to know if this thread safe circular queue is for real, just go straight to the Implementation section.

Background

If an application performs multiple tasks simultaneously, multi-threading can be useful. In this particular example, I show a circular FIFO queue that can be used for communication between at most two threads. This scenario is often referred to as the single producer -> single consumer problem.

Example: a high-priority task (GUI, data recorder, etc.) that must not be delayed by lengthy operations, but requires those operations to finish in the same order as they are started.

The solution: delegate the time-consuming jobs to a worker thread. Using a FIFO queue makes it possible for the delegation and processing of tasks to be done asynchronously, i.e., the producer and consumer each work in their own thread, so reads (remove) can be concurrent with writes (add). The FIFO property guarantees that delegated tasks are processed in the same order as they were added.

The producer can add new tasks to the queue as long as the queue is not full, and the consumer can remove tasks from the queue as long as the queue is not empty.

It's worth mentioning that Circular Queues are commonly used when:

  • Memory is limited, or no dynamic memory allocation is available
  • The same memory is re-used, with no need to shuffle objects about as they are consumed
  • The buffer should overwrite old data if the consumer cannot keep up, etc.

But all of these examples are outside the scope of this article.

Text and Code Disclaimer

I decided to keep this example simple and not go into all the nitty-gritty details that an optimal solution might have. This is why any writes to a full queue or any reads from an empty queue are just plainly ignored. For the same reason, the queue is not optimized to avoid false sharing.

It can be argued that under some conditions, like in my "test_of_fifo_queue", the queue will suffer from false sharing if the two threads run on different cores. Under other conditions than in my silly test, however, this is not necessarily the case. Whether there will be false sharing depends on what is stored within the array (integers, pointers or whole objects) and on whether those elements end up on the same cache line that the tail and head indexes are accessing.

If there is false sharing, it is not necessarily a very bad thing. It all depends on the frequency of the false sharing, i.e., how often the consumer/producer threads try to access the same cache line.

Implementation Overview

The simple circular queue (a.k.a. circular/ring buffer) is ideal for explaining the issues involved in lock-free programming. It is limited in scope, but still very useful in real-world practice. I have implemented one to show some of those issues, and I will try to describe why and when it works, and why and when it will not.

The circular-buffer implementations I have seen previously used a simple C-style array to store the data as raw bytes (via memcpy), with no regard to class type (when storing objects). At unpacking, the right bytes are copied out and reinterpreted as the correct type. I decided to take a different approach: in my example, the size of the Circular Queue and the type stored within it are given as template arguments.

template<typename Element, unsigned int Size>
class CircularFifo {
public:
   enum {Capacity = Size+1};

   CircularFifo() : tail(0), head(0){}
   virtual ~CircularFifo() {}

   bool push(Element& item_);
   bool pop(Element& item_);

   bool isEmpty() const;
   bool isFull() const;

private:
   volatile unsigned int tail; // input index
   volatile Element array[Capacity];
   volatile unsigned int head; // output index

   unsigned int increment(unsigned int idx_) const;
}; 

Notice that the tail and head are on opposite sides of the array. This gives no guarantees, but at least a sporting chance that the tail and head indexes end up on different cache lines; it all depends on the array size. Padding can always be added to be sure.

The properties that decide the state of the queue are the array with Elements, the Size of the array and the indexes for current tail and head of the queue.

The Capacity of the array is one larger than the specified Size. This ensures a slack of one slot, which makes it easy to distinguish a full buffer from an empty buffer.

// Remember these?
enum {Capacity = Size+1};         // 1 extra for slack
volatile Element array[Capacity]; // gives easier full and empty determination

Circular FIFO - How It Works

Below I have described graphically some writes (push) by the producer thread and some reads (pops) by the consumer thread.

When the buffer is empty, both indexes will be the same. Any reads by the consumer will fail.

Empty Queue

Example of an empty queue
/// Consumer check for an empty queue
if(head == tail)
    return false; // queue is empty

The Producer adds (push) a new item at the position indexed by the tail. After writing, the tail is incremented one step (or wrapped to the beginning if at end of the queue).

Queue with one item

The Producer adds another item and the tail is incremented again.

Queue with two items

The Consumer reads (pop) the item indexed by the head. The head is incremented one step.

Queue after pop of item1

If the producer pushes more items onto the queue than the consumer can keep up with, the queue will eventually become full.

Full Queue

When the buffer is full, there will be a one-slot difference between head and tail. At this point, any writes by the producer will fail (otherwise the empty-queue criterion, i.e., head == tail, would become true).

/// Verify that queue is not full
if(((tail+1) % Capacity) != head)
{
   ... // do stuff, we're not full
}

Implementation

Thread safety is ensured through the class design. Only two functions can modify the state of the object: push() and pop(), and each is used by a single thread only. The producer thread should only call push() and the consumer thread only pop().

The design could be further enforced by exposing the two operations through separate interfaces, but I think it is enough to have clarified this in the comments.

Using more producer or consumer threads breaks this design and the queue can become corrupt. If more threads are wanted, then a different solution should be used, e.g., a queue with write/read locks.

push() changes only the tail, but verifies that the queue is not full (a check of head), and pop() changes only the head, but verifies that the queue is not empty (a check of tail). Only the producer thread updates tail and only the consumer thread updates head. Thus, thread safety for updating the indexes is ensured by the design.

However, reading and writing of both tail and head must be thread safe in the sense that we must never read stale, cached values, and the writes must be atomic.

Important: See the text further down regarding this. It states how atomic, non-reordered reads and writes are achieved for this Circular Queue. It is probably the most important part of this article.

/** Producer only: updates tail index after writing*/
template<typename Element, unsigned int Size>
bool CircularFifo<Element, Size>::push(Element& item_)
{
   unsigned int nextTail = increment(tail);
   if(nextTail != head)
   {
      array[tail] = item_;
      tail = nextTail;
      return true;
   }

   // queue was full
   return false;
}

/** Consumer only: updates head index after reading */
template<typename Element, unsigned int Size>
bool CircularFifo<Element, Size>::pop(Element& item_)
{
   if(head == tail)
      return false;  // empty queue

   item_ = array[head];
   head = increment(head);
   return true;
}

Improvements and Code Notes

This code is minimalistic and made to show how some C++ lock-free circular queues are made. It is in no way optimized and should be considered as an example and not as a fully packaged and finished solution to fit all needs. In any case, this queue can be a good starting point if you need lock-free circular queue between two threads.

Making It Work

Remember that for a multiple-producer or multiple consumer scenario, this code will not work.

First, it is imperative that the head and tail indexes are updated only after the element is written (push()) or read (pop()). That way, any update to head or tail guarantees that the corresponding element access has completed. For this to work, the reads and writes of head and tail must be atomic and must not be reordered.

Atomic

An atomic operation is an operation that is guaranteed to finish and not be interrupted half-way by another thread.

On almost all modern processors, reads and writes of naturally aligned native types are atomic. This means that as long as the memory bus is at least as wide as the type being read or written, the CPU performs the read or write in a single bus transaction. This makes it impossible for other threads to see it half-completed.

For x86 and x64 platforms, types larger than 8 bytes are not guaranteed to be atomic. In our example, the head and tail indexes are naturally aligned unsigned integers of 32 bits (4 bytes) on typical compilers for both 32-bit and 64-bit targets, which makes reading and writing them atomic.

 // Example of integers
 counter++;                // 1. not atomic: a read-modify-write (three operations)
 counter = 0;              // 2. this write is atomic
 other_variable = counter; // 3. this read is atomic

For fun, you can compile the operations above to assembly language or use the debugger to view the disassembly. You should then see that the actual store only takes one instruction.

This holds on multi-core CPUs as well. A word in memory is kept coherent between all cores when it is written (with one CPU instruction); it is simply not possible for a 4-byte aligned write (to use the 32-bit example) to be observed half-done by another core, since the word is written by a single instruction.

In short: reads and writes of tail and head are atomic. The First requirement is still OK.

Reordering

Second, the reads and writes of the head and tail must not be reordered. Reordering can be done both by compilers and by processors, and can affect both instructions and read/write operations. For the Circular Queue to be safe, this must not happen to our volatile reads and writes (or else First will fail).

Compiler Reordering

Compiler rearrangement is dealt with in this particular case by our use of volatile. The C++ standard specifies that reads of volatile variables cannot be cached and that writes to them cannot be delayed. It also specifies that volatile reads and writes cannot be moved past each other.

To quote hax_:

"'volatile' variable access is also code rearrangement barrier: no lines can be moved up or down through it (you can treat every access to volatile variable as horizontal barrier in code editor)".

Normally this would not be enough for multi-threaded programming, since the compiler can reorder non-volatile reads/writes relative to volatile ones. But for this Circular Queue, every access is a volatile read or write: either the indexes themselves are read or written, or the array, which is declared volatile, is accessed. Thus this type of queue is safe from compiler reordering that could break First.

CPU Reordering

x86 and x64 CPUs give strong memory-ordering guarantees. They do not reorder write operations relative to other writes, nor read operations relative to other reads. The exceptions to these rules are some special write and read operations that do not apply here. Important to know is that writes cannot move ahead of reads, but reads can move ahead of writes.

Reorder Summary

CPU Reordering Summary for x86 and x64 Platforms

  1. Reads moving ahead of other reads: No
  2. Writes moving ahead of other writes: No
  3. Writes moving ahead of reads: No
  4. Reads moving ahead of writes: Yes

If writes could move ahead of reads, then First would be broken. But as it stands now, First is safe also from CPU reordering.

Making It Work: Conclusion

All of this long text boils down to the following: the Circular Queue's accesses to tail, head and the array are not allowed to be reordered. Thus we can be sure that each atomic update of an index always happens after the corresponding element has been stored (push()) or read (pop()).

Even when the consumer and producer threads run on different CPU cores, they will see the latest changes, not stale cached values or values scrambled by reordering (given the x86/x64 guarantees described above).

What's Wrong with this Code?

From a practical, real-world look at the x86 and x64 computer architectures, this code will work. Using it on another platform, say PowerPC, could break the Circular Queue (the PowerPC memory model allows reorderings that x86/x64 do not).

A C++ purist is probably enraged by now by the use of volatile. :p

Disregarding that, it is also important to know that the C++ standard does not guarantee atomicity for integers (atomicity holds here nonetheless, under the constraints laid out earlier). With the coming of C++0x, things will be much brighter and some confusion can be avoided once we can start using atomic<T>.

Remember also that multi-threaded C++ code is NOT platform independent, even if it may appear so through whatever threading API you use (Qt, Boost, etc.). The same goes for this circular FIFO example: it is not platform independent.

On a side note, I can mention that Microsoft has given extra responsibility to volatile on VC++ 2005 and newer. See here for details.

Finally ...

What I learned when working with multi-threaded software is that lock-free is anything but easy. If nothing else, there is so much conflicting information about what you can and cannot do that it could confuse anyone.

One view on this is what the C++ standard guarantees. A whole other view is the practical look at what we actually get from today's compilers and computer architectures, and how they affect things. Lock-free code in C++ is (until C++0x) outside the scope of the language standard and more an issue of the compiler, CPU, chipset and memory controller.

Doing a simple FIFO like in this example is one thing; anything more complex than one reader and one writer is a whole different issue. If you feel inclined to attempt something more advanced, like a lock-free linked list, perhaps for multiple readers/writers (yes, as simple as it sounds, it is not), then please don't...

I agree with Herb Sutter: this is a difficult topic with lots of pitfalls. Even when C++0x is available, more advanced lock-free structures should probably be avoided. If you decide to roll your own lock-free structure, use the tools available on your platform (like Microsoft's Interlocked** functions).

Last, if you have any doubts: use locks instead of lock-free. Sure, it might introduce some overhead, and it will be platform dependent, but then again, nothing is platform independent when using C++ and threads.

References

  1. [C++ atomic int]
  2. Lockless Programming Considerations for Xbox 360 and Microsoft Windows
  3. Another example of a C++ FIFO that I discovered when I was already finished.
    It had some nice discussion at the end about this very topic.
  4. [Herb Sutter] Volatile vs Volatile
  5. Wikipedia about Circular Buffer
  6. Another Dr. Dobbs Circular Queue example
  7. A visual explanation of circular queue
  8. Wikipedia: More about lock-free queues and atomicity
  9. Not related to C++, but maybe still interesting. A C# implementation

Other resources that I linked to, or that might be of interest:

  1. Lock-Free Code: A false sense of security
  2. Wikipedia - FIFO
  3. Wikipedia - Volatile
  4. Wikipedia - Producer, Consumer problem
  5. False Sharing
  6. "What every programmer should know about memory".
    A dry document, but especially chapter 6 (What Programmers can do) is worth reading.

History

  • 3rd November, 2009
    • Initial version
  • 4th November, 2009
    • Article updated - corrected sentences and highlighted keywords
  • 12th November, 2009
    • Article updated - fixed pictures and code display problems
    • Updated text after feedback from readers
  • 10th June, 2011
    • Updated "full queue" image and corresponding text after feedback from a reader