Friday, November 8, 2013

More C++11 threading: I want my lock-free !

Last time, I tried to describe the changes in C++11 that impact multi-threading. Now, I'll talk about implementing lock-free data structures.

Sharing Memory


Before talking about lock-free, let me refresh your memory about the most important issue when doing parallel code: concurrent memory access.

If two (or more) threads share (i.e. use) a common memory location, like a counter or the entry point of a data-structure, each time they access that shared location, conflicts can arise.

The most striking example is the shared counter problem: some threads share a common integer (our shared counter) and try to simply increment it. That's it, they just want to do a simple ++i !

Even with this simple example, you'll get bad behavior. To convince you, I've got some interesting results: I ran a bunch of threads (from 2 to 256), each one running a simple loop (with exactly 2²² rounds) that does a simple ++i, with i a shared counter. Normally the final value should be number_of_threads * 2²², so we can compute a kind of error measure: (expected_value - real_value)/expected_value (in percent.) Here are the results:


Number of threads    Error
2                    49.24%
4                    68.99%
8                    84.21%
256                  89.87%
This test was run on an Intel Core i7-3770 (so 4 cores and 2 hyperthreads per core.)

Impressive, isn't it ?

How can a simple increment go so wrong ? It's quite simple: to increment your shared variable you've got at least two operations (from the memory point of view): in the first one you load the value at the given location; in the second one, you store the newly computed value (based on the loaded value) back in the memory location. The issue is that between your load and your store, some other threads may have modified this value and you won't see it.

To solve that kind of issue, the usual solution is to use a lock (for this particular example, you can now simply use atomic types.)
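For the record, here is a minimal sketch of that counter experiment (my own illustration, not the exact benchmark code): the plain int version loses updates while the std::atomic<int> one gives the expected count.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int              racy_counter = 0;   // unprotected shared counter
std::atomic<int> safe_counter{0};    // atomic shared counter

void worker(int rounds) {
  for (int i = 0; i < rounds; ++i) {
    ++racy_counter;                  // load + store: updates can be lost
    ++safe_counter;                  // atomic read-modify-write: no lost update
  }
}

int main() {
  const int rounds = 1 << 22;
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(worker, rounds);
  for (auto& t : threads)
    t.join();
  std::cout << "racy: " << racy_counter
            << "  atomic: " << safe_counter << std::endl;
}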

When dealing with a data structure (lists, queues, trees … ) protecting access is even more critical since, not only may you lose information, but you can also completely break the data structure.

Lock free ?


« OK, I got your point about protecting memory access, so what's that lock-free stuff ? »

Locks (any kind of lock) are probably the first kind of operation that you want to avoid.

When a thread holds a lock, all other threads trying to obtain the same lock are blocked until the owner releases it. That's the expected purpose. But those threads stay blocked even when the owner is not active.

The problem really arises when the number of threads or processes on the system grows. Once again, it's quite simple: if the thread holding the lock is scheduled out, all other threads waiting for the lock stay blocked while the owner of the lock isn't even using the data !

So, intuitively, a lock-free algorithm provides a global progress property: there is always an active thread making progress.

« I've heard about wait-free algorithm, is that the same thing ? »

No … here are some definitions:

  • lock-free: an algorithm (or the operation on a data-structure) is lock-free if « the suspension of one or more threads will not stop the potential progress of the remaining threads. »
  • wait-free: an algorithm is wait-free if « every operation has a bound on the number of steps the algorithm will take before the operation completes. »
In a lock-free algorithm, the system (that is, our set of threads) as a whole makes progress, while in a wait-free algorithm, every thread makes progress.

What do we need ?


Let me guess … brain grease ?

There are a lot of results about the primitives needed to implement lock-free or wait-free operations. The main result is that only one basic mechanism is needed, and two are available for that job:
  • Atomic Compare&Swap (CAS)
  • Load-Link/Store-Conditional (ll/sc)
A CAS belongs to the family of read-modify-write operations, but while most operations of this family don't offer a better consensus number than 2 (they can only be used to provide a lock-free algorithm for 2 processes), the CAS has an infinite consensus number. This operation is quite simple:

bool CAS(int *mem, int value, int cmp) {
  if (*mem == cmp) {
    *mem = value;
    return true;
  }
  return false;
}

On the other hand, ll/sc is a pair of operations: the first one (load-link) loads the content of a shared memory location and keeps track of that load, then the store-conditional updates the memory location, but only if it hasn't changed in between.

The ll/sc mechanism is, theoretically, more powerful than a simple CAS. But that's theory only. For various reasons, most (all) concrete ll/sc implementations miss the theoretical semantics, breaking a lot of algorithms.

Since implementing CAS with ll/sc is simple (even with a broken ll/sc), most algorithms prefer CAS.
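Just to fix ideas, here is a rough pseudo-code sketch of a CAS built on top of ll/sc, in the same style as the CAS above (load_linked and store_conditional are placeholder names for the hardware instructions, not a real API):

bool CAS(int *mem, int value, int cmp) {
  do {
    int old = load_linked(mem);    // load and start tracking the location
    if (old != cmp)
      return false;                // the value differs: fail like a normal CAS
  } while (!store_conditional(mem, value)); // retry on spurious sc failures
  return true;
}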

How do we use CAS ?


So, we found that the CAS operation will be our primitive. It is available among the C++11 atomic operations, so we can use it (if you take a look at that article, you'll find a template providing double-word compare and swap for x86 and x86-64 using inline ASM.)

Here are some examples using CAS to provide fetch&add and to implement a simple spin-lock.

First the fetch&add (like operator += )
template <typename T>
T fetch_and_add(std::atomic<T>& x, T value) {
  T             tmp = x.load();
  while (!x.compare_exchange_weak(tmp, tmp + value)) {}
  return tmp + value;
}
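A possible usage (hypothetical snippet):

std::atomic<int> counter{0};
int after = fetch_and_add(counter, 5); // atomically adds 5 and returns the new value (here 5)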

And the simple spin-lock class:

class mylock {
public:

  struct guard {
    guard(mylock& _lock) : lock(_lock) {
      lock.lock();
    }
    ~guard() {
      lock.unlock();
    }
    mylock&           lock;
  };

  mylock(): _on(false) {}
  mylock(const mylock& x) = delete;
  void lock() {
    // C++11 CAS needs a reference to the expected value
    bool                tmp = false;
    while (!_on.compare_exchange_weak(tmp, true)) {
      tmp = false; // needed since C++11 CAS stores the observed value back in tmp
      std::this_thread::yield(); // don't waste cycles
    }
  }

  void unlock() {
    _on = false; // ok, plain atomic store
  }

  bool try_lock() {
    bool                tmp = false;
    return _on.compare_exchange_weak(tmp, true);
  }

private:
  // we use a CAS here for the demo, but C++11 provides atomic_flag which would
  // be more appropriate here.
  std::atomic<bool>     _on;
};
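And a possible way to use it (hypothetical snippet): the guard takes the lock in its constructor and releases it in its destructor, RAII style.

mylock lock;

void critical_section() {
  mylock::guard g(lock);   // takes the lock
  // ... touch the shared data ...
}                          // the destructor releases the lock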

So as you can see, most of the time we use a kind of retry-loop scheme to achieve our goal.

You may have noticed that I used compare_exchange_weak() in my code. What is that weak all about ? According to the documentation, the weak version may fail for spurious reasons but is less expensive in a retry-loop than the strong version. That's probably a matter of lower-level implementation, and since I'm using a loop, I'll use the weak version.
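Outside of a retry-loop, a spurious failure would be a real failure, so the strong version is the one to pick. A small sketch (publish-once pattern, the Node type and the names are mine):

struct Node;                         // whatever your nodes are
std::atomic<Node*> slot{nullptr};

bool publish_once(Node *n) {
  Node *expected = nullptr;
  // single attempt, no loop around it: a spurious failure is not acceptable,
  // so we use the strong version here
  return slot.compare_exchange_strong(expected, n);
}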

Lock Free Queue ?


So, how do I implement a lock free queue ?

There are a few main ideas behind that kind of data structure: being able to fail and retry (like a transaction), finding the linearization points and trying to concentrate global structure modifications into one (or eventually two) atomic operations, and finally, finding a way to finish other threads' work.

A linearization point is where the effects of an operation appear to happen for observers (other threads.)

For example, suppose you have a pointer to the first element of a linked list and want to delete that first element. We want something like this (suppose Head is a shared global entry point of some list):

tmp = Head;     // retrieve the head pointer
Head = tmp->next; // cut-off the head
... // do what you want with it

The main issue in a concurrent context is that Head and Head->next may change while we're using them. Also note that the linearization point is when I switch the head with its successor. This operation must be atomic and thus it needs a CAS.

Let's Build That Queue


So, the global idea is to fetch the pointers we need and try to switch the head; if it fails (i.e. the head has changed since our previous read) we need to retry. Here is a possible implementation (pseudo code):

do {
  tmp = Head;       // fetch the head
  next = tmp->next; // fetch the next element
  // be sure our pointer is still accurate
  if (tmp != Head) continue;
  // get the content
  // switch the head and the next element
  if (CAS(&Head, next, tmp)) break;
  // if CAS fails, try again !
} while (true);

This example doesn't manage the empty-list case and some other details. To avoid issues when dealing with an empty list or a list with one element (there are more issues with one element than with an empty list), we use a sentinel: the head of the list is always a dummy node, only the second one matters. Here is a complete version of a pop operation (C++11 version):

struct List {
  struct node {
    int                 value;
    std::atomic<node*>  next;
    node() { next = 0; }
    node(int x) : value(x), next(0) {}
  };

  std::atomic<node*>    Head, Tail;

  List() {
    Head = new node();
    Tail = Head.load();
  }

  bool pop(int *lvalue) {
    node       *head, *tail, *next;
    do {
      // acquire pointers
      head = Head.load();
      tail = Tail.load();
      next = head->next.load();
      // check if we're coherent
      if (Head.load() != head) continue;
      if (next == 0) return false; // empty queue
      if (head == tail) {
        // missing completion on tail (see later)
        // do the job of an unfinished push
        Tail.compare_exchange_weak(tail, next);
        continue;
      }
      *lvalue = next->value; // copy the value
      // done ? try to cut the head off
      if (Head.compare_exchange_weak(head, next)) break;
      // fails ? gotta retry !
    } while (true);
    return true;
  }

  void push(int x); // see later
};

Next step is the push operation. While for the pop operation we only have one modification of the list (replacing the head with its next), for the push operation we need two modifications: changing the next of the tail element and then replacing the tail with the new element.

A thread can arrive on the queue with the next field of the tail element not null (that is: someone has added an element but hasn't updated the tail pointer yet.) In that case, the current thread can try to finish the operation. We've already done that in the pop operation: when the head and the tail are equal but the next pointer is not null, the popping thread can do the CAS itself to update the tail.

An interesting point with this CAS is that if the original pushing thread does the update itself between the test and the CAS of the other thread (the one that wants to finish the job), then, since the tail pointer has already been updated, the CAS will fail silently without any consequences !

To summarize: when pushing, you build your new list cell and try to attach it to the last cell (the tail) of the list; once done, you can replace the tail pointer with the pointer to the new cell. During an update (push or pop), if a thread finds the queue in an intermediate state, it can do the update itself.

Here is the code:

void List::push(int x) {
  node         *new_node = new node(x);
  node         *tail, *next;
  do {
    tail = Tail.load();
    if (Tail.load() != tail) continue;
    next = tail->next.load();
    if (next != 0) {
      // missing completion on tail
      Tail.compare_exchange_weak(tail, next);
      continue;
    }
    // update the next of the tail
    if (tail->next.compare_exchange_weak(next, new_node)) break;
  } while (true);
  // finally update the tail (may fail, but we don't care)
  Tail.compare_exchange_weak(tail, new_node);
}

Note that the final CAS can fail but, if it does, the only reason is that someone has already done the job for us, so we don't need to care !

Memory Issues


In the previous example we avoided the question of releasing/recycling memory, and for a good reason.

Why ?

You can delete a list cell in the pop operation once you have swapped the head with its next pointer (at the end of the function, after the loop.) But are you sure that you can safely delete it ?

What can go wrong ? Some other thread may hold the same pointer and try to use it.

Once again, it's a question of timing: a thread trying to take an element from the queue will keep and use the pointer to the head in order to access its next field, so if the corresponding node has been freed, this access may fail.

A lot of lock-free implementations are done in Java, where there's a garbage collector: a node won't be freed as long as some thread still holds a reference to it. But in C++ there's no such thing …

A possible solution to avoid segfaults is to use a recycling allocator: cells are not deleted but reused as new nodes. Thus, accessing a node will always make sense and won't fault.

Are we safe now ? No, another issue arises with recycling: the ABA problem.

ABA ?

No, we're not talking about Swedish pop music. The ABA issue is the weak point of the CAS operation: observing only whether a pointer has changed may not be enough.

In brief: a first thread gets the head of the queue, finds a pointer A and gets scheduled out; during its inactivity phase, another thread takes the first element of the queue and the head pointer becomes B, but some other changes occur and finally the head pointer becomes A again (though its content has changed.) The first thread won't see that change: it has stored the head and its next, and the stored next is now stale, but the CAS won't fail and we lose data coherency !

(figure: ABA issue in a stack)
There are various solutions to the ABA problem:

  1. Use a garbage collector (like in Java)
  2. Use the so-called IBM solution with tagged pointers
  3. Use Hazard Pointers
The GC solution is still not available in C++ and, in fact, it's not really interesting. Let's see the two others.

Tagged Pointers


The idea of tagged pointers is to associate a counter with our pointer and to manipulate the pair (stored contiguously) as a whole value. The counter is not a reference counter: it's not really possible to implement an efficient ref-count in a lock-free way. Our counter here is just a kind of tag that we increase each time we use the pointer. So when comparing the pointer/counter pair, even if the pointer has the same value, if it has been updated the counter will have a different value and the CAS will fail.

This technique is quite simple and efficient, but it requires a double-word CAS, that is a CAS that works on a single location spanning two contiguous words (not to be confused with a double CAS, which works like two CAS on two different locations.) Such an operation exists on most 32-bit processors but is only available on modern x86 processors with the 64-bit extension. C++11 provides 16-, 32- and 64-bit CAS for atomic types but no 128-bit version.
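To give an idea of the shape of the thing, here is a pseudo-code sketch of a tagged pointer (dwcas() is a placeholder for a 128-bit CAS primitive, following the same argument order as the CAS above; this is not the code from the cited paper, and empty-queue checks are omitted):

struct tagged_ptr {
  node      *ptr;   // the actual pointer
  uintptr_t  tag;   // bumped on every update of the slot
};

tagged_ptr Head;    // manipulated as a whole, two contiguous words

tagged_ptr old_head, new_head;
do {
  old_head = Head;                    // read pointer and tag together
  new_head.ptr = old_head.ptr->next;  // candidate new head
  new_head.tag = old_head.tag + 1;    // even if ptr is recycled back to the same
                                      // address, the tag differs and the CAS fails
} while (!dwcas(&Head, new_head, old_head));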

This method is used in the original article describing the lock-free queue: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms

I've also written an article explaining how to use the cmpxchg16b instruction using C++ templates and inline ASM: Implementing generic double-word compare and swap for x86/x86-64

Hazard Pointers


Hazard Pointers is a technique described in another article from Maged Michael (one of the authors of the work on lock-free queues): Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (pdf)

It's like a simplified GC: each thread has some hazard pointer variables where it stores pointers that need to be preserved, and when some thread tries to delete a pointer, it first verifies that this pointer is not in use (i.e. not stored in a hazard pointer variable.)

Hazard pointer variables (HP) are in fact shared variables that are written by one thread and readable by all others. When a thread wants to delete a pointer, it first adds it to a wait-list; when this list reaches a threshold, the thread retrieves all the pointers currently stored in HPs and only deletes those that are not in use.
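The reader side looks roughly like this (a sketch of the publish-and-recheck pattern; hp_set() and hp_clear() are hypothetical helpers, not the API of the implementation linked below):

node *head;
do {
  head = Head.load();
  hp_set(0, head);              // publish: "I'm currently using this node"
} while (Head.load() != head);  // re-check: head may have been retired in between
// from here, other threads will see head in our hazard pointer
// and won't free it until we call hp_clear(0)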

Despite some tricky details about managing the attribution of HPs, the implementation is quite simple, it is wait-free, and when choosing the right data structures it can be pretty fast. In practice, I've seen no performance difference with the IBM technique (tagged pointers.)

I've got a full implementation available here: hazard_pointer.hh

More ?


So that's just a beginning. I've started playing around with programming all those data structures (available in a beta version) and you can take a look at them in my bitbucket repos: https://bitbucket.org/marwan_burelle/lockfreeexperiment

Of course there's a lot of information on the net, but start from the two papers already cited (lock-free queues and hazard pointers.)

My next step will be to implement wait-free data structures using ideas presented in recent works that announce very good performance, but that's another story …

Tuesday, May 14, 2013

Playing With C++11 Multi-Threading Stuff

So, you may be interested in what I'm doing in my spare time ? I'm playing with concurrent-friendly data structures. The idea is to implement in C++ data structures that are available to Java programmers in the Java framework (and some newer ones if possible.)

It started when I was searching for algorithmic examples for a parallel programming lecture. I found the classic article from Michael & Scott published in PODC96: Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.

This paper presents two queue algorithms: a blocking one with two locks and a non-blocking one (also called lock-free.) So I wanted to implement both algorithms in order to explain them to the students.

The blocking algorithm is very simple; in fact it belongs to that kind of algorithm that seems obvious after reading it. It echoed some ideas I had tried by myself and solved all the issues I was thinking of.

The non-blocking algorithm required more time. In fact, I was blocked by the fact that my programming language (C) doesn't provide a compare-and-swap operation (I finally found what I was looking for in the gcc intrinsics, only to find another issue in the 64-bit version.)

I finally managed to solve that question by mixing inline ASM and C++ templates in order to have a working double-word compare-and-swap operation. You can find the whole story in this article published on the LSE's blog.

And then I started reading about the C11 and C++11 stuff, and found that it had become the right time to explore both the new elements in the standards and all those funny data structures.

But let's start with C++11; we'll talk about lock-free later.

What's the deal with all that stuff on the memory model ?

In fact what is a memory model anyway ?

So, we need to take a look at some notions in compiler and programming languages.

Semantics ?

If you're a programmer, you probably have some knowledge of that compiling stuff: you give higher-level code to some program (the compiler) and it gives you back a binary file, which is supposed to be some machine-level code. If you've tried to write ASM by yourself, you've probably found that there's quite a huge gap between your favorite programming language's constructions and assembly code. Of course, to experienced eyes, even if you don't know the target assembly, you'll see patterns in the code and probably understand the link between the two worlds.

But anyway, the translation between higher-level code and assembly code is not simple. It's not just a basic line-to-line translation. The formal way to understand this translation is what we call semantics. A language's semantics explains what value an expression (for example) should return, and eventually what memory cells are affected by the computation.

When dealing with well-founded semantics, as with most functional programming languages, the description is theoretical; even with small-step operational semantics that can be directly implemented in an interpreter, we don't really have a description of how to obtain results, but of what results are expected. This way, the language can be implemented in several ways and doesn't rely on specific architectures or algorithms.

C and C++ don't belong to that kind of language (some may say hopefully … ) But a lot of people think (and I was also thinking that way when I was a student) that the compiler defines the semantics. That is: the semantics of a C program is defined by how standard compilers translate the code.

This is false and stupid. C and C++ have a semantics, but where a well-founded programming language's validation system tries to forbid code for which no semantics is clearly defined, C and C++ let the compiler decide. So, this is not simple, but the thing you need to understand is that the semantics gives a lot of liberty to the compiler on how to translate things, and it only gives a loose definition of the expected results for a subset of the language.

One of the key ideas behind this approach is to give maximum liberty to the compiler so that it can optimize the code using all the available tricks.

Take this little code sample:

int res = p1->x * p2->x + p1->y * p2->y + p1->z * p2->z;

(given that p1 and p2 are pointers to some kind of 3D point structure.)

Edit: I've moved this code from float to integer. Because, as you know, re-ordering and floating point …

If you have played with the SIMD extensions of some CPUs, you know that we can obtain the same result with faster code using these extensions. But it will change the order (ok, not by much) in which the expression is evaluated. You know that it doesn't change the final result, but a too restrictive semantics would forbid that kind of rewriting.

What's the deal with re-ordering ? It doesn't change the result ? Yes, but only because we don't have any side effects here, nor do we need to respect some order in our memory accesses (think about accessing hardware, for example.)

So, rather than making too restrictive rules (with the hope that clever compiler writers find a way to auto-magically detect when the semantics can be broken), the people writing the standards for C and C++ decided to give maximum freedom and fewer constraints, with the drawback that some pieces of code will have undefined behavior.

There's a lot to explain about this approach, but a good example is better: the standard doesn't give any rule about the order in which this kind of thing is executed …

t1[i++] = t2[i++];

This is undefined behavior and, for that matter, the compiler may decide to produce code that wipes out your drives, because the standard doesn't say it can't do that.

Going Parallel

And now add some fun to that: multiple processors, multiple threads and a single memory. You begin to see what's happening next ?

So, let's dive into the hardware first. When dealing with memory accesses in the presence of multiple processors, we've got a lot of issues to solve.

Basically the question is: do we need to synchronize memory operations ?

Let's take a real life example: at the crossing of two roads, we may add some traffic lights to avoid accidents. But most of the time we don't need them. Why ? Because there aren't enough cars, and if we used traffic lights we would block cars when it's not needed.

For shared memory it's the same idea: most memory operations don't require any synchronization because they won't conflict. Beware, I'm not saying that when two processors access the same memory cell there isn't any risk; no, I'm saying that most of the time processors won't access the same memory cells. So, you'll need to specify what happens when you know conflicts can arise.

So, rather than synchronizing all memory operations, or even rather than trying to synchronize all dangerous memory accesses, most hardware is based on what we call the relaxed memory model: you must explicitly describe where you need synchronization; where you don't, you have no guarantee about what will happen.

So what kind of synchronization do we have:

  • atomic operations: an atomic operation is a complex operation (doing more than one memory access) that executes as if it were a simple access. The usual operations are atomic exchange, compare-and-swap …
  • acquire barrier: this barrier enforces that operations after the barrier will not be re-ordered before the barrier (that is, the processor won't try to anticipate load operations that come after the barrier.)
  • release barrier: this barrier enforces that all operations begun before the barrier will be finished before crossing it (that is, the processor won't defer concrete memory stores until after the barrier.)
  • full memory fence: a full memory fence is a kind of combined acquire and release barrier: all pending operations must finish before crossing the barrier and no later operation will start before it.
The acquire/release things often disturb newcomers to the world of concurrent memory accesses. The first thing you need to know is that your processor doesn't act in a linear fashion, for a lot of good reasons. To be short: it is able to start loading data from memory as soon as it is able to decode the target address, even if some prior instructions are not yet finished; it is also able to defer store operations to a later point, that is, to start operations that come after the store before concretely storing the data (note that your compiler may also try to do similar changes to your code.)

The classic example is the I-have-finished boolean mistake. Let's take a look at the following pseudo code:

shared int x = 0;
shared bool ok = false;

worker0() {
  x = 42;
  ok = true;
}

worker1 {
  if (ok)
    print(x);
  else
    print("not yet");
}

So, a quick reading lets you think that this program will either print "42" or "not yet" but not "0", right ? Wrong !

The processor will probably defer some of the store operations: if writing to x takes longer, it can set ok to true before storing 42 in x. On the other side, worker1 can also start reading x before ok, and thus read the old value of x and then the true flag in ok.

Barriers can be used this way here:

shared int x = 0;
shared bool ok = false;

worker0() {
  x = 42;
  release-barrier;
  ok = true;
}

worker1 {
  if (ok) {
    acquire-barrier;
    print(x);
  } else
    print("not yet");
}

This will solve the issue.
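For the record, here is roughly how the same pattern can be written with the C++11 atomics discussed below (a sketch, not the only possible mapping): the release store and the acquire load play the role of the two barriers.

#include <atomic>
#include <cstdio>

int               x = 0;
std::atomic<bool> ok{false};

void worker0() {
  x = 42;
  ok.store(true, std::memory_order_release); // release: the store to x is visible before ok
}

void worker1() {
  if (ok.load(std::memory_order_acquire))    // acquire: if we see ok == true, we see x == 42
    std::printf("%d\n", x);
  else
    std::printf("not yet\n");
}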

And thus, what about C and C++ ?

For years, the standards avoided the question. C99 and C++98 don't even mention that kind of issue as undefined behavior; no, they simply don't deal with concurrency at all. Until the release of the 2011 standards, C11 and C++11 (in fact this had been in C++0x for quite some time.)

What's the difference now ? Hey, guess what ? (most) concurrent accesses are undefined behavior !

What a (r)evolution, you may say. But in fact it is one, because the standard also provides a way to introduce barriers. And thus, we are now able to write code using atomic operations and memory barriers without relying on the supposed behavior of a compiler and a processor.

If you have looked for implementations or discussions about implementing lock-free data structures, you may have noticed that most of them are done in Java; the main reason is that Java made the move to the relaxed memory model (basically the one I described to you) with the release of Java 1.5. So, you can rely on that model to justify your implementation, and be sure that your benchmarks make sense.

I hope that the arrival of the new C++ memory model will give us a chance to see more C++ implementations.

And the other stuff ?

You want to know more about C++11 threads ? Read the specs !

I've got some slides (in English) on the subject here: Threads

You can also find interesting information in this (old but still accurate) article by Maurice Herlihy: Wait-Free Synchronization (pdf format.)

The next article will talk about my implementation of Hazard Pointers, their use in lock-free queues and maybe some words about testing all that code.

Saturday, March 23, 2013

Beginners' corner: factorization !

This is a sample extracted from an exercise given to beginner students in C: a simple graph implementation. We're looking at the function that adds an edge between two vertices.

Context

The types follow traditional graph implementations and tend to respect the formalism used in algorithmic classes (so don't bother me about typedefed pointers !)

typedef struct s_adj   *adj;

struct s_adj {
  size_t        src, dst;
  adj           next;
};

typedef struct s_vertex *vertex;

struct s_vertex {
  size_t        ins, outs;
  adj           succ, pred;
};

struct s_graph {
  size_t        order;
  vertex        vertices;
};

Our representation stores directed edges at both ends: in the source's successors and in the destination's predecessors. Our function will take the graph and the identifiers of the source and destination (indexes in the vertices table).

Naive code


A straightforward implementation looks like this:

void add_link(graph g, int src, int dst) {
  adj           cell_succ;
  cell_succ = malloc(sizeof (struct s_adj));
  cell_succ->src = src;
  cell_succ->dst = dst;
  cell_succ->next = g->vertices[src].succ;
  g->vertices[src].succ = cell_succ;
  g->vertices[src].outs += 1;
  adj           cell_pred;
  cell_pred = malloc(sizeof (struct s_adj));
  cell_pred->src = src;
  cell_pred->dst = dst;
  cell_pred->next = g->vertices[dst].pred;
  g->vertices[dst].pred = cell_pred;
  g->vertices[dst].ins += 1;
}

We build two cells for the lists of successors and predecessors of both ends. This code works, but that's all; it is far too long and looks dirty. And, even though it's a short piece of code, it is unmaintainable: if we change simple details (like the fact that the edge is described in the same direction at both ends) we have at least two modifications per change.

First, let's add variables for our vertices. Our type defines vertices as pointers to structures, and that's fine by me; I'll use pointers rather than array cells, so the new version looks like this:

void add_link(graph g, int src, int dst) {
  vertex        s = g->vertices + src;
  vertex        d = g->vertices + dst;
  adj           cell_succ;
  cell_succ = malloc(sizeof (struct s_adj));
  cell_succ->src = src;
  cell_succ->dst = dst;
  cell_succ->next = s->succ;
  s->succ = cell_succ;
  s->outs += 1;
  adj           cell_pred;
  cell_pred = malloc(sizeof (struct s_adj));
  cell_pred->src = src;
  cell_pred->dst = dst;
  cell_pred->next = d->pred;
  d->pred = cell_pred;
  d->ins += 1;
}

It's a small change, but it gains a little more readability. Oh, just a note: I don't like long lines and long variable (or function) names; even if they seem more explanatory, a variable with more than 5 characters makes your code unreadable. Reserve long names for major functions, classes or type names, not for local variables or parameter names. This is a common error: thinking that saying more makes the code more readable. That's wrong: if you can't figure out what a variable is used for, it means that the code you're reading is already fucked-up, and an explicit name or a comment won't save it.

And about the vertex declaration: I prefer the use of pointer arithmetic here for readability. It may seem strange not to use array notation but, what do you think is better: &(g->vertices[src]) or g->vertices + src ? For me the choice is obvious …

Now: factorization !

The biggest issue in this function is the two almost identical blocks of code. A good idea would be to defer cell initialization to an external function. This function does half of the link job, which gives us a good name for that new function:

void add_half_link(int src, int dst, adj cell) {
  cell->src  = src;
  cell->dst  = dst;
}

But can we attach this cell directly to its corresponding list ? Yes, just pass a pointer to the list (I really mean a pointer to the list, thus a pointer to the pointer to a cell: the list is already a pointer, not a structure.)

void add_half_link(int src, int dst, adj cell, adj *l) {
  cell->src  = src;
  cell->dst  = dst;
  cell->next = *l;
  *l = cell;
}

Our new function does the job for half of the link, and the way we pass the list of edges lets us use it in both cases (successors and predecessors.)

Putting everything together …

Now we just have to integrate our sub-function in the original one:

static inline
void add_half_link(int src, int dst, adj cell, adj *l) {
  cell->src  = src;
  cell->dst  = dst;
  cell->next = *l;
  *l = cell;
}

void add_link(graph g, int src, int dst) {
  vertex        s = g->vertices + src;
  vertex        d = g->vertices + dst;
  add_half_link(src, dst, malloc(sizeof (struct s_adj)), &(s->succ));
  add_half_link(src, dst, malloc(sizeof (struct s_adj)), &(d->pred));
  d->ins  += 1;
  s->outs += 1;
}

Note that I have declared our sub-function as static inline: it is obviously static since we don't want it to be exported, and asking for inline seems logical. We could also have found a way to increment the degrees (ins and outs) in the sub-function …

Finally we've got code that is (not that much) smaller but, and that's the most important part, more readable and maintainable.

Of course, you should have written the second version directly …

This is a simple example showing the idea that you can increase your code quality by using factorization.

Oh, and for the record, only the main function needs a specification comment. Avoid inner comments as much as possible: they don't improve readability or understanding of the code, and most of the time they break the reading flow. This kind of function is self-explanatory, and so must be most of your code. If you feel the need to explain it, think about reorganizing it or some other kind of refactoring; only very specific tricks need comments, and if you have too many, you're probably doing things the wrong way.

There's another point about code comments: they tend to get out of sync. Your code will change, you'll probably find bugs … are you sure that all your comments are in sync with these changes ? No comment, no problem …

That's it for today ! Code well !


Tuesday, February 19, 2013

New post on LSE blog

A new post (in fact, an old one, written in July 2012 but forgotten in the publication queue) on the LSE blog.

http://blog.lse.epita.fr/articles/42-implementing-generic-double-word-compare-and-swap-.html

It's all about playing with inline ASM and C++ templates ;)