Wednesday, June 27, 2012

New article on LSE's blog

Another article on LSE's blog about the C! programming language, check it out!

C! - system oriented programming - syntax explanation

The article presents the basics of C! syntax and is a follow-up to my previous article introducing the language (also on LSE's blog).

Back To C: tips and tricks

Note: I began this article months ago, but due to a heavy schedule (combined with my natural laziness), I postponed the writing. So the article may ramble a little; I apologize for that, but I hope you'll find it interesting.

The C programming language is probably one of the most widely used and widely known programming languages. Even if there are some obscure syntax tricks, it's a simple language, syntactically speaking. But it's probably one of the hardest to learn and to use correctly!

There are several reasons for that:
  • since it is intended to be a low-level tool, it lets you do a lot of things, things that are most of the time considered ugly but are necessary in some cases.
  • its original behavior does not include modern notions of type checking and other kinds of verification: you're playing with a machine gun without any safety lock …
  • expressive power with simple syntax often means more complex work: C belongs to the same kind of languages as assembly or lambda-calculus, you can do anything you want, but you have to code the whole thing from scratch.

One of the most important things about coding in C is to understand how it evolved: what it was meant to be and what it really is now. Back in the first Unix days, Dennis Ritchie presented his language as a high-level portable assembler, a tool for system, kernel and low-level programming. But since then, standards have appeared (ANSI, ISO …) and the language is no longer the original low-level tool.

Using C nowadays is a difficult exercise and the path to working code is full of traps and dead-ends.

I'll try to lay down some of my « tricks » and experiences. I've been using C for almost 15 years now, and I've been teaching it for about half of that time. I've seen a lot of things, but I can still be surprised every day.

Listen to your compiler!



Historically, C compilers have been rather quiet. Most bad uses are not really errors, and they may even be legitimate in some cases. So most compilers prefer to emit warnings rather than errors.

This is why you should activate all warnings and, even better, activate « warnings as errors » (with gcc or clang: -Wall -Wextra -Werror).

A good example is the warning about assignment in a condition. Take a look at the following code:

while ( i = read(fd, buf + off, SIZE - off) )
  off += i;

This code will trigger a warning (« warning: suggest parentheses around assignment used as truth value » in gcc). Of course, my example is a legitimate use of assignment in a condition. But the usual confusion of « = » with « == » is one of the most common errors in C, and probably one of the most perverse bugs.

This warning is far better than the habit of putting the constant on the left-hand side of a comparison (writing 0 == x rather than x == 0), which only works if one of the operands is not an lvalue (hey, you never compare two variables?)

As the message says, you just have to add extra parentheses to avoid the warning, like this:

while ( (i = read(fd, buf + off, SIZE - off)) )
  off += i;

Even if this example may seem naive, it reflects the way you should use your C compiler: activate all warnings and alter your code so that legitimate uses won't trigger any messages, rather than shutting the screaming off!

Oh, and if you want good warnings and error messages, I strongly recommend the clang compiler (based on llvm): it gives the best error messages I've seen so far.

And, for my pleasure, here is another classic example of test and assignment combined in an if statement:
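To make the read loop a bit more concrete, here is a minimal sketch, assuming a POSIX system. The name read_full and its return convention are my own assumptions for illustration, not something imposed by the language or the library. It keeps the parenthesized assignment, but compares the result explicitly so that an error (read returning -1) is never added to the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Build with all warnings turned into errors, for example:
 *   cc -Wall -Wextra -Werror -c read_full.c
 */

/* read_full (hypothetical helper): read from fd until buf is full or
 * end of file is reached. Returns the number of bytes stored, or -1 if
 * read() reported an error. Interrupted calls (EINTR) are not retried
 * in this sketch. */
ssize_t read_full(int fd, char *buf, size_t size)
{
  size_t  off = 0;
  ssize_t n = 0;

  assert(buf != NULL);
  /* The parentheses around the assignment show it is intentional;
   * the "> 0" stops the loop on end of file (0) and on error (-1). */
  while (off < size && (n = read(fd, buf + off, size - off)) > 0)
    off += (size_t)n;
  return n < 0 ? -1 : (ssize_t)off;
}
```

With this shape, the compiler stays silent because the condition is a real comparison, and the caller can tell an error apart from a short read.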

int fd;
if ( (fd = open("some_file",O_RDONLY)) == -1)
  {
    perror("my app (opening some_file)");
    exit(3);
  }
/* continue using fd */

Identify elements syntactically


Have you ever masked an enum with a variable? Or fought with an error for hours just because a variable, a structure field or a function was replaced by a macro? In C, identifiers are not syntactically separated, so you can write horrors like this:

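#include <stdio.h>

enum e_example { A, B };

enum e_example f(void)
{
  enum e_example        x;
  float                 A = 42.0;   /* local variable hiding the enum constant A */
  x = A;                            /* uses the float, converted to 42 */
  return x;
}

int main(void)
{
  printf("> %d\n", f());
  return 0;
}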

What does this code print? 42, of course! Why? Simply because the float variable A masks the enum constant A. The ugly part is that, by default, your compiler won't warn you and won't complain.

So we can't rely on the compiler here: we have to protect ourselves from that kind of error. A usual solution is to adopt a syntactic convention: each kind of symbol has its own marker. For example, you can write enum constants like A_E or, in order to have different names for each enum definition, you can prefix your constants with the name of the type.

Basically, you should have a dedicated syntax for type names, macro constants, enum members and any other ambiguous identifiers. Thus, my previous enum should be written:

enum e_example { A_E, B_E }; /* using _E as enum suffix, for example */

Just keep in mind that identifiers should be kept relatively short in order to preserve readability and avoid typing annoyance (you won't enjoy typing e_example_A more than once).
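As an illustration, here is the earlier function rewritten with the suffixed constants. The name f_fixed is hypothetical, and the assert is only there to make the point visible: a local variable called A no longer collides with anything.

```c
#include <assert.h>

enum e_example { A_E, B_E };   /* _E marks enum constants */

/* f_fixed (hypothetical name): same shape as the masking example,
 * but the local variable A can no longer hide an enum constant. */
enum e_example f_fixed(void)
{
  enum e_example x;
  float          A = 42.0f;    /* an ordinary local, nothing is masked */

  (void)A;                     /* unused on purpose in this sketch */
  x = A_E;
  assert(x == A_E);            /* x really holds the enum constant (0) */
  return x;
}
```

The convention does not stop anyone from declaring a variable named A_E, of course. It only makes such a collision unlikely and easy to spot in review.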

Understanding Sequence Points


There's a notion of sequence points in the language: code between two sequence points can be evaluated in any order. The only thing you know is that all side effects syntactically before a sequence point take place before going further, and no side effect syntactically after the sequence point will begin before it.

So this notion obviously forbids ambiguous code like i++ * i++ or i = i++. In fact, this code is not strictly forbidden (sometimes the compiler may issue a warning, but that's all): it belongs to the infamous category of undefined behavior, and you don't want to use it.

But that's not all. What you should understand and enforce is that no more than one modification of the same location should occur between two sequence points, and also that whenever a memory location is modified between two points, the only legitimate access is the one used to compute the modification (i.e. you're fetching the value to compute the new value).

So you should not write something like t[i] = i++, but you can (of course) write i = i + 1.
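A minimal sketch of how the forbidden form can be split: the read of i and its modification are placed in two separate full expressions, so a sequence point sits between them. The function name fill_indexes is an assumption made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Undefined behavior, do not write this:   t[i] = i++;   */

/* fill_indexes (hypothetical helper): store each index in its own cell. */
void fill_indexes(int t[], size_t len)
{
  size_t i = 0;

  assert(t != NULL || len == 0);
  while (i < len)
    {
      t[i] = (int)i;   /* i is only read in this full expression */
      i = i + 1;       /* i is modified in the next one */
    }
}
```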

Now, which constructions mark those sequence points?
  • Full expressions (an expression that is not a sub-expression of another)
  • Sequential operators: ||, && and , (as well as ?: after the evaluation of its condition)
  • Function calls (all elements necessary for the call, like parameters, will be computed before the call itself.)
That's essentially all, meaning that in the expression f() + g() you don't know whether f() will be called before g()!
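When the order of two calls matters, the simplest fix is to give each call its own full expression. The sketch below assumes a deliberately impure function, next_id, invented for the example. Writing next_id() - next_id() could give -1 or 1 depending on the compiler, while the version with two statements always gives -1.

```c
#include <assert.h>

/* next_id (hypothetical): an impure function, each call returns a new value. */
int next_id(void)
{
  static int counter = 0;

  return ++counter;
}

/* Unspecified order, avoid:   return next_id() - next_id();   */

/* ordered_diff (hypothetical): same computation with a fixed call order. */
int ordered_diff(void)
{
  int first = next_id();    /* the end of this statement is a sequence point */
  int second = next_id();   /* so this call always happens after the first */

  assert(second == first + 1);
  return first - second;
}
```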

Here are my personal rules to avoid those ambiguous cases:

  • Prefer pure functions in expressions
  • Keep track of the global states modified by your functions, so you can establish which functions can't be used in the same sub-expression
  • Each function should have a bounded scope: a function must do one thing, and if you can't avoid global side effects, they must be limited to one or two global states per function.
  • Prefer local static states to global states, so you can confine modifications to a single function.
  • Prefer pointers to implicit references for function parameters (C++)
The last point may disturb you. Implicit references (in the sense of C++ or Pascal) are used, at both the call site and in the body of the function, like non-reference arguments, which hides the persistence of the modification. This poses no threat when the modification is intended, but inside an expression it can lead to the kind of bug you fight against for hours. Using an explicit pointer requires an address-of operator (&) at the call site, signaling the possible modification.
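A small sketch of the last rule, with hypothetical names (increment, twice_incremented): because the parameter is an explicit pointer, every call site has to write &x, and the reader sees immediately which calls may change x.

```c
#include <assert.h>
#include <stddef.h>

/* increment (hypothetical): the star in the prototype announces
 * that the pointed value will be modified. */
void increment(int *value)
{
  assert(value != NULL);
  *value += 1;
}

int twice_incremented(int start)
{
  int x = start;

  increment(&x);   /* the & at the call site signals that x may change */
  increment(&x);
  return x;
}
```

With a C++ reference parameter the calls would read increment(x), and nothing at the call site would hint at the modification.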

Function prototypes have to be explicit

In C, pointers are used for a lot of things: arrays, mutable arguments, data structures …

So you can pass a pointer to a function for a lot of reasons, and it is worth finding a way to differentiate the various uses of a pointer as a parameter.

I'll give you a quick overview of my coding style:
  • Pointer as array: when the pointer parameter is in fact an array, I always use the empty brackets rather than the star, for example:

    float vect_sum(float tab[], size_t len)
    {
      float res = 0;
      float *end = tab + len;
      for (; tab != end; ++tab)
        res += *tab;
      return res;
    }
    

  • Pointer as reference (mutable argument): when passing a pointer to a variable so that modifications of its value are kept in the calling context, I explicitly use a star:

    void swap(int *a, int *b)
    {
      int c;
      c = *a;
      *a = *b;
      *b = c;
    }
    

  • Data structure: most (all) linked data structures (list, tree …) have a pointer as entry point. In fact, the data structure is the pointer (for example, the linked list is the pointer, not the struct). So I hide the pointer in the typedef rather than leave it visible:

    typedef struct s_list *list;
    struct s_list
    {
      list next;
      int  content;
    };
    size_t list_len(list l)
    {
      size_t r = 0;
      for (; l; l = l->next)
        ++r;
      return r;
    }

The last point is often disturbing for students and needs some explanation. First, consider the logic of a linked list: a list is recursively defined as either an empty list or a pair of an element and a list. Thus, the NULL pointer representing the empty list is a list, meaning that the list is the pointer, not the structure (a pair can be viewed as a structure or as a pointer to a structure; to be coherent we must choose the latter definition).
So your list is a pointer, and thus the type list must reflect this fact.
This is where the usual argument comes in: « but if the pointer is not obvious in the prototype, how do I know that I must use an arrow rather than a dot to access the structure members? » There are two answers to this question:
  • First, when dealing with code that manipulates the list, you know what you're doing, so the question doesn't arise!
  • Second, you must be coherent: if you include the star in the typedef, you never typedef a bare struct. So what gets flagged is not the case where you should use an arrow, but the case where you should use a dot!
Combining the rule for pointers as references and pointers as data structures, you get a coherent strategy to define functions that modify the content of the data structure and functions that modify the structure itself (especially when modifying the entry point). The following example shows a functional add (adding an element at the head of a linked list without modifying the given pointer) and a procedural add (adding at the head again, but modifying the given pointer).

/* we're using the previous list definition */

list fun_add(list l, int x)
{
  list t;
  t = malloc(sizeof (struct s_list));
  t->content = x;
  t->next = l;
  return t;
}

/* note that l is passed as a pointer to a list, not a list */
void proc_add(list *l, int x)
{
  list t;
  t = malloc(sizeof (struct s_list));
  t->content = x;
  t->next = *l;
  *l = t;
}

The star in the second version indicates that the function may modify the head of the list (in fact, it will).
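For completeness, here is a self-contained variant of the two functions that also checks the result of malloc. This is only a sketch under my own assumptions: the error conventions (NULL for fun_add, an int status for proc_add) and the list_free helper are not part of the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct s_list *list;
struct s_list
{
  list next;
  int  content;
};

/* Functional add: returns the new head, or NULL if malloc fails
 * (in that case l is untouched and still belongs to the caller). */
list fun_add(list l, int x)
{
  list t = malloc(sizeof (struct s_list));

  if (t == NULL)
    return NULL;
  t->content = x;
  t->next = l;
  return t;
}

/* Procedural add: returns 0 on success, -1 if malloc fails (*l unchanged). */
int proc_add(list *l, int x)
{
  list t;

  assert(l != NULL);
  t = fun_add(*l, x);
  if (t == NULL)
    return -1;
  *l = t;
  return 0;
}

size_t list_len(list l)
{
  size_t r = 0;

  for (; l; l = l->next)
    ++r;
  return r;
}

/* The star again: list_free modifies the entry point, which ends up NULL. */
void list_free(list *l)
{
  assert(l != NULL);
  while (*l)
    {
      list t = *l;
      *l = t->next;
      free(t);
    }
}
```

Note that with fun_add the caller should keep the old head until the result has been checked, otherwise a failed allocation would lose the whole list.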

Forbidding typedefs of structures has another positive aspect: when passing a structure to a function (or when returning it), the structure is copied (and thus duplicated), inducing a (small) overhead at the function call (or function return) and higher memory usage. Hiding the fact that the value is a structure induces what I call hidden complexity: what looks like a simple operation has in fact a non-negligible cost. Thus, once again, the good strategy is to keep the fact that the value is a structure visible.

The special case of strings: in C, strings have no specific type; you use pointers to characters. Normally, you can view a string as an array of characters, but the usual convention is to describe strings as char* rather than char[], so strings are the only case where I use the star rather than the bracket syntax. It also solves the issue of arrays of strings: you can't use char[][] (this would be a two-dimensional array, but since the compiler needs the size of a line to correctly translate indexes, you can't write such a type). The solution is to mix star and brackets, as in the following example:
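To give an idea of the cost, here is a sketch with a hypothetical struct s_matrix (the names DIM, trace_by_value and trace_by_pointer are made up for the example). Both functions compute the same result, but the first one receives a full copy of the structure at each call, while the second one only receives an address.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 16

/* 16 x 16 doubles: about 2 KiB on platforms where a double is 8 bytes. */
struct s_matrix
{
  double cell[DIM][DIM];
};

/* The struct keyword in the prototype makes the copy visible. */
double trace_by_value(struct s_matrix m)
{
  double res = 0;
  size_t i;

  for (i = 0; i < DIM; ++i)
    res += m.cell[i][i];
  return res;
}

/* Only a pointer is passed; const documents that nothing is modified. */
double trace_by_pointer(const struct s_matrix *m)
{
  double res = 0;
  size_t i;

  assert(m != NULL);
  for (i = 0; i < DIM; ++i)
    res += m->cell[i][i];
  return res;
}
```

Had the structure been hidden behind a typedef, the first prototype would look as cheap as passing an int.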

/* argv is an array of strings */
int main(int argc, char *argv[])
{
  /* ... */
  return 0;
}
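The same mix of star and brackets works for any function taking an array of strings. Here is a minimal sketch with a hypothetical helper, strings_total_len. The const qualifier is my own addition, since the function only reads the strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strings_total_len (hypothetical): strs is an array (brackets) of
 * strings (star); returns the sum of the lengths of its count entries. */
size_t strings_total_len(const char *strs[], size_t count)
{
  size_t total = 0;
  size_t i;

  assert(strs != NULL || count == 0);
  for (i = 0; i < count; ++i)
    total += strlen(strs[i]);
  return total;
}
```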

Conclusion

I hope these tricks were useful or interesting to you. As I said in the opening, programming in C requires careful concentration and a bit of understanding of the underlying semantics of the language.

These tricks are not magic recipes: if you want to be a C programmer, you'll need practice. The first rule of a programmer should be: code, and when you think you have coded enough, then code again!

By coding I mean: try out ideas, implement toys, compare the behaviors of various implementations of the same idea, take a look at the generated ASM code (yes, you should do that …). And don't be afraid of pointers: pointers are your friends, you need them, they are useful in a lot of situations, but you've got to be nice to them and treat them properly (or you'll pay!)

As a second rule, I'll propose: find a convenient coding style! Coding styles are often useful to avoid misuse of data structures or ambiguous situations (such as the enum example), but when building a coding style, don't focus on code presentation, organization, comments or forbidden features. The most important parts of a coding style are the naming convention and a coherent way of describing function prototypes.