Do compilers produce better code for do-while loops versus other types of loops?

There's a comment in the zlib compression library (which is used in the Chromium project among many others) which implies that a do-while loop in C generates "better" code on most compilers. Here is the snippet of code where it appears.

do {
} while (*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
         scan < strend);
/* The funny "do {}" generates better code on most compilers */

https://code.google.com/p/chromium/codesearch#chromium/src/third_party/zlib/deflate.c&l=1225

Is there any evidence that most (or any) compilers would generate better (e.g. more efficient) code?

Update: Mark Adler, one of the original authors, gave a bit of context in the comments.

Answers


First of all:

A do-while loop is not the same as a while-loop or a for-loop.

  • while and for loops may not run the loop body at all.
  • A do-while loop always runs the loop body at least once - it skips the initial condition check.

So that's the logical difference. That said, not everyone strictly adheres to this. It is quite common for while or for loops to be used even when it is guaranteed that it will always loop at least once. (Especially in languages with foreach loops.)

So to avoid comparing apples and oranges, I'll proceed assuming that the loop will always run at least once. Furthermore, I won't mention for loops again since they are essentially while loops with a bit of syntax sugar for a loop counter.

So I'll be answering the question:

If a while loop is guaranteed to loop at least once, is there any performance gain from using a do-while loop instead.


A do-while skips the first condition check. So there is one less branch and one less condition to evaluate.

If the condition is expensive to check, and you know you're guaranteed to loop at least once, then a do-while loop could be faster.

And while this is considered a micro-optimization at best, it is one that the compiler can't always do: Specifically when the compiler is unable to prove that the loop will always enter at least once.


In other words, a while-loop:

while (condition){
    body
}

Is effectively the same as this:

if (condition){
    do{
        body
    }while (condition);
}

If you know that you will always loop at least once, that if-statement is extraneous.


Likewise at the assembly level, this is roughly how the different loops compile to:

do-while loop:

start:
    body
    test
    conditional jump to start

while-loop:

    test
    conditional jump to end
start:
    body
    test
    conditional jump to start
end:

Note that the condition has been duplicated. An alternate approach is:

    unconditional jump to end
start:
    body
end:
    test
    conditional jump to start

... which trades away the duplicate code for an additional jump.

Either way, it's still worse than a normal do-while loop.

That said, compilers can do what they want. And if they can prove that the loop always enters once, then it has done the work for you.


But things are bit weird for the particular example in the question because it has an empty loop body. Since there is no body, there's no logical difference between while and do-while.

FWIW, I tested this in Visual Studio 2012:

  • With the empty body, it does actually generate the same code for while and do-while. So that part is likely a remnant of the old days when compilers weren't as great.

  • But with a non-empty body, VS2012 manages to avoid duplication of the condition code, but still generates an extra conditional jump.

So it's ironic that while the example in the question highlights why a do-while loop could be faster in the general case, the example itself doesn't seem to give any benefit on a modern compiler.

Considering how old the comment was, we can only guess at why it would matter. It's very possible that the compilers at the time weren't capable of recognizing that the body was empty. (Or if they did, they didn't use the information.)


Is there any evidence that most (or any) compilers would generate better (e.g. more efficient) code?

Not much, unless you look at the actual generated assembly of an actual, specific compiler on a specific platform with some specific optimization settings.

This was probably worth worrying about decades ago (when ZLib has been written), but certainly not nowadays, unless you found, by real profiling, that this removes a bottleneck from your code.


In a nutshell (tl;dr):

I'm interpreting the comment in OPs' code a little differently, I think the "better code" they claim to have observed was due to moving the actual work into the loop "condition". I completely agree however that it's very compiler specific and that the comparison they made, while being able to produce a slightly different code, is mostly pointless and probably obsolete, as I show below.


Details:

It's hard to say what the original author meant by his comment about this do {} while producing better code, but i'd like to speculate in another direction than what was raised here - we believe that the difference between do {} while and while {} loops is pretty slim (one less branch as Mystical said), but there's something even "funnier" in this code and that's putting all the work inside this crazy condition, and keeping the internal part empty (do {}).

I've tried the following code on gcc 4.8.1 (-O3), and it gives an interesting difference -

#include "stdio.h" 
int main (){
    char buf[10];
    char *str = "hello";
    char *src = str, *dst = buf;

    char res;
    do {                            // loop 1
        res = (*dst++ = *src++);
    } while (res);
    printf ("%s\n", buf);

    src = str;
    dst = buf;
    do {                            // loop 2
    } while (*dst++ = *src++);
    printf ("%s\n", buf);

    return 0; 
}

After compiling -

00000000004003f0 <main>:
  ... 
; loop 1  
  400400:       48 89 ce                mov    %rcx,%rsi
  400403:       48 83 c0 01             add    $0x1,%rax
  400407:       0f b6 50 ff             movzbl 0xffffffffffffffff(%rax),%edx
  40040b:       48 8d 4e 01             lea    0x1(%rsi),%rcx
  40040f:       84 d2                   test   %dl,%dl
  400411:       88 16                   mov    %dl,(%rsi)
  400413:       75 eb                   jne    400400 <main+0x10>
  ...
;loop 2
  400430:       48 83 c0 01             add    $0x1,%rax
  400434:       0f b6 48 ff             movzbl 0xffffffffffffffff(%rax),%ecx
  400438:       48 83 c2 01             add    $0x1,%rdx
  40043c:       84 c9                   test   %cl,%cl
  40043e:       88 4a ff                mov    %cl,0xffffffffffffffff(%rdx)
  400441:       75 ed                   jne    400430 <main+0x40>
  ...

So the first loop does 7 instructions while the second does 6, even though they're supposed to do the same work. Now, I can't really tell if there's some compiler smartness behind this, probably not and it's just coincidental but I haven't checked how it interacts with other compiler options this project might be using.


On clang 3.3 (-O3) on the other hand, both loops generate this 5 instructions code :

  400520:       8a 88 a0 06 40 00       mov    0x4006a0(%rax),%cl
  400526:       88 4c 04 10             mov    %cl,0x10(%rsp,%rax,1)
  40052a:       48 ff c0                inc    %rax
  40052d:       48 83 f8 05             cmp    $0x5,%rax
  400531:       75 ed                   jne    400520 <main+0x20>

Which just goes to show that compilers are quite different, and advancing at a far faster rate than some programmer may have anticipated several years ago. It also means that this comment is pretty meaningless and probably there because no one had ever checked if it still makes sense.


Bottom line - if you want to optimize to the best possible code (and you know how it should look like), do it directly in assembly and cut the "middle-man" (compiler) from the equation, but take into account that newer compilers and newer HW might make this optimization obsolete. In most cases it's far better to just let the compiler do that level of work for you, and focus on optimizing the big stuff.

Another point that should be made - instruction count (assuming this is what the original OPs' code was after), is by no means a good measurement for code efficiency. Not all instructions were created equal, and some of them (simple reg-to-reg moves for e.g.) are really cheap as they get optimized by the CPU. Other optimization might actually hurt CPU internal optimizations, so eventually only proper benchmarking counts.


A while loop is often compiled as a do-while loop with an initial branch to the condition, i.e.

    bra $1    ; unconditional branch to the condition
$2:
    ; loop body
$1:
    tst <condition> ; the condition
    brt $2    ; branch if condition true

whereas the compilation of a do-while loop is the same without the initial branch. You can see from that that it's inherently less efficient by the cost of the initial branch, which is however only paid once. [Compare to the naive way of implementing while, which requires both a conditional branch and an unconditional branch per iteration.]

Having said that, they aren't really comparable alternatives. It is painful to transform a while loop into a do-while loop and vice versa. They do different things. And in this case the several method calls would totally dominate whatever the compiler did with while as against do-while.


The remark is not about the choice of the control statement (do vs. while), it is about the loop unrolling !!!

As you can see, this is a string comparison function (string elements probably being 2 bytes long), which could have been written with a single comparison rather than four in a shortcut-and expression.

This latter implementation is for sure faster, as it does a single check of the end-of-string condition after every four element comparisons, whereas standard coding would involve one check per comparison. Said differently, 5 tests per 4 element vs. 8 tests per 4 element.

Anyway, it will work only if the string length is a multiple of four or has a sentinel element (so that the two strings are guaranteed to differ past the strend border). Pretty risky !


This discussion of while vs. do efficiency is completely pointless in this case, as there is no body.

while (Condition)
{
}

and

do
{
}
while (Condition);

are absolutely equivalent.


Need Your Help

How to query GA export to BQ schema with hits.customDimensions.index in WHERE clause?

google-analytics google-bigquery

As of a few days ago, the following query worked fine on BQ with the schema generated by an export from GA: