An ongoing point I've made to programmers for many years is the use of simple assembler instructions, and avoiding "FOR" within C/C++ code.

Below is a simple example of this for GCC.

Firstly, you can see I removed the "FOR" loop and replaced it with a "DO/WHILE" statement in order to use the increment command.

Secondly, I replaced the "i++" with the assembler "INC".
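The listing itself is not reproduced in the text, so here is a minimal sketch of the kind of change described, assuming GCC extended inline assembly on x86 or x86-64. The function names `sum_for` and `sum_do_while_inc` are hypothetical, and the plain `++i` fallback for other targets is an assumption added so the sketch still compiles elsewhere.

```c
/* Baseline: plain FOR loop with i++. */
unsigned long sum_for(const unsigned char *data, unsigned int len)
{
    unsigned long total = 0;
    for (unsigned int i = 0; i < len; i++)
        total += data[i];
    return total;
}

/* Rewritten: DO/WHILE loop with the counter bumped by INC.
   A do/while body always runs at least once, so len == 0 is
   handled before entering the loop. */
unsigned long sum_do_while_inc(const unsigned char *data, unsigned int len)
{
    unsigned long total = 0;
    unsigned int i = 0;

    if (len == 0)
        return 0;

    do {
        total += data[i];
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        /* "+r" marks i as read and written; "cc" says flags change. */
        __asm__ ("incl %0" : "+r"(i) : : "cc");
#else
        ++i; /* assumption: portable fallback for non-x86 targets */
#endif
    } while (i < len);

    return total;
}
```

Both functions should return the same total for the same input; the only intended differences are the loop form and how the counter is incremented.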

Many like to argue that "For loops are better" and that "using assembler opcodes doesn't make much difference". I've addressed both points multiple times in the past. People who believe this are usually "trained" by others who don't actually know how code is compiled and run, so rather than repeat something I've demonstrated time after time, here is the assembler output of the original code compiled for x64, alongside that of my updated code:

[Assembler listings: original output | optimised output]

The first thing to notice is that my loop with the assembler insert is slightly larger than the original loop. This would seem to make the loop take longer than before, until you understand HOW the opcodes themselves work.

When you increment the memory location, it does just that and leaves the carry flag alone, as opposed to adding, which takes a second operand for the amount to add, rewrites every arithmetic flag (carry included) and then stores the result.

Let us look at Intel's "INC" opcode versus the "ADD" opcode. Notice that INC sets OF, SF, ZF, AF and PF from the result and never touches CF, whereas ADD writes all six and seems quite messy by comparison.
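The flag difference can be observed directly. The sketch below, assuming GCC extended inline assembly on x86 or x86-64, clears the carry flag, wraps 0xFFFFFFFF round to 0 with each instruction, and reads the carry flag back with SETC. The function names are hypothetical, and the -1 return on other targets is an assumption added so the code still compiles there.

```c
/* Returns the carry flag (0 or 1) left after INC wraps 0xFFFFFFFF,
   or -1 when not built for x86 with a GCC-compatible compiler. */
int carry_after_inc(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int value = 0xFFFFFFFFu;
    unsigned char carry;
    __asm__ volatile (
        "clc\n\t"          /* start with CF = 0 */
        "incl %0\n\t"      /* value wraps to 0; INC does not write CF */
        "setc %1"          /* copy CF into carry */
        : "+r"(value), "=q"(carry)
        :
        : "cc");
    return carry;
#else
    return -1;
#endif
}

/* Same experiment with ADD 1, which does write CF on wrap-around. */
int carry_after_add(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int value = 0xFFFFFFFFu;
    unsigned char carry;
    __asm__ volatile (
        "clc\n\t"          /* start with CF = 0 */
        "addl $1, %0\n\t"  /* value wraps to 0; ADD sets CF = 1 */
        "setc %1"          /* copy CF into carry */
        : "+r"(value), "=q"(carry)
        :
        : "cc");
    return carry;
#else
    return -1;
#endif
}
```

On an x86 build, `carry_after_inc()` should report the carry flag still clear while `carry_after_add()` reports it set, which is the one flag the two opcodes treat differently.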

In short, it's far better to use INC than ADD 1 for counting iterations for loops.

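The original code also starts with a jump, which looks very strange and I can only guess is one of the many bad quirks of C code when it is compiled; most likely the compiler has placed the loop test at the bottom of the loop and jumps to it first, since a "FOR" loop must check its condition before the first pass. When timing in code is more important than size, it's ALWAYS best to avoid using too many branches and jumps.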

One of many tricks to achieve "repeating code" over "loops" is to write a generator, something I use heavily on smaller systems. For example, you want to change the RGB values of a bitmap while it is being displayed, but in a cycle-critical environment. C code could do this, but when compiled, C code is ALWAYS inefficient when compared with well-written assembler code.
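As a rough illustration of the generator idea, the sketch below emits straight-line source text, one statement per colour component, instead of a loop. Everything here is hypothetical: the function name `generate_unrolled`, the emitted identifiers `px` and `delta`, and the choice to emit C statements rather than assembler. The generator itself runs ahead of time, so its own loop costs nothing in the cycle-critical path.

```c
#include <stdio.h>
#include <stddef.h>

/* Writes `count` unrolled statements into buf, one per component.
   Returns the number of characters written (excluding the final NUL),
   or -1 if buf is too small to hold the whole output. */
int generate_unrolled(char *buf, size_t cap, unsigned int count)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    buf[0] = '\0';

    for (unsigned int i = 0; i < count; i++) {
        int n = snprintf(buf + used, cap - used,
                         "px[%u] = (unsigned char)(px[%u] + delta);\n",
                         i, i);
        /* snprintf reports the length it wanted; treat truncation as failure. */
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it with a count of 3 produces three consecutive statements for `px[0]`, `px[1]` and `px[2]`, which can then be pasted or included into the routine that has to run without branches.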

I would also like to address the Intel Optimising C compiler. As a former Intel partner, I used their compiler a lot when experimenting with various compression routines made by Matt Mahoney. His compression used context modelling with neural networking, which meant it was very slow and very CPU intensive. The gains from using the Intel Optimising compiler were real: on average about 10% faster code, at the expense of a significant size increase in the app itself, because it would unroll loops (which I explained earlier). HOWEVER, it was still C code and therefore could be optimised further. Yes, my hand-written code was over 15% faster with little extra app size increase. Enough boasting.

In this case, for the crunch routine, the number of loop iterations depends on the user input, so we cannot predict it in advance. Instead, we minimise the loop itself as much as possible and replace as many "FOR" loops as we can with "DO/WHILE". Even doing this will increase speed slightly at the cost of a few extra opcodes.

Given that the idea behind crunch is to generate wordlists, a very simple idea yet a very time-intensive one, we can be forgiven for making the actual code itself large for the sake of increased speed. By removing as many "FOR" loops as possible, and replacing as many "[LABEL]++" with "INC [LABEL]" commands as possible, you will see a speed increase. This rule is universal and can also be applied to compression routines which you may well be using alongside crunch. If you can shave a month off a year's generating, then that is a worthy saving and not unreasonable.
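To check the saving on your own machine, a small timing harness helps. This is a minimal sketch using the standard `clock()` function; the names `loop_fn`, `time_loop` and `count_do_while` are hypothetical, and the routine being timed is a stand-in rather than anything taken from crunch.

```c
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef unsigned long (*loop_fn)(unsigned int iterations);

/* Stand-in workload: sums 0 .. iterations-1 with a DO/WHILE loop. */
unsigned long count_do_while(unsigned int iterations)
{
    unsigned long n = 0;
    unsigned int i = 0;

    if (iterations == 0)
        return 0;
    do {
        n += i;
        ++i;
    } while (i < iterations);
    return n;
}

/* Calls fn `repeats` times and returns the elapsed CPU time in seconds,
   or -1.0 if the processor clock is unavailable. */
double time_loop(loop_fn fn, unsigned int iterations, unsigned int repeats)
{
    volatile unsigned long sink = 0; /* stops the calls being optimised away */
    clock_t start = clock();
    clock_t end;

    if (start == (clock_t)-1)
        return -1.0;

    for (unsigned int r = 0; r < repeats; r++)
        sink += fn(iterations);

    end = clock();
    if (end == (clock_t)-1)
        return -1.0;

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Time each version of a routine with the same `iterations` and `repeats`, run it several times, and compare the results; a single run is too noisy to trust.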