hprime

hprime is a fast prime generator based on the sieve of Eratosthenes.

I plan to present the code in 4 parts:

Part 1: Introduce 4 main optimisations used in hprime (reviewing)
Part 2: Show how the main optimisations can be combined (reviewing)
Part 3: How the hprime codebase is structured (not started)
Part 4: The optimisations as used in the hprime structure (not started)

Also see:

Misc

Code now on github

Introduction

This section introduces some of the optimisations that are used in hprime. The optimisations are all shown based off a simple implementation of the sieve of Eratosthenes (which is shown first). The idea is to show just the required changes so that each optimisation is easy to understand.

Simple Sieve of Eratosthenes

A simple sieve of Eratosthenes using a bitmap for storing primality with the optimisation of starting from the primes square

For a detailed explanation of the sieve of Eratosthenes, see the Wikipedia article.

Sieve of Eratosthenes using a bitmap

A simple sieve of Eratosthenes using a bitmap for storing primality with the optimisation of starting from the primes square

The following is a basic description of how the sieve works.

The primality of a number can be stored in one bit - the number is either prime or it is not. Here, a bitmap is used where the number is determined by the position in the bitmap and the primality of that number is determined by whether the bit is not set.

The bitmap is used to search for a prime from smallest to largest. When a prime is found, all multiples of that prime are marked off on the bitmap for the remainder of the desired range as being "not prime". By working from smallest to largest every number that is checked will have had all possible prime factors checked first, and so the primality is known.

(0 and 1 have been pre-marked as not prime and the implementation starts at 2, with 2 being the first prime).

Note that this already contains an optimisation in that instead of starting from prime * 2, it starts from prime * prime, as multiples for numbers smaller than prime will have already been marked off. However, this example is otherwise supposed to show the basic principle and is not an optimised version.

6.3s

   for (a = 2; a <= max; a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      num_primes++;

      for (r = a * a; r <= max; r += a)
         bitmap[r / 8] |= 1 << (r % 8);
   }

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>


static uint64_t
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t a,
            r,
            num_primes = 0;


   for (a = 2; a <= max; a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      num_primes++;

      for (r = a * a; r <= max; r += a)
         bitmap[r / 8] |= 1 << (r % 8);
   }


   return num_primes;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8 + 1, 1);

   printf("%ld\n", calc_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Generating and counting primes

A quick method of counting primes in order to focus on speed of generation

The above animation intended to both mirror that in Wikipedia, and also show how the mechanism can be used to discover all primes within the range. However it is easy to see in the example above that for a > 11, there are no more changes to the bitmap. This is because multiples are marked starting at a * a, and so any value of a greater than sqrt(max) will make no changes to the bitmap.

The rest of the examples below take advantage of this and break the task down in to two parts:

Generate the bitmap
Use the bitmap

The emphasis of the optimisations is on the first part, generating the bitmap.

As a baseline for demostrating using the bitmap, the primes are counted. This is relatively quick and simple as there is no need to know what number each bit represents, just how many bits are set.

The main algorithm here is the same as above except that it exits once reaching sqrt(max) and the primes are counted separately.

The count_primes() function can be viewed in the full code. There is no need to understand this function to continue, it just counts the number of bits up to (and including) the specified bit.

5.3s

   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (r = a * a; r <= max; r += a)
         bitmap[r / 8] |= 1 << (r % 8);
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>


static void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t a,
            r;


   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (r = a * a; r <= max; r += a)
         bitmap[r / 8] |= 1 << (r % 8);
   }


   /* mark bit 0 and 1 as not prime */
   bitmap[0] |= 0x3;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8 + 1, 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Choosing the byte and bit

Avoiding extra calculations when choosing the byte and bit

In the above animation, the bytes boundaries are shown as well as the bits. A byte is of interest because it is the smallest addressable location for the CPU. There is no "bit pointer" that can be incremented by the prime each time, and there is no CPU instruction to directly set the required bit at an offset.

This means to "set the bit at location 49" actually requires work to determine the byte that the bit is located in, and then the bit location within that byte. This is the rather ugly bitmap[r/8] |= 1 << (r%8) line in the above example.

The task of marking off multiples can be thought of in terms of Modular Arithmetic.

The examples below demonstrate how all multiples can be marked off in an efficient way without calculating the divisor and remainder each time.

Mark every 8th multiple

Keep the same bitmask by marking every 8th multiple

A simple solution to reduce the work required to choose the byte and the bit is to set every 8th multiple of the prime.

This means, to find the next multiple, the resulting bit position is increased by a * 8 bits, which is equivilant to a * 8 / 8 or a bytes. As this divides evenly the bit position within the byte doesn't change (ie (8 * a) % 8 equals 0).

Setting every 8th multiple of the prime then needs to be done 8 times for the 8 different starting offsets.

Note: one small optimisation is to use a pointer instead of an array and offset, since the next byte position is being chosen based off the previous one.

Note: The bitmask only needs to be calculated once for each a and i, which the compiler can easily pick up

8s

   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (i = 0; i < 8; i++)
         for (bmp = bitmap + a * (a+i) / 8;  bmp < bitmap_end; bmp += a)
            *bmp |= 1 << ((a * (a+i)) % 8);
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end = bitmap + max/8 + 1;
   uint64_t  a;
   int       i;


   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (i = 0; i < 8; i++)
         for (bmp = bitmap + a * (a+i) / 8;  bmp < bitmap_end; bmp += a)
            *bmp |= 1 << ((a * (a+i)) % 8);
   }


   /* mark bit 0 and 1 as not prime */
   bitmap[0] |= 0x3;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8 + 1, 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Mark all offsets at once

When marking every 8th multiple, process all starting offsets at the same time

The difference between the 8 initial byte offsets will remain the same as the next 8th multiple is marked for each (since all offsets increment by a bytes). Also from above the bit position does not change for each initial offset. This can be described by saying that the same pattern of 8 byte offsets and bitmasks will repeat every a bytes. These can be stored and applied over and over to the bitmap.

This is used below by continually incrementing a base pointer by a bytes, and describing the 8 offsets relative to the base pointer.

Instead of the base byte pointer being (a * a) / 8 as above, it is chosen as a * (a / 8), which means the byte offsets become simply a * i / 8 and the bitmasks a * i % 8. One way to think of this (without maths) is to note that a * (a / 8) is a multiple of a bytes so the pattern is the same as that which starts at 0.

Note: Starting at a * (a / 8) means the first byte is clobbered and needs to be corrected.

Note: Allowance has been made for the sets to overrun bitmap_end.

4.8s

   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (bmp = bitmap + a * (a / 8);  bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + (a * i / 8)) |= 1 << (a * i % 8);
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end = bitmap + max/8 + 1;
   uint64_t  a;
   int       i;


   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (bmp = bitmap + a * (a / 8);  bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + (a * i / 8)) |= 1 << (a * i % 8);
   }


   /* Reset the first byte as it is clobbered by the above algorithm*/
   bitmap[0] = 0x53;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8 + sqrtl(max) + 1, 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Repeating patterns for small primes

Taking advantage of repeating patterns for small primes

It has been shown above that when marking off all multiples there is a pattern of bit/bytes that repeats every a bytes. For smaller primes where there is more than one bit per register-width, the pattern can be used to reduce the work required by using a single instruction to set all required bits in that register-width at once.

Applying small primes using 8 byte ints

An example of directly setting the pattern for the lower primes

All odd primes (or all odd numbers) will not divide evenly in any register-width which are all powers of 2. Instead the pattern will repeat every a bytes, or a ints, or a 32-bit ymm registers. One way to think of it is that 8 repeating patterns of a bytes will fit in a 8-byte integers.

For primes less than 16, there are enough registers so that the a parts of a repeating pattern can be held in a registers of the chosen register width. These registers can then be continually applied one after the other in a loop with no manipulation required.

In the animation, 8 byte ints are used for the primes 2 and 3 where the bit patterns for the lower primes have already been pre-set. This is just to show the effective speedup that this method can introduce. It is especially noticeable since in the initial animation, marking off the primes 2 and 3 were by far the slowest operation.

If viewing the full code, the patterns are dynamically created based on the prime and then repeatedly applied for all primes less than 16

Note: gcc doesn't use the registers as described for this example, but will for others. It can be forced - see here

4.5s

   for (i = 0; i < 6; i++) {
      bzero(pattern, sizeof pattern);
      a = low_a[i];

      for (r = 0; r < a*64; r += a)
         pattern[r / 8] |= 1 << r % 8;

      for (bmp = bitmap; bmp < bitmap_end; bmp += a * sizeof(uint64_t))
         for (j = 0; j < a; j++)
            *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j);
   }

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <inttypes.h>


static void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t  a,
             r,
             i,
             j;
   int       low_a[] = {2,3,5,7,11,13};
   char      pattern[16*8],
            *bmp,
            *bitmap_end = bitmap + max/8 + 1;


   for (i = 0; i < 6; i++) {
      bzero(pattern, sizeof pattern);
      a = low_a[i];

      for (r = 0; r < a*64; r += a)
         pattern[r / 8] |= 1 << r % 8;

      for (bmp = bitmap; bmp < bitmap_end; bmp += a * sizeof(uint64_t))
         for (j = 0; j < a; j++)
            *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j);
   }


   for (a = 17; a < sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (r = a * a; r <= max; r += a)
         bitmap[r / 8] |= 1 << (r % 8);
   }


   /* Reset the first bytes as they are clobbered by the above algorithm*/
   bitmap[0] = 0x53;
   bitmap[1] = 0xD7;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8 + 16*8 + 1, 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Processing a block at a time

Maximising use of CPU caches by processing a block of numbers at a time

When calculating the primes to 1 billion the bitmap needs to be ~120MB which is much larger than the cpu caches. Currently, when setting the multiples, the full 120MB of the array is traversed for each prime and so needs to be read from and written out to main memory.

Instead, the idea is to calculate the bitmap in blocks so that only a portion of the full bitmap is marked off for each prime at a time.

Simple solution

Simple and direct example of processing a block at a time

The normal loop is used however an outside loop is added which marches through the blocks of numbers. Inside the normal loop multiples are marked off to the end of the current block. The main change is in selecting the start position. If a * a is less than start of the current block, then instead the first multiple is found using (block_start + a - 1) / a * a.

Only full blocks of 32*1024*8 are generated. This will generate more than maxbut the bitmap is still only counted up to max

1.8s

   for (cur = 0; cur <= max; cur += 32*1024*8) {

      for (a = 2; a <= sqrtl(cur + 32*1024*8); a++) {
         if (bitmap[a / 8] & 1 << (a % 8))
            continue;

         for (r = MAX(a * a, (cur + a - 1) / a * a);
              r < cur + 32*1024*8;
              r += a)
            bitmap[r / 8] |= 1 << (r % 8);
      }
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define MAX(A,B) ((A) > (B) ? (A) : (B))
#define MIN(A,B) ((A) < (B) ? (A) : (B))

void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t  a,
             r,
             cur;


   for (cur = 0; cur <= max; cur += 32*1024*8) {

      for (a = 2; a <= sqrtl(cur + 32*1024*8); a++) {
         if (bitmap[a / 8] & 1 << (a % 8))
            continue;

         for (r = MAX(a * a, (cur + a - 1) / a * a);
              r < cur + 32*1024*8;
              r += a)
            bitmap[r / 8] |= 1 << (r % 8);
      }
   }


   /* mark bit 0 and 1 as not prime */
   bitmap[0] |= 0x3;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(1, max/8 + 32*1024);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Storing state between blocks

Storing the primes and offsets to avoid initialising betwen blocks

Instead of extracting the primes from the bitmap each time, they can be stored in a list where it is easier to retrive them. Likewise, the final mutiple from the previous block can be stored and that value can be used at the start of the next block.

In this example the sieving primes are calculated and added to the list first.

1.6s

   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (r = a * a; r < sqrtl(max); r += a)
         bitmap[r / 8] |= 1 << (r % 8);

      as[cnt] = a; rs[cnt] = r; cnt++;
   }


   for (cur = 0; cur <= max; cur += 32*1024*8) {

      for (i = 0; i < cnt; i++)
         for (; rs[i] < cur + 32*1024*8; rs[i] += as[i])
            bitmap[rs[i]/8] |= 1 << (rs[i] % 8);

   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define MIN(A,B) ((A) < (B) ? (A) : (B))

static void
calc_primes(char *bitmap, uint64_t max)
{
   uint32_t *as = malloc(sizeof (uint32_t) * (sqrtl(max) / 4 + 1000));
   uint64_t *rs = malloc(sizeof (uint64_t) * (sqrtl(max) / 4 + 1000));
   int       cnt = 0,
             i;
   uint64_t  a,
             r,
             cur;


   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (r = a * a; r < sqrtl(max); r += a)
         bitmap[r / 8] |= 1 << (r % 8);

      as[cnt] = a; rs[cnt] = r; cnt++;
   }


   for (cur = 0; cur <= max; cur += 32*1024*8) {

      for (i = 0; i < cnt; i++)
         for (; rs[i] < cur + 32*1024*8; rs[i] += as[i])
            bitmap[rs[i]/8] |= 1 << (rs[i] % 8);

   }


   /* mark bit 0 and 1 as not prime */
   bitmap[0] |= 0x3;
   free(as);
   free(rs);
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(1, max/8 + 32*1024);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Wheel Optimisations

Avoiding unneeded operations and improving storage

Wheel factorisation can be used to show that only a certain subset of numbers modulo the wheel range can possibly be prime.

This is applied in two ways:

Reducing the storage - only store possible primes
Reducing calculations - only calculate possibly prime multiples

Avoiding even numbers

A simple wheel based on 1 possible prime in every 2 numbers

One observation with prime numbers is that except for the prime 2, there are no even prime numbers. That is, all even numbers are a multiple of 2 and so are not prime. When marking the prime 2 in the simple example, every second bit is marked as not prime. Also, when marking multiples, every second multiple of a prime will land on an even bit. Both these sets of work can be made redundant by ignoring even numbers in the code and only dealing with odd numbers (and treating the prime 2 as a special case).

This is a simple form of wheel factorisation. Except for the prime 2, there is only one possible prime in every 2 numbers.

The bitmap is made to represent only odd numbers. If r is odd, the position in the bitmap is r / 2. The bit at position x refers to the number x * 2 + 1. The bitmap cannot represent even numbers.

Even results can be skipped by marking every second multiple. Instead of incrementing the result r += a it can be incremented by r += a*2.

2.5s

   for (a = 3; a <= sqrtl(max); a += 2) {
      if (bitmap[a/2/8] & 1 << ((a/2) % 8))
         continue;

      for (r = a * a; r <= max; r += a * 2)
         bitmap[r/2/8] |= 1 << ((r/2) % 8);
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>


static void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t a,
            r;


   for (a = 3; a <= sqrtl(max); a += 2) {
      if (bitmap[a/2/8] & 1 << ((a/2) % 8))
         continue;

      for (r = a * a; r <= max; r += a * 2)
         bitmap[r/2/8] |= 1 << ((r/2) % 8);
   }


   /* mark 1 as not prime */
   bitmap[0] |= 0x1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8/2 + 1, 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, (max-1)/2 + 1));

   free(bitmap);
   return EXIT_SUCCESS;
}

8/30 wheel

A wheel where there is 8 possible primes in every 30 numbers

Above it was noted that after marking multiples of the prime 2, every second bit was set. The same thought can be expanded to note that after the primes 2,3,5 are set they produce a pattern of set bits that repeats every 30 numbers (2 * 3 * 5 = 30). In that pattern there are only 8 possible primes that will not be set ((r % 30) = 1,7,11,13,17,19,23,29). This is the 8/30 wheel (Wheel factorisation).

One complexity of this wheel compared to the simple example above is in determining the bit position. The possible primes are not spread evenly over the 30 numbers and so a simple division will not work to find the correct bit.

The solution involves breaking the result in to r / 30 and r % 30. The base bit position can be found by multiplying r / 30 by the number of possible primes, and a specialised function can be used to convert r % 30 in to a bit position.

The absolute bit position is therefore r / 30 * 8 + mod_to_ind(r % 30).

Here the 8/30 wheel is extremely convenient as the 8 possible primes map directly to one byte. If there were 9 possible primes, an extra level of modular arithmetic would be required - one to convert to the absolute bit position in the bitmap, and another to convert to the byte/bit position in order to set that bit. Instead, the term r / 30 is the byte position and the term mod_to_ind(r % 30) is the final bit position.

Below the function mod_to_ind() is implemented as an ad-hoc formula bit = ((r % 30) * 8 / 30). This just happens to produce the correct bit position for each possible prime.

When iterating through the bitmap, an array of the difference between the value each bit represents is used and this is cycled through, so the next possible prime can be found using a += diff[a_i++%8]. When calculating the multiples, in the solution r = a * b both a and b should not be multiples of (2,3,5). So a virtual b is also iterating through possible primes, and the next result is found using r += a * diff[b_i++%8]

1.8s

   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << ((a % 30) * 8 / 30))
         continue;

      for (r = a * a, b_i = a_i; r <= max; r += a * diff[b_i++%8])
         bitmap[r / 30] |= 1 << ((r % 30) * 8 / 30);
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>


const char diff[] = {6,4,2,4,2,4,6,2};

static void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t  a,
             r;
   int       a_i,
             b_i;


   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << ((a % 30) * 8 / 30))
         continue;

      for (r = a * a, b_i = a_i; r <= max; r += a * diff[b_i++%8])
         bitmap[r / 30] |= 1 << ((r % 30) * 8 / 30);
   }


   /* mark 1 as not prime */
   bitmap[0] |= 0x1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + 1, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + (adj_max % 30) * 8 / 30) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Introduction

This section walks through the changes required to combine the different optimisations talked about in part 1.

The final example in this section generates primes to 10^9 in 125ms on my machine, compared to 116ms for primesieve and 79ms for hprime. This shows that these main optimisations are enough to get most of the way there.

Choosing the byte and bit in the 8/30 wheel

The examples that were used to describe efficiently marking of bits for the simple bitmap also apply to the 8/30 wheel bitmap.

It was previously shown that in the simple bitmap every 8th multiple of a prime would mark the same bit, and that they would be a bytes apart. With the 8/30 wheel this happens every 30th multiple. This can be described in the same way as being because (a * 30) / 30 is a bytes, and (a * 30) % 30 is 0.

Note that there is no need to mark every 30th multiple for all 30 starting offsets, but only the 8 possible primes described in the bitmap. This is also similar to the simple example where 8 multiples were marked at a time. The main difference is that the difference between the multiples is not 1 (and is not constant).

Mark 8 starting offsets at a time

Marking every 30th multiple for the 8 starting possible primes

This follows the example Mark all offsets at once from the simple bitmap but for the 8/30 wheel.

The 8 offsets bytes and bitmaps depend on the value that the bit at i represents. This is looked up using an array bval (bit value) bval[] = {1,7,11,13,17,19,23,29}.

The function num2bit() is also introduced to perform (num % 30) * 8 / 30

Also, the offsets and bitmasks are now explicitly stored in an array.

970ms

   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i % 8))
         continue;

      for (i = 0; i < 8; i++) {
         offsets[i] = a * bval[i] / 30;
         masks[i] = 1 << num2bit(a * bval[i]);
      }

      for (bmp = bitmap + a * (a / 30);  bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offsets[i]) |= masks[i];
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

const char diff[] = {6,4,2,4,2,4,6,2};
const char bval[] = {1,7,11,13,17,19,23,29};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end = bitmap + max / 30 + 1;
   uint64_t  a;
   int       i,
             a_i,
             offsets[8];
   char      masks[8];


   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i % 8))
         continue;

      for (i = 0; i < 8; i++) {
         offsets[i] = a * bval[i] / 30;
         masks[i] = 1 << num2bit(a * bval[i]);
      }

      for (bmp = bitmap + a * (a / 30);  bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offsets[i]) |= masks[i];
   }


   /* Fix first byte which is clobbered above */
   bitmap[0] = 0x1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(1, max/30 + sqrtl(max) + 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Marking 8 offsets with constant bitmask

Forcing the bitmask order to allow compiled in bitmasks

Each prime in the 8/30 wheel shares no prime factors with 30 (2,3,5). This means that with 30 consecutive multiples, no a % 30 value will be the same.

This is used below to store the 8 offsets in an array according to the bit position of the result. This means the bit positions will be marked in the same order for every prime and can be encoded directly in the instructions. This will still use 8 registers for holding the 8 offsets, but that is not quite as critical as using 16 registers.

Note that this exact solution couldn't be done with the simple bitmap because of the prime 2, which divides evenly in to 8 meaning not all 8 bit positions are visited.

See the inner loop in assembly here

970ms

   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      for (i = 0; i < 8; i++)
         offsets[num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offsets[i]) |= 1 << i;
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

const char diff[] = {6,4,2,4,2,4,6,2};
const char bval[] = {1,7,11,13,17,19,23,29};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end = bitmap + max / 30 + 1;
   uint64_t  a;
   int       i,
             a_i,
             offsets[8];


   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      for (i = 0; i < 8; i++)
         offsets[num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offsets[i]) |= 1 << i;
   }


   /* Fix first byte which is clobbered above */
   bitmap[0] = 0x1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + sqrtl(max) + 1, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Repeat patterns for lower primes

Adding in calculations for lower primes on the 8/30 wheel

As shown above, the pattern still repeats every a bytes on the 8/30 wheel. This means there is very little that needs changing in this section. One main point is that there are now fewer lower primes since the primes 2,3,5 do not need to be marked at all.

Repeat patterns for lower primes

Adding in calculations for lower primes on the 8/30 wheel

This follows the example Repeat patterns for lower primes in part 1. The main change is incorporating the 8/30 wheel.

This optimisation is combined with the example above.

960ms

   for (i = 1; i < 4; i++) {
      bzero(pattern, sizeof pattern);
      for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8])
         pattern[r / 30] |= 1 << num2bit(r);

      for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i] * sizeof(uint64_t))
         for (j = 0; j < bval[i]; j++)
            *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j);
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <inttypes.h>

const char diff[] = {6,4,2,4,2,4,6,2};
const char bval[] = {1,7,11,13,17,19,23,29};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end = bitmap + max / 30 + 1,
             pattern[16 * 8];
   uint64_t  a;
   int       i,
             j,
             r,
             a_i,
             offsets[8];


   for (i = 1; i < 4; i++) {
      bzero(pattern, sizeof pattern);
      for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8])
         pattern[r / 30] |= 1 << num2bit(r);

      for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i] * sizeof(uint64_t))
         for (j = 0; j < bval[i]; j++)
            *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j);
   }


   for (a = 17, a_i = 4; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      for (i = 0; i < 8; i++)
         offsets[num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offsets[i]) |= 1 << i;
   }

   /* Fix first byte which is clobbered above */
   bitmap[0] = 0x1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + sqrtl(max) + 16*8, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Calculating a block at a time

Finally all optimisations are added together. This is done in 3 parts.

The final example counts primes to 10^9 in 125ms on my i7 5820k @ 4.4GHz. This is only a fraction behind primesieve-5.2 at 116ms.

Block at a time - step 1

Block at a time for the 8/30 wheel

This follows the example Processing a block at a time in part 1, while incorporating the 8/30 wheel with with all offsets being marked

197ms

   for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) {

      for (a = 7, a_i = 1; a <= sqrtl(cur + 32*1024*30); a += diff[a_i++%8]) {
         if (bitmap[a / 30] & 1 << (a_i%8))
            continue;

         for (i = 0; i < 8; i++)
            offsets[num2bit(a * bval[i])] = a * bval[i] / 30;

         for (bmp = bitmap + MAX(a * (a / 30), (cur / 30) / a * a);
              bmp < bitmap_end;
              bmp += a) {
            for (i = 0; i < 8; i++)
               *(bmp + offsets[i]) |= 1 << i;
         }
      }

      /* First byte gets clobbered for the first block */
      if (cur == 0)
         bitmap[0] = 0x1;
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define MAX(A,B) ((A) > (B) ? (A) : (B))

const char diff[] = {6,4,2,4,2,4,6,2};
const char bval[] = {1,7,11,13,17,19,23,29};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end = bitmap + 32*1024;
   uint64_t  a,
             cur;
   int       i,
             a_i,
             offsets[8];

   for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) {

      for (a = 7, a_i = 1; a <= sqrtl(cur + 32*1024*30); a += diff[a_i++%8]) {
         if (bitmap[a / 30] & 1 << (a_i%8))
            continue;

         for (i = 0; i < 8; i++)
            offsets[num2bit(a * bval[i])] = a * bval[i] / 30;

         for (bmp = bitmap + MAX(a * (a / 30), (cur / 30) / a * a);
              bmp < bitmap_end;
              bmp += a) {
            for (i = 0; i < 8; i++)
               *(bmp + offsets[i]) |= 1 << i;
         }
      }

      /* First byte gets clobbered for the first block */
      if (cur == 0)
         bitmap[0] = 0x1;
   }

}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + 2*32*1024 + sqrtl(max) + 1, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Block at a time - step 2

Pre-processing sieving primes and storing calculations

This follows the example Storing state between blocks in part 1, while incorporating the 8/30 wheel with with all offsets being marked

142ms

   /*
    * Find sieving primes
    */
   bitmap_end = bitmap + (int64_t)sqrtl(max) + 1;
   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      as[cnt] = a;
      for (i = 0; i < 8; i++)
         offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offs[cnt*8 + i]) |= 1 << i;

      bmps[cnt++] = (bmp - bitmap);
   }

   /*
    * Calculate blocks
    */
   bitmap_end = bitmap + 32*1024;
   for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) {

      for (i = 0; i < cnt; i++)
         for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i])
            for (j = 0; j < 8; j++)
               *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j;
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define MAX(A,B) ((A) > (B) ? (A) : (B))

const char diff[] = {6,4,2,4,2,4,6,2};
const char bval[] = {1,7,11,13,17,19,23,29};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end;
   uint64_t  a,
             cur;
   uint32_t *as   = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)),
            *bmps = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)),
            *offs = malloc(sizeof(uint32_t) * 8 * (sqrtl(max) / 4 + 1000));
   int       i,
             j,
             a_i,
             cnt = 0;


   /*
    * Find sieving primes
    */
   bitmap_end = bitmap + (int64_t)sqrtl(max) + 1;
   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      as[cnt] = a;
      for (i = 0; i < 8; i++)
         offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offs[cnt*8 + i]) |= 1 << i;

      bmps[cnt++] = (bmp - bitmap);
   }

   /*
    * Calculate blocks
    */
   bitmap_end = bitmap + 32*1024;
   for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) {

      for (i = 0; i < cnt; i++)
         for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i])
            for (j = 0; j < 8; j++)
               *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j;
   }


   /* First byte gets clobbered for the first block */
   bitmap[0] = 0x1;

   free(as);
   free(bmps);
   free(offs);
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + 2*32*1024 + sqrtl(max) + 1, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Block at a time - step 3

Also marking lower primes

This adds in the lower prime optimisations

127ms

   /* Init lower primes */
   for (i = 1; i < 4; i++)
      for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8])
         pattern[i-1][r / 30] |= 1 << num2bit(r);

   /*
    * Find sieving primes
    */
   bitmap_end = bitmap + (int64_t)sqrtl(max) + 1;

   for (i = 1; i < 4; i++)
      for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i]*8)
         for (j = 0; j < bval[i]; j++)
            *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j);

   for (a = 17, a_i = 4; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      as[cnt] = a;
      for (i = 0; i < 8; i++)
         offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offs[cnt*8 + i]) |= 1 << i;

      bmps[cnt++] = (bmp - bitmap);
   }

   /*
    * Calculate blocks
    */
   bitmap_end = bitmap + 32*1024;
   for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) {

      for (i = 1; i < 4; i++)
         for (bmp = bitmap + cur/30/(8*bval[i])*(bval[i]*8); bmp < bitmap_end; bmp += bval[i]*8)
            for (j = 0; j < bval[i]; j++)
               *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j);

      for (i = 0; i < cnt; i++)
         for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i])
            for (j = 0; j < 8; j++)
               *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j;
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define MAX(A,B) ((A) > (B) ? (A) : (B))

const char diff[] = {6,4,2,4,2,4,6,2};
const char bval[] = {1,7,11,13,17,19,23,29};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bmp,
            *bitmap_end,
             pattern[3][16*8] = {{0}};
   uint64_t  a,
             cur;
   uint32_t *as   = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)),
            *bmps = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)),
            *offs = malloc(sizeof(uint32_t) * 8 * (sqrtl(max) / 4 + 1000));
   int       i,
             j,
             r,
             a_i,
             cnt = 0;


   /* Init lower primes */
   for (i = 1; i < 4; i++)
      for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8])
         pattern[i-1][r / 30] |= 1 << num2bit(r);

   /*
    * Find sieving primes
    */
   bitmap_end = bitmap + (int64_t)sqrtl(max) + 1;

   for (i = 1; i < 4; i++)
      for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i]*8)
         for (j = 0; j < bval[i]; j++)
            *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j);

   for (a = 17, a_i = 4; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      as[cnt] = a;
      for (i = 0; i < 8; i++)
         offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30;

      for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a)
         for (i = 0; i < 8; i++)
            *(bmp + offs[cnt*8 + i]) |= 1 << i;

      bmps[cnt++] = (bmp - bitmap);
   }

   /*
    * Calculate blocks
    */
   bitmap_end = bitmap + 32*1024;
   for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) {

      for (i = 1; i < 4; i++)
         for (bmp = bitmap + cur/30/(8*bval[i])*(bval[i]*8); bmp < bitmap_end; bmp += bval[i]*8)
            for (j = 0; j < bval[i]; j++)
               *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j);

      for (i = 0; i < cnt; i++)
         for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i])
            for (j = 0; j < 8; j++)
               *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j;
   }


   /* First byte gets clobbered for the first block */
   bitmap[0] = 0x1;

   free(as);
   free(bmps);
   free(offs);
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + 2*32*1024 + sqrtl(max) + 1, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Introduction

WIP

Introduction

WIP

Introduction

Environment

The examples were all built with gcc (or gcc-5) on ubuntu using the following command:

gcc-5 -g -Wall -march=native -O3 test_xxx.c -lm -o test-xxx

The times given are for my machine i7-5820K at 4.4GHz

Ubuntu had trouble ramping the cpu up fast enough so I turned all cpus to 'performance' using cpu-freq

Interesting information

Things that didn't really fit elsewhere

This section is a scratchpad for things I think of that are interesting but that I can't fit elsewhere. It is not expected to make sense outside of the context of the main parts (read the parts first).Some short interesting points are:

For calculating primes up to 2^64 the largest sieving prime will fit in 2^32. (but it is a pain casting the result of (a * a) to 64bits in the examples)

Breaking down the task

If the sieving primes are grouped by powers of 2, the workload from each group is approximately the same. If the prime doubles, then half the amount of bits need to be marked off in the bitmap. However, the number of primes falling within that power of 2 will approximately be double.

Why 10^9?

Why do benchmarks against 10^9

The simple answer is that 10^8 is too quick, and 10^10 is too slow

Another answer is that the sqrt(10^9) is 31622, which is close to the block size used for the inner block. For sieving primes larger than this there is not a full repeat pattern within the one block, which means a different approach is needed.

bytes, ints, 32bit registers all a power of 2

As an aside, for the prime 2 the pattern repeats every byte, and not just every 2 bytes. This is because 2 is a prime factor of 8 (the only prime factor) and divides evenly. Every other prime (in fact every odd number) shares no prime factors with 8 and so the pattern will only repeat every a bytes. This also applies to all other register widths that are used, as they are all powers of 2. So for example, the pattern when aligned to an int boundary will repeat every a ints (and no less, except for the prime 2). This means there are no further shortcuts (ie smaller patterns) to reduce the work for the lower primes.

Assembly comparison

Comparing the assembly of test_004.c and test_005.c (and now test_013.c)

The assembly for test_004.c shows many registers being used in the inside loop:

  400a50:	80 08 01             	or     BYTE PTR [rax],0x1
  400a53:	0f b6 54 24 20       	movzx  edx,BYTE PTR [rsp+0x20]
  400a58:	44 08 3c 38          	or     BYTE PTR [rax+rdi*1],r15b
  400a5c:	46 08 34 18          	or     BYTE PTR [rax+r11*1],r14b
  400a60:	08 14 30             	or     BYTE PTR [rax+rsi*1],dl
  400a63:	48 8b 54 24 10       	mov    rdx,QWORD PTR [rsp+0x10]
  400a68:	46 08 2c 10          	or     BYTE PTR [rax+r10*1],r13b
  400a6c:	44 08 24 10          	or     BYTE PTR [rax+rdx*1],r12b
  400a70:	42 08 2c 08          	or     BYTE PTR [rax+r9*1],bpl
  400a74:	42 08 0c 00          	or     BYTE PTR [rax+r8*1],cl
  400a78:	48 01 d8             	add    rax,rbx
  400a7b:	48 39 44 24 08       	cmp    QWORD PTR [rsp+0x8],rax
  400a80:	77 ce                	ja     400a50

By comparison, the assembly for test_005.c shows fewer registers being used in the inside loop as the offsets and bitmasks are encoded directly in the machine instructions. There are now 8 different inside loop sections. This example is for the case where (a % 8) == 1:

  400c90:	80 0a 01             	or     BYTE PTR [rdx],0x1
  400c93:	80 09 02             	or     BYTE PTR [rcx],0x2
  400c96:	80 0c 42 04          	or     BYTE PTR [rdx+rax*2],0x4
  400c9a:	80 0c 41 08          	or     BYTE PTR [rcx+rax*2],0x8
  400c9e:	80 0c 82 10          	or     BYTE PTR [rdx+rax*4],0x10
  400ca2:	80 0c 81 20          	or     BYTE PTR [rcx+rax*4],0x20
  400ca6:	4c 01 d1             	add    rcx,r10
  400ca9:	80 0c 32 40          	or     BYTE PTR [rdx+rsi*1],0x40
  400cad:	42 80 0c 02 80       	or     BYTE PTR [rdx+r8*1],0x80
  400cb2:	4c 01 ca             	add    rdx,r9
  400cb5:	48 39 d5             	cmp    rbp,rdx
  400cb8:	73 d6                	jae    400c90

test_013.c (for the 8/30 wheel) shows the use of registers for the relative offsets but immediate bitmasks.

  400aa0:	42 80 0c 18 01       	or     BYTE PTR [rax+r11*1],0x1
  400aa5:	42 80 0c 10 02       	or     BYTE PTR [rax+r10*1],0x2
  400aaa:	42 80 0c 08 04       	or     BYTE PTR [rax+r9*1],0x4
  400aaf:	42 80 0c 00 08       	or     BYTE PTR [rax+r8*1],0x8
  400ab4:	80 0c 38 10          	or     BYTE PTR [rax+rdi*1],0x10
  400ab8:	80 0c 30 20          	or     BYTE PTR [rax+rsi*1],0x20
  400abc:	80 0c 08 40          	or     BYTE PTR [rax+rcx*1],0x40
  400ac0:	80 0c 10 80          	or     BYTE PTR [rax+rdx*1],0x80
  400ac4:	48 01 d8             	add    rax,rbx
  400ac7:	49 39 c5             	cmp    r13,rax
  400aca:	77 d4                	ja     400aa0

Modular Arithmetic

Another way of viewing the Modular Arithmetic

Modular Arithmetic.

An understanding of the terms in the last line will help to see what kinds of optimisations are available when marking off multiples.

  x = x_b*8 + x_i
where
  x_b = x / 8   (or ⌊x / 8⌋ ie int division)
  x_i = x % 8

  r = a * b
    = (a_b*8 + a_i) * (b_b*8 + b_i)
    = a * b_b * 8  +  a_b * b_i * 8  +  a_i * b_i

The terms in the last line show:

That incrementing b_b without changing b_i only affects the first term. This will move the result forward by a bytes.
That the bit position is only dependant on the last term. The first two terms are multiples of 8.
The last term can be calculated independantly of a_b and b_b - There are only 8 possible values each for a_i and b_i.

Cachegrind: hprime and primesieve to 10^9

The cachegrind results of primesieve and hprime to 10^9 were interesting.

Firstly it can really be seen that primesieve concentrates on reducing cache misses. The D1 misses are just 2%. Numbers wise, hprime has nearly 3 times as many D1 cache misses. One thing though is that this is only for reads and hprime actually has fewer write misses. Also, hprime has fewer references over all.

With branch mispredictions, primesieve has nearly 3 times as many branches and over 5 times as many mispredictions.

               -- primesieve --

I   refs:      736,461,590
I1  misses:          2,775
LLi misses:          2,455
I1  miss rate:        0.00%
LLi miss rate:        0.00%

D   refs:      403,559,909  (366,095,043 rd   + 37,464,866 wr)
D1  misses:      8,086,489  (  7,491,671 rd   +    594,818 wr)
LLd misses:         16,180  (      8,440 rd   +      7,740 wr)
D1  miss rate:         2.0% (        2.0%     +        1.6%  )
LLd miss rate:         0.0% (        0.0%     +        0.0%  )

LL refs:         8,089,264  (  7,494,446 rd   +    594,818 wr)
LL misses:          18,635  (     10,895 rd   +      7,740 wr)
LL miss rate:          0.0% (        0.0%     +        0.0%  )

Branches:      101,180,663  ( 99,395,683 cond +  1,784,980 ind)
Mispredicts:     5,737,876  (  3,971,367 cond +  1,766,509 ind)
Mispred rate:          5.7% (        4.0%     +       99.0%   )

-- hprime --

I   refs:      594,838,670
I1  misses:          2,055
LLi misses:          1,918
I1  miss rate:        0.00%
LLi miss rate:        0.00%

D   refs:      322,949,133  (298,099,527 rd   + 24,849,606 wr)
D1  misses:     21,850,722  ( 21,640,269 rd   +    210,453 wr)
LLd misses:          4,368  (      2,690 rd   +      1,678 wr)
D1  miss rate:         6.8% (        7.3%     +        0.8%  )
LLd miss rate:         0.0% (        0.0%     +        0.0%  )

LL refs:        21,852,777  ( 21,642,324 rd   +    210,453 wr)
LL misses:           6,286  (      4,608 rd   +      1,678 wr)
LL miss rate:          0.0% (        0.0%     +        0.0%  )

Branches:       37,871,700  ( 37,677,506 cond +    194,194 ind)
Mispredicts:       993,423  (    989,096 cond +      4,327 ind)
Mispred rate:          2.6% (        2.6%     +        2.2%   )

Failed examples

I decided these were too complex to make it in

Choosing a path

There are 8 possible paths when marking every a bits

After marking the first r at a given byte and bit, the byte and bit of the next multiple is calculated by moving forward a bits. This can be broken in to two parts, moving forward a / 8 bytes, followed by a % 8 bits. When adding the final bits this will either overflow in to the next byte ( + 1), or remain the same. This possible overflow byte and the final bit position depends entirely on the a % 8 value and the previously bit offset r % 8.

As shown above, after 8 multiples the same bit position will occur again, which completes a pattern. Each prime can be made to start marking multiples on the 0th bit by starting from r = a * (a / 8), and so can start from a known place. This means that each of the 8 possible a % 8 values has a set pattern of 8 bitmask and overflow byte results that can be used to set the next 8 multiples.

The code below chooses the path based on the value of a % 8. This is done in a way that allows the compiler to separately optimise each path. Note that all of the i * j terms in the mark_multiples function will be resolved at compile time.

One main advantage of this example is that it can encode the overflows and bit masks within the machine instructions themselves, instead of storing the information in registers (see here).

Note: This now involves a choice/branch for every prime that is encountered. This choice can be reduced by processing primes with the same a % 8 value at once. This would also reduce the work to extract the prime from the bitmap. However that part of the optimisation is not shown here.

4.7s

static inline void __attribute__((always_inline))
mark_multiples(char *bitmap, uint64_t max, uint64_t a, int j)
{
   char *bmp;
   int   i;

   for (bmp = bitmap + a * (a / 8); bmp <= bitmap + max/8; )
      for (i = 0; i < 8; i++, bmp += a/8 + ((i * j / 8) - (i-1) * j / 8))
         *bmp |= 1 << ((i * j) % 8);
}


...


   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      switch (a % 8) {
         case 0: mark_multiples(bitmap, max, a, 0); break;
         case 1: mark_multiples(bitmap, max, a, 1); break;
         case 2: mark_multiples(bitmap, max, a, 2); break;
         case 3: mark_multiples(bitmap, max, a, 3); break;
         case 4: mark_multiples(bitmap, max, a, 4); break;
         case 5: mark_multiples(bitmap, max, a, 5); break;
         case 6: mark_multiples(bitmap, max, a, 6); break;
         case 7: mark_multiples(bitmap, max, a, 7); break;
      }
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>


static inline void __attribute__((always_inline))
mark_multiples(char *bitmap, uint64_t max, uint64_t a, int j)
{
   char *bmp;
   int   i;

   for (bmp = bitmap + a * (a / 8); bmp <= bitmap + max/8; )
      for (i = 0; i < 8; i++, bmp += a/8 + ((i * j / 8) - (i-1) * j / 8))
         *bmp |= 1 << ((i * j) % 8);
}


static void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t a;


   for (a = 2; a <= sqrtl(max); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      switch (a % 8) {
         case 0: mark_multiples(bitmap, max, a, 0); break;
         case 1: mark_multiples(bitmap, max, a, 1); break;
         case 2: mark_multiples(bitmap, max, a, 2); break;
         case 3: mark_multiples(bitmap, max, a, 3); break;
         case 4: mark_multiples(bitmap, max, a, 4); break;
         case 5: mark_multiples(bitmap, max, a, 5); break;
         case 6: mark_multiples(bitmap, max, a, 6); break;
         case 7: mark_multiples(bitmap, max, a, 7); break;
      }
   }


   /* Reset the first byte as it is clobbered by the above algorithm*/
   bitmap[0] = 0x53;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/8 + sqrtl(max) + 1, 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}

Choosing a path

Choosing a path with the 8/30 wheel

This follows the example Choosing a path from the simple bitmap but for the 8/30 wheel.

In the simple example the next multiple was shown to be a / 8 bytes apart with a possible single byte adjustment.

In this example it is not the next multiple that needs to be marked but the multiple associated with the next possible prime, so the base increment is a / 30 * diff[i], and the number of overflow bytes can be up to diff[i] bytes.

The assembly shows that this is not optimised as well as in the simple example, and potentially not as well as the example above. However it will later be shown that this mechanism is still useful moving forwards.

Once again note that the (iv*jv) terms can all be resolved at compile time and is what the compiler can use to optimise the different branches.

970ms

static inline void __attribute__((always_inline))
mark_multiples (char *bitmap, char *bitmap_end, uint64_t a, int jv)
{
   char *bmp;
   int   i,
         iv;

   bmp = bitmap + a * (a/30) + a/30;
   while (bmp < bitmap_end) {
      for (iv = 1, i = 0; i < 8; iv += diff[i++]) {
         *bmp |= 1 << num2bit(iv * jv);
          bmp += a/30 * diff[i]  +  (((iv + diff[i]) * jv) / 30 - (iv * jv) / 30);
      }
   }
}



...


   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      switch (a_i % 8) {
         case 0 : mark_multiples(bitmap, bitmap_end, a,  1); break;
         case 1 : mark_multiples(bitmap, bitmap_end, a,  7); break;
         case 2 : mark_multiples(bitmap, bitmap_end, a, 11); break;
         case 3 : mark_multiples(bitmap, bitmap_end, a, 13); break;
         case 4 : mark_multiples(bitmap, bitmap_end, a, 17); break;
         case 5 : mark_multiples(bitmap, bitmap_end, a, 19); break;
         case 6 : mark_multiples(bitmap, bitmap_end, a, 23); break;
         case 7 : mark_multiples(bitmap, bitmap_end, a, 29); break;
      }
   }

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

const char diff[] = {6,4,2,4,2,4,6,2};

static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }


static inline void __attribute__((always_inline))
mark_multiples (char *bitmap, char *bitmap_end, uint64_t a, int jv)
{
   char *bmp;
   int   i,
         iv;

   bmp = bitmap + a * (a/30) + a/30;
   while (bmp < bitmap_end) {
      for (iv = 1, i = 0; i < 8; iv += diff[i++]) {
         *bmp |= 1 << num2bit(iv * jv);
          bmp += a/30 * diff[i]  +  (((iv + diff[i]) * jv) / 30 - (iv * jv) / 30);
      }
   }
}


static void
calc_primes(char *bitmap, uint64_t max)
{
   char     *bitmap_end = bitmap + max/30 + 1;
   uint64_t  a;
   int       a_i;


   for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) {
      if (bitmap[a / 30] & 1 << (a_i%8))
         continue;

      switch (a_i % 8) {
         case 0 : mark_multiples(bitmap, bitmap_end, a,  1); break;
         case 1 : mark_multiples(bitmap, bitmap_end, a,  7); break;
         case 2 : mark_multiples(bitmap, bitmap_end, a, 11); break;
         case 3 : mark_multiples(bitmap, bitmap_end, a, 13); break;
         case 4 : mark_multiples(bitmap, bitmap_end, a, 17); break;
         case 5 : mark_multiples(bitmap, bitmap_end, a, 19); break;
         case 6 : mark_multiples(bitmap, bitmap_end, a, 23); break;
         case 7 : mark_multiples(bitmap, bitmap_end, a, 29); break;
      }
   }


   /* Reset first byte clobbered above */
   bitmap[0] = 0x1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = calloc(max/30 + sqrtl(max) + 1, 1);
   uint64_t  adj_max;
   int       a_i;

   /* adjust max to a multiple of 2,3,5 */
   for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0;
        adj_max + diff[a_i%8] <= max;
        adj_max += diff[a_i++%8])
      ;

   calc_primes(bitmap, adj_max);

   printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3);

   free(bitmap);
   return EXIT_SUCCESS;
}

Reusing the block

Introducing abstractions and using the same block of memory each time

As explained in the Simple sieve of Eratosthenes section, there is a distinction between generating the primes and using the generated primes.

The block-at-a-time concept can also be applied to the "use the bitmap" function, which here is counting primes.

One reason to do this is that if the primes are both generated and consumed in blocks, there is no need to keep anything other than the current block in memory. The example below only allocates 32*1024 bytes for the bitmap.

To faciliate the caller counting primes in blocks some state is kept, which complicates the example. However, the idea is to present a clear concept of generating and then consuming a block at a time.

To reduce the changes and keep the general algorithm the same, re-using the block is achieved by moving the base pointer backwards so the new offsets now fall inside the same block. This keeps the algorithm the same, however is arguably not the best approach moving forwards.

1.6s

   for (i = 0; i < ctx->cnt; i++)
      for (; ctx->rs[i] < ctx->next; ctx->rs[i] += ctx->as[i])
         bitmap[ctx->rs[i] / 8] |= 1 << (ctx->rs[i] % 8);

   for (a = MAX(ctx->cur,ctx->as[ctx->cnt-1]); a < MIN(sqrtl(ctx->max), ctx->cur + 32*1024*8); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      ctx->as[ctx->cnt] = a; ctx->rs[ctx->cnt] = a * a; ctx->cnt++;
   }



...


   init_ctx(&ctx, max);

   while (calc_block(&ctx))
      num_primes += count_primes(ctx.bitmap, MIN(32*1024*8 - 1, max - ctx.cur));

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>
#include <math.h>

#define MAX(A,B) ((A) > (B) ? (A) : (B))
#define MIN(A,B) ((A) < (B) ? (A) : (B))
#define CEIL_TO(A, B) (((A) + (B) - 1) / (B) * (B))

struct ctx {
   char     *bitmap;
   uint64_t  max;
   uint64_t  cur;
   uint64_t  next;

   uint32_t *as;
   uint64_t *rs;
   int       cnt;
};


static void
init_ctx (struct ctx *ctx, uint64_t max)
{
   uint32_t a;
   uint32_t r;

   ctx->bitmap = calloc(1, 32*1024 );
   ctx->as = malloc(sizeof (uint32_t) * (sqrtl(max) / 4 + 1000));
   ctx->rs = malloc(sizeof (uint64_t) * (sqrtl(max) / 4 + 1000));
   ctx->max = max;
   ctx->cur = 0;
   ctx->next = 0;
   ctx->cnt = 0;

   /* Bootstrap primes for the first block */
   for (a = 2; a <= sqrtl(32*1024*8); a++) {
      if (ctx->bitmap[a / 8] & 1 << (a % 8))
         continue;

      for (r = a * a; r < sqrtl(32*1024*8); r += a)
         ctx->bitmap[r / 8] |= 1 << (r % 8);

      ctx->as[ctx->cnt] = a; ctx->rs[ctx->cnt] = r; ctx->cnt++;
   }

   ctx->bitmap[0] |= 0x3;
}



static int
calc_block(struct ctx *ctx)
{
   int       i;
   char     *bitmap;
   uint32_t  a;

   ctx->cur = ctx->next;
   ctx->next += 32*1024*8;

   if (ctx->cur > ctx->max)
      return 0;

   if (ctx->cur != 0)
      bzero(ctx->bitmap, 32*1024);

   bitmap = ctx->bitmap - ctx->cur/8;


   for (i = 0; i < ctx->cnt; i++)
      for (; ctx->rs[i] < ctx->next; ctx->rs[i] += ctx->as[i])
         bitmap[ctx->rs[i] / 8] |= 1 << (ctx->rs[i] % 8);

   for (a = MAX(ctx->cur,ctx->as[ctx->cnt-1]); a < MIN(sqrtl(ctx->max), ctx->cur + 32*1024*8); a++) {
      if (bitmap[a / 8] & 1 << (a % 8))
         continue;

      ctx->as[ctx->cnt] = a; ctx->rs[ctx->cnt] = a * a; ctx->cnt++;
   }


   return 1;
}


static uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ul;
   uint64_t  num_primes = 0ul;
   struct ctx ctx;


   init_ctx(&ctx, max);

   while (calc_block(&ctx))
      num_primes += count_primes(ctx.bitmap, MIN(32*1024*8 - 1, max - ctx.cur));


   printf("%ld\n", num_primes);

   return EXIT_SUCCESS;
}

Vectors for simple bitmap

From Applying patterns for smaller primes section, the idea is to apply a registers worth of bits at a time.

Even with very nice vectorisable code, gcc didn't seem to want to do what I expected until I forced it to using vectors.

Note that the peformance isn't really measurably for this example because of the memory bottleneck, and I don't actually know if what I imagined is better or if gcc's way was better for this example.

However, to show what I expected, the below example produces this assembly for 7:

Note: I deleted single lines with : ff presuming they were from the previous instruction??

  400ba0:	c5 c5 eb 00          	vpor   ymm0,ymm7,YMMWORD PTR [rax]
  400ba4:	48 05 e0 00 00 00    	add    rax,0xe0
  400baa:	c5 fd 7f 80 20 ff ff 	vmovdqa YMMWORD PTR [rax-0xe0],ymm0
  400bb2:	c5 cd eb 80 40 ff ff 	vpor   ymm0,ymm6,YMMWORD PTR [rax-0xc0]
  400bba:	c5 fd 7f 80 40 ff ff 	vmovdqa YMMWORD PTR [rax-0xc0],ymm0
  400bc2:	c5 d5 eb 80 60 ff ff 	vpor   ymm0,ymm5,YMMWORD PTR [rax-0xa0]
  400bca:	c5 fd 7f 80 60 ff ff 	vmovdqa YMMWORD PTR [rax-0xa0],ymm0
  400bd2:	c5 dd eb 40 80       	vpor   ymm0,ymm4,YMMWORD PTR [rax-0x80]
  400bd7:	c5 fd 7f 40 80       	vmovdqa YMMWORD PTR [rax-0x80],ymm0
  400bdc:	c5 e5 eb 40 a0       	vpor   ymm0,ymm3,YMMWORD PTR [rax-0x60]
  400be1:	c5 fd 7f 40 a0       	vmovdqa YMMWORD PTR [rax-0x60],ymm0
  400be6:	c5 ed eb 40 c0       	vpor   ymm0,ymm2,YMMWORD PTR [rax-0x40]
  400beb:	c5 fd 7f 40 c0       	vmovdqa YMMWORD PTR [rax-0x40],ymm0
  400bf0:	c5 f5 eb 40 e0       	vpor   ymm0,ymm1,YMMWORD PTR [rax-0x20]
  400bf5:	c5 fd 7f 40 e0       	vmovdqa YMMWORD PTR [rax-0x20],ymm0
  400bfa:	48 39 c2             	cmp    rdx,rax
  400bfd:	75 a1                	jne    400ba0

typedef uint64_t v32_4si __attribute__((vector_size(32)));

static inline void
mark_low_prime (char *bitmap, uint64_t max, int a)
{
   int r, n;
   char     *pattern = aligned_alloc(32, a*32);
   register v32_4si v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, va, vb, vc;
   v32_4si *out = (v32_4si *)bitmap;

   bzero(pattern, a*32);

   for (r = 0; r < a*32*8; r += a)
      pattern[r/8] |= 1ul << r%8;

   switch (a) {
      case 13:
         vc = *(v32_4si *)(pattern + 32*12);
         vb = *(v32_4si *)(pattern + 32*11);
      case 11:
         va = *(v32_4si *)(pattern + 32*10);
         v9 = *(v32_4si *)(pattern + 32*9);
         v8 = *(v32_4si *)(pattern + 32*8);
         v7 = *(v32_4si *)(pattern + 32*7);
      case 7:
         v6 = *(v32_4si *)(pattern + 32*6);
         v5 = *(v32_4si *)(pattern + 32*5);
      case 5:
         v4 = *(v32_4si *)(pattern + 32*4);
         v3 = *(v32_4si *)(pattern + 32*3);
      case 3:
         v2 = *(v32_4si *)(pattern + 32*2);
      case 2:
         v1 = *(v32_4si *)(pattern + 32*1);
         v0 = *(v32_4si *)(pattern + 32*0);
   }

   n = max/8/32/a + 1;
   while (n--) {
      *out++ |= v0;
      *out++ |= v1;
      if (a == 2) continue;
      *out++ |= v2;
      if (a == 3) continue;
      *out++ |= v3;
      *out++ |= v4;
      if (a == 5) continue;
      *out++ |= v5;
      *out++ |= v6;
      if (a == 7) continue;
      *out++ |= v7;
      *out++ |= v8;
      *out++ |= v9;
      *out++ |= va;
      if (a == 11) continue;
      *out++ |= vb;
      *out++ |= vc;
   }
}



...


   mark_low_prime (bitmap, max, 2);
   mark_low_prime (bitmap, max, 3);
   mark_low_prime (bitmap, max, 5);
   mark_low_prime (bitmap, max, 7);
   mark_low_prime (bitmap, max, 11);
   mark_low_prime (bitmap, max, 13);

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <inttypes.h>


typedef uint64_t v32_4si __attribute__((vector_size(32)));

static inline void
mark_low_prime (char *bitmap, uint64_t max, int a)
{
   int r, n;
   char     *pattern = aligned_alloc(32, a*32);
   register v32_4si v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, va, vb, vc;
   v32_4si *out = (v32_4si *)bitmap;

   bzero(pattern, a*32);

   for (r = 0; r < a*32*8; r += a)
      pattern[r/8] |= 1ul << r%8;

   switch (a) {
      case 13:
         vc = *(v32_4si *)(pattern + 32*12);
         vb = *(v32_4si *)(pattern + 32*11);
      case 11:
         va = *(v32_4si *)(pattern + 32*10);
         v9 = *(v32_4si *)(pattern + 32*9);
         v8 = *(v32_4si *)(pattern + 32*8);
         v7 = *(v32_4si *)(pattern + 32*7);
      case 7:
         v6 = *(v32_4si *)(pattern + 32*6);
         v5 = *(v32_4si *)(pattern + 32*5);
      case 5:
         v4 = *(v32_4si *)(pattern + 32*4);
         v3 = *(v32_4si *)(pattern + 32*3);
      case 3:
         v2 = *(v32_4si *)(pattern + 32*2);
      case 2:
         v1 = *(v32_4si *)(pattern + 32*1);
         v0 = *(v32_4si *)(pattern + 32*0);
   }

   n = max/8/32/a + 1;
   while (n--) {
      *out++ |= v0;
      *out++ |= v1;
      if (a == 2) continue;
      *out++ |= v2;
      if (a == 3) continue;
      *out++ |= v3;
      *out++ |= v4;
      if (a == 5) continue;
      *out++ |= v5;
      *out++ |= v6;
      if (a == 7) continue;
      *out++ |= v7;
      *out++ |= v8;
      *out++ |= v9;
      *out++ |= va;
      if (a == 11) continue;
      *out++ |= vb;
      *out++ |= vc;
   }
}



void
calc_primes(char *bitmap, uint64_t max)
{
   uint64_t  a,
             r;


   mark_low_prime (bitmap, max, 2);
   mark_low_prime (bitmap, max, 3);
   mark_low_prime (bitmap, max, 5);
   mark_low_prime (bitmap, max, 7);
   mark_low_prime (bitmap, max, 11);
   mark_low_prime (bitmap, max, 13);


   for (a = 17; a < sqrtl(max); a++) {
      if (bitmap[a/8] & 1 << (a%8))
         continue;

      for (r = a * a; r <= max; r += a)
         bitmap[r/8] |= 1 << (r%8);
   }


   /* The first 16 bytes is ruined by the pattern setting */
   bitmap[0] = 0x53;
   bitmap[1] = 0xD7;
}


uint64_t
count_primes(char *bitmap, uint64_t max_bit) {
   uint64_t *p = (uint64_t *)bitmap,
             c = max_bit / 64,
         count = max_bit + 1;

   while (c--)
      count -= __builtin_popcountll(*p++);

   count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64))));

   return count;
}


int
main(int argc, char *argv[])
{
   uint64_t  max = argc > 1 ? atol(argv[1]) : 1000000000ull;
   char     *bitmap = aligned_alloc(32, max/8 + 32*16*2 + 1);

   bzero(bitmap, max/8 + 32*16*2 + 1);

   calc_primes(bitmap, max);

   printf("%ld\n", count_primes(bitmap, max));

   free(bitmap);
   return EXIT_SUCCESS;
}