hprime is a fast prime generator based on the sieve of Eratosthenes.
I plan to present the code in 4 parts:
This section introduces some of the optimisations that are used in hprime. The optimisations are all shown based off a simple implementation of the sieve of Eratosthenes (which is shown first). The idea is to show just the required changes so that each optimisation is easy to understand.
A simple sieve of Eratosthenes using a bitmap for storing primality with the optimisation of starting from the primes square
For a detailed explanation of the sieve of Eratosthenes, see the Wikipedia article.
A simple sieve of Eratosthenes using a bitmap for storing primality with the optimisation of starting from the primes square
The following is a basic description of how the sieve works.
The primality of a number can be stored in one bit - the number is either prime or it is not. Here, a bitmap is used where the number is determined by the position in the bitmap and the primality of that number is determined by whether the bit is not set.
The bitmap is used to search for a prime from smallest to largest. When a prime is found, all multiples of that prime are marked off on the bitmap for the remainder of the desired range as being "not prime". By working from smallest to largest every number that is checked will have had all possible prime factors checked first, and so the primality is known.
(0 and 1 have been pre-marked as not prime and the implementation starts at 2, with 2 being the first prime).
Note that this already contains an optimisation in that instead of starting from prime * 2
, it starts from prime * prime
, as multiples for numbers smaller than prime
will have already been marked off. However, this example is otherwise supposed to show the basic principle and is not an optimised version.
for (a = 2; a <= max; a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; num_primes++; for (r = a * a; r <= max; r += a) bitmap[r / 8] |= 1 << (r % 8); }
#include <stdio.h> #include <stdlib.h> #include <inttypes.h> static uint64_t calc_primes(char *bitmap, uint64_t max) { uint64_t a, r, num_primes = 0;for (a = 2; a <= max; a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; num_primes++; for (r = a * a; r <= max; r += a) bitmap[r / 8] |= 1 << (r % 8); }return num_primes; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8 + 1, 1); printf("%ld\n", calc_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
A quick method of counting primes in order to focus on speed of generation
The above animation intended to both mirror that in Wikipedia, and also show how the mechanism can be used to discover all primes within the range. However it is easy to see in the example above that for a > 11
, there are no more changes to the bitmap. This is because multiples are marked starting at a * a
, and so any value of a
greater than sqrt(max)
will make no changes to the bitmap.
The rest of the examples below take advantage of this and break the task down in to two parts:
The emphasis of the optimisations is on the first part, generating the bitmap.
As a baseline for demostrating using the bitmap, the primes are counted. This is relatively quick and simple as there is no need to know what number each bit represents, just how many bits are set.
The main algorithm here is the same as above except that it exits once reaching sqrt(max)
and the primes are counted separately.
The count_primes() function can be viewed in the full code. There is no need to understand this function to continue, it just counts the number of bits up to (and including) the specified bit.
for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = a * a; r <= max; r += a) bitmap[r / 8] |= 1 << (r % 8); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> static void calc_primes(char *bitmap, uint64_t max) { uint64_t a, r;for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = a * a; r <= max; r += a) bitmap[r / 8] |= 1 << (r % 8); }/* mark bit 0 and 1 as not prime */ bitmap[0] |= 0x3; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8 + 1, 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
Avoiding extra calculations when choosing the byte and bit
In the above animation, the bytes boundaries are shown as well as the bits. A byte is of interest because it is the smallest addressable location for the CPU. There is no "bit pointer" that can be incremented by the prime each time, and there is no CPU instruction to directly set the required bit at an offset.
This means to "set the bit at location 49" actually requires work to determine the byte that the bit is located in, and then the bit location within that byte. This is the rather ugly bitmap[r/8] |= 1 << (r%8)
line in the above example.
The task of marking off multiples can be thought of in terms of Modular Arithmetic.
The examples below demonstrate how all multiples can be marked off in an efficient way without calculating the divisor and remainder each time.
Keep the same bitmask by marking every 8th multiple
A simple solution to reduce the work required to choose the byte and the bit is to set every 8th multiple of the prime.
This means, to find the next multiple, the resulting bit position is increased by a * 8
bits, which is equivilant to a * 8 / 8
or a
bytes. As this divides evenly the bit position within the byte doesn't change (ie (8 * a) % 8
equals 0).
Setting every 8th multiple of the prime then needs to be done 8 times for the 8 different starting offsets.
Note: one small optimisation is to use a pointer instead of an array and offset, since the next byte position is being chosen based off the previous one.
Note: The bitmask only needs to be calculated once for each a
and i
, which the compiler can easily pick up
for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (i = 0; i < 8; i++) for (bmp = bitmap + a * (a+i) / 8; bmp < bitmap_end; bmp += a) *bmp |= 1 << ((a * (a+i)) % 8); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end = bitmap + max/8 + 1; uint64_t a; int i;for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (i = 0; i < 8; i++) for (bmp = bitmap + a * (a+i) / 8; bmp < bitmap_end; bmp += a) *bmp |= 1 << ((a * (a+i)) % 8); }/* mark bit 0 and 1 as not prime */ bitmap[0] |= 0x3; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8 + 1, 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
When marking every 8th multiple, process all starting offsets at the same time
The difference between the 8 initial byte offsets will remain the same as the next 8th multiple is marked for each (since all offsets increment by a
bytes). Also from above the bit position does not change for each initial offset. This can be described by saying that the same pattern of 8 byte offsets and bitmasks will repeat every a
bytes. These can be stored and applied over and over to the bitmap.
This is used below by continually incrementing a base pointer by a
bytes, and describing the 8 offsets relative to the base pointer.
Instead of the base byte pointer being (a * a) / 8
as above, it is chosen as a * (a / 8)
, which means the byte offsets become simply a * i / 8
and the bitmasks a * i % 8
. One way to think of this (without maths) is to note that a * (a / 8)
is a multiple of a
bytes so the pattern is the same as that which starts at 0.
Note: Starting at a * (a / 8) means the first byte is clobbered and needs to be corrected.
Note: Allowance has been made for the sets to overrun bitmap_end.
for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (bmp = bitmap + a * (a / 8); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + (a * i / 8)) |= 1 << (a * i % 8); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end = bitmap + max/8 + 1; uint64_t a; int i;for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (bmp = bitmap + a * (a / 8); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + (a * i / 8)) |= 1 << (a * i % 8); }/* Reset the first byte as it is clobbered by the above algorithm*/ bitmap[0] = 0x53; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8 + sqrtl(max) + 1, 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
Taking advantage of repeating patterns for small primes
It has been shown above that when marking off all multiples there is a pattern of bit/bytes that repeats every a
bytes. For smaller primes where there is more than one bit per register-width, the pattern can be used to reduce the work required by using a single instruction to set all required bits in that register-width at once.
An example of directly setting the pattern for the lower primes
All odd primes (or all odd numbers) will not divide evenly in any register-width which are all powers of 2. Instead the pattern will repeat every a
bytes, or a
ints, or a
32-bit ymm registers. One way to think of it is that 8 repeating patterns of a
bytes will fit in a
8-byte integers.
For primes less than 16, there are enough registers so that the a
parts of a repeating pattern can be held in a
registers of the chosen register width. These registers can then be continually applied one after the other in a loop with no manipulation required.
In the animation, 8 byte ints are used for the primes 2 and 3 where the bit patterns for the lower primes have already been pre-set. This is just to show the effective speedup that this method can introduce. It is especially noticeable since in the initial animation, marking off the primes 2 and 3 were by far the slowest operation.
If viewing the full code, the patterns are dynamically created based on the prime and then repeatedly applied for all primes less than 16
Note: gcc doesn't use the registers as described for this example, but will for others. It can be forced - see here
for (i = 0; i < 6; i++) { bzero(pattern, sizeof pattern); a = low_a[i]; for (r = 0; r < a*64; r += a) pattern[r / 8] |= 1 << r % 8; for (bmp = bitmap; bmp < bitmap_end; bmp += a * sizeof(uint64_t)) for (j = 0; j < a; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j); }
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <math.h> #include <inttypes.h> static void calc_primes(char *bitmap, uint64_t max) { uint64_t a, r, i, j; int low_a[] = {2,3,5,7,11,13}; char pattern[16*8], *bmp, *bitmap_end = bitmap + max/8 + 1;for (i = 0; i < 6; i++) { bzero(pattern, sizeof pattern); a = low_a[i]; for (r = 0; r < a*64; r += a) pattern[r / 8] |= 1 << r % 8; for (bmp = bitmap; bmp < bitmap_end; bmp += a * sizeof(uint64_t)) for (j = 0; j < a; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j); }for (a = 17; a < sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = a * a; r <= max; r += a) bitmap[r / 8] |= 1 << (r % 8); } /* Reset the first bytes as they are clobbered by the above algorithm*/ bitmap[0] = 0x53; bitmap[1] = 0xD7; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8 + 16*8 + 1, 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
Maximising use of CPU caches by processing a block of numbers at a time
When calculating the primes to 1 billion the bitmap needs to be ~120MB which is much larger than the cpu caches. Currently, when setting the multiples, the full 120MB of the array is traversed for each prime and so needs to be read from and written out to main memory.
Instead, the idea is to calculate the bitmap in blocks so that only a portion of the full bitmap is marked off for each prime at a time.
Simple and direct example of processing a block at a time
The normal loop is used however an outside loop is added which marches through the blocks of numbers. Inside the normal loop multiples are marked off to the end of the current block. The main change is in selecting the start position. If a * a
is less than start of the current block, then instead the first multiple is found using (block_start + a - 1) / a * a
.
Only full blocks of 32*1024*8
are generated. This will generate more than max
but the bitmap is still only counted up to max
for (cur = 0; cur <= max; cur += 32*1024*8) { for (a = 2; a <= sqrtl(cur + 32*1024*8); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = MAX(a * a, (cur + a - 1) / a * a); r < cur + 32*1024*8; r += a) bitmap[r / 8] |= 1 << (r % 8); } }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> #define MAX(A,B) ((A) > (B) ? (A) : (B)) #define MIN(A,B) ((A) < (B) ? (A) : (B)) void calc_primes(char *bitmap, uint64_t max) { uint64_t a, r, cur;for (cur = 0; cur <= max; cur += 32*1024*8) { for (a = 2; a <= sqrtl(cur + 32*1024*8); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = MAX(a * a, (cur + a - 1) / a * a); r < cur + 32*1024*8; r += a) bitmap[r / 8] |= 1 << (r % 8); } }/* mark bit 0 and 1 as not prime */ bitmap[0] |= 0x3; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(1, max/8 + 32*1024); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
Storing the primes and offsets to avoid initialising betwen blocks
Instead of extracting the primes from the bitmap each time, they can be stored in a list where it is easier to retrive them. Likewise, the final mutiple from the previous block can be stored and that value can be used at the start of the next block.
In this example the sieving primes are calculated and added to the list first.
for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = a * a; r < sqrtl(max); r += a) bitmap[r / 8] |= 1 << (r % 8); as[cnt] = a; rs[cnt] = r; cnt++; } for (cur = 0; cur <= max; cur += 32*1024*8) { for (i = 0; i < cnt; i++) for (; rs[i] < cur + 32*1024*8; rs[i] += as[i]) bitmap[rs[i]/8] |= 1 << (rs[i] % 8); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> #define MIN(A,B) ((A) < (B) ? (A) : (B)) static void calc_primes(char *bitmap, uint64_t max) { uint32_t *as = malloc(sizeof (uint32_t) * (sqrtl(max) / 4 + 1000)); uint64_t *rs = malloc(sizeof (uint64_t) * (sqrtl(max) / 4 + 1000)); int cnt = 0, i; uint64_t a, r, cur;for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; for (r = a * a; r < sqrtl(max); r += a) bitmap[r / 8] |= 1 << (r % 8); as[cnt] = a; rs[cnt] = r; cnt++; } for (cur = 0; cur <= max; cur += 32*1024*8) { for (i = 0; i < cnt; i++) for (; rs[i] < cur + 32*1024*8; rs[i] += as[i]) bitmap[rs[i]/8] |= 1 << (rs[i] % 8); }/* mark bit 0 and 1 as not prime */ bitmap[0] |= 0x3; free(as); free(rs); } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(1, max/8 + 32*1024); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
Avoiding unneeded operations and improving storage
Wheel factorisation can be used to show that only a certain subset of numbers modulo the wheel range can possibly be prime.
This is applied in two ways:
A simple wheel based on 1 possible prime in every 2 numbers
One observation with prime numbers is that except for the prime 2, there are no even prime numbers. That is, all even numbers are a multiple of 2 and so are not prime. When marking the prime 2 in the simple example, every second bit is marked as not prime. Also, when marking multiples, every second multiple of a prime will land on an even bit. Both these sets of work can be made redundant by ignoring even numbers in the code and only dealing with odd numbers (and treating the prime 2
as a special case).
This is a simple form of wheel factorisation. Except for the prime 2, there is only one possible prime in every 2 numbers.
The bitmap is made to represent only odd numbers. If r
is odd, the position in the bitmap is r / 2
. The bit at position x
refers to the number x * 2 + 1
. The bitmap cannot represent even numbers.
Even results can be skipped by marking every second multiple. Instead of incrementing the result r += a
it can be incremented by r += a*2
.
for (a = 3; a <= sqrtl(max); a += 2) { if (bitmap[a/2/8] & 1 << ((a/2) % 8)) continue; for (r = a * a; r <= max; r += a * 2) bitmap[r/2/8] |= 1 << ((r/2) % 8); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> static void calc_primes(char *bitmap, uint64_t max) { uint64_t a, r;for (a = 3; a <= sqrtl(max); a += 2) { if (bitmap[a/2/8] & 1 << ((a/2) % 8)) continue; for (r = a * a; r <= max; r += a * 2) bitmap[r/2/8] |= 1 << ((r/2) % 8); }/* mark 1 as not prime */ bitmap[0] |= 0x1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8/2 + 1, 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, (max-1)/2 + 1)); free(bitmap); return EXIT_SUCCESS; }
A wheel where there is 8 possible primes in every 30 numbers
Above it was noted that after marking multiples of the prime 2, every second bit was set. The same thought can be expanded to note that after the primes 2,3,5 are set they produce a pattern of set bits that repeats every 30 numbers (2 * 3 * 5 = 30
). In that pattern there are only 8 possible primes that will not be set ((r % 30) = 1,7,11,13,17,19,23,29
). This is the 8/30 wheel (Wheel factorisation).
One complexity of this wheel compared to the simple example above is in determining the bit position. The possible primes are not spread evenly over the 30 numbers and so a simple division will not work to find the correct bit.
The solution involves breaking the result in to r / 30
and r % 30
. The base bit position can be found by multiplying r / 30
by the number of possible primes, and a specialised function can be used to convert r % 30
in to a bit position.
The absolute bit position is therefore r / 30 * 8 + mod_to_ind(r % 30)
.
Here the 8/30 wheel is extremely convenient as the 8 possible primes map directly to one byte. If there were 9 possible primes, an extra level of modular arithmetic would be required - one to convert to the absolute bit position in the bitmap, and another to convert to the byte/bit position in order to set that bit. Instead, the term r / 30
is the byte position and the term mod_to_ind(r % 30)
is the final bit position.
Below the function mod_to_ind() is implemented as an ad-hoc formula bit = ((r % 30) * 8 / 30)
. This just happens to produce the correct bit position for each possible prime.
When iterating through the bitmap, an array of the difference between the value each bit represents is used and this is cycled through, so the next possible prime can be found using a += diff[a_i++%8]
. When calculating the multiples, in the solution r = a * b
both a
and b
should not be multiples of (2,3,5). So a virtual b
is also iterating through possible primes, and the next result is found using r += a * diff[b_i++%8]
for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << ((a % 30) * 8 / 30)) continue; for (r = a * a, b_i = a_i; r <= max; r += a * diff[b_i++%8]) bitmap[r / 30] |= 1 << ((r % 30) * 8 / 30); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> const char diff[] = {6,4,2,4,2,4,6,2}; static void calc_primes(char *bitmap, uint64_t max) { uint64_t a, r; int a_i, b_i;for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << ((a % 30) * 8 / 30)) continue; for (r = a * a, b_i = a_i; r <= max; r += a * diff[b_i++%8]) bitmap[r / 30] |= 1 << ((r % 30) * 8 / 30); }/* mark 1 as not prime */ bitmap[0] |= 0x1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + 1, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + (adj_max % 30) * 8 / 30) + 3); free(bitmap); return EXIT_SUCCESS; }
This section walks through the changes required to combine the different optimisations talked about in part 1.
The final example in this section generates primes to 10^9 in 125ms on my machine, compared to 116ms for primesieve and 79ms for hprime. This shows that these main optimisations are enough to get most of the way there.
The examples that were used to describe efficiently marking of bits for the simple bitmap also apply to the 8/30 wheel bitmap.
It was previously shown that in the simple bitmap every 8th multiple of a prime would mark the same bit, and that they would be a
bytes apart. With the 8/30 wheel this happens every 30th multiple. This can be described in the same way as being because (a * 30) / 30
is a
bytes, and (a * 30) % 30
is 0.
Note that there is no need to mark every 30th multiple for all 30 starting offsets, but only the 8 possible primes described in the bitmap. This is also similar to the simple example where 8 multiples were marked at a time. The main difference is that the difference between the multiples is not 1 (and is not constant).
Marking every 30th multiple for the 8 starting possible primes
This follows the example Mark all offsets at once from the simple bitmap but for the 8/30 wheel.
The 8 offsets bytes and bitmaps depend on the value that the bit at i
represents. This is looked up using an array bval (bit value) bval[] = {1,7,11,13,17,19,23,29}
.
The function num2bit()
is also introduced to perform (num % 30) * 8 / 30
Also, the offsets and bitmasks are now explicitly stored in an array.
for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i % 8)) continue; for (i = 0; i < 8; i++) { offsets[i] = a * bval[i] / 30; masks[i] = 1 << num2bit(a * bval[i]); } for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= masks[i]; }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> const char diff[] = {6,4,2,4,2,4,6,2}; const char bval[] = {1,7,11,13,17,19,23,29}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; } static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end = bitmap + max / 30 + 1; uint64_t a; int i, a_i, offsets[8]; char masks[8];for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i % 8)) continue; for (i = 0; i < 8; i++) { offsets[i] = a * bval[i] / 30; masks[i] = 1 << num2bit(a * bval[i]); } for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= masks[i]; }/* Fix first byte which is clobbered above */ bitmap[0] = 0x1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(1, max/30 + sqrtl(max) + 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
Forcing the bitmask order to allow compiled in bitmasks
Each prime in the 8/30 wheel shares no prime factors with 30 (2,3,5). This means that with 30 consecutive multiples, no a % 30
value will be the same.
This is used below to store the 8 offsets in an array according to the bit position of the result. This means the bit positions will be marked in the same order for every prime and can be encoded directly in the instructions. This will still use 8 registers for holding the 8 offsets, but that is not quite as critical as using 16 registers.
Note that this exact solution couldn't be done with the simple bitmap because of the prime 2, which divides evenly in to 8 meaning not all 8 bit positions are visited.
See the inner loop in assembly here
for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; for (i = 0; i < 8; i++) offsets[num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= 1 << i; }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> const char diff[] = {6,4,2,4,2,4,6,2}; const char bval[] = {1,7,11,13,17,19,23,29}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; } static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end = bitmap + max / 30 + 1; uint64_t a; int i, a_i, offsets[8];for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; for (i = 0; i < 8; i++) offsets[num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= 1 << i; }/* Fix first byte which is clobbered above */ bitmap[0] = 0x1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + sqrtl(max) + 1, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
Adding in calculations for lower primes on the 8/30 wheel
As shown above, the pattern still repeats every a
bytes on the 8/30 wheel. This means there is very little that needs changing in this section. One main point is that there are now fewer lower primes since the primes 2,3,5
do not need to be marked at all.
Adding in calculations for lower primes on the 8/30 wheel
This follows the example Repeat patterns for lower primes in part 1. The main change is incorporating the 8/30 wheel.
This optimisation is combined with the example above.
for (i = 1; i < 4; i++) { bzero(pattern, sizeof pattern); for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8]) pattern[r / 30] |= 1 << num2bit(r); for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i] * sizeof(uint64_t)) for (j = 0; j < bval[i]; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j); }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <string.h> #include <inttypes.h> const char diff[] = {6,4,2,4,2,4,6,2}; const char bval[] = {1,7,11,13,17,19,23,29}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; } static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end = bitmap + max / 30 + 1, pattern[16 * 8]; uint64_t a; int i, j, r, a_i, offsets[8];for (i = 1; i < 4; i++) { bzero(pattern, sizeof pattern); for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8]) pattern[r / 30] |= 1 << num2bit(r); for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i] * sizeof(uint64_t)) for (j = 0; j < bval[i]; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern + j); }for (a = 17, a_i = 4; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; for (i = 0; i < 8; i++) offsets[num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= 1 << i; } /* Fix first byte which is clobbered above */ bitmap[0] = 0x1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + sqrtl(max) + 16*8, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
Calculating a block at a time
Finally all optimisations are added together. This is done in 3 parts.
The final example counts primes to 10^9 in 125ms on my i7 5820k @ 4.4GHz. This is only a fraction behind primesieve-5.2 at 116ms.
Block at a time for the 8/30 wheel
This follows the example Processing a block at a time in part 1, while incorporating the 8/30 wheel with with all offsets being marked
for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) { for (a = 7, a_i = 1; a <= sqrtl(cur + 32*1024*30); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; for (i = 0; i < 8; i++) offsets[num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + MAX(a * (a / 30), (cur / 30) / a * a); bmp < bitmap_end; bmp += a) { for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= 1 << i; } } /* First byte gets clobbered for the first block */ if (cur == 0) bitmap[0] = 0x1; }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> #define MAX(A,B) ((A) > (B) ? (A) : (B)) const char diff[] = {6,4,2,4,2,4,6,2}; const char bval[] = {1,7,11,13,17,19,23,29}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; } static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end = bitmap + 32*1024; uint64_t a, cur; int i, a_i, offsets[8];for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) { for (a = 7, a_i = 1; a <= sqrtl(cur + 32*1024*30); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; for (i = 0; i < 8; i++) offsets[num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + MAX(a * (a / 30), (cur / 30) / a * a); bmp < bitmap_end; bmp += a) { for (i = 0; i < 8; i++) *(bmp + offsets[i]) |= 1 << i; } } /* First byte gets clobbered for the first block */ if (cur == 0) bitmap[0] = 0x1; }} static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + 2*32*1024 + sqrtl(max) + 1, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
Pre-processing sieving primes and storing calculations
This follows the example Storing state between blocks in part 1, while incorporating the 8/30 wheel with with all offsets being marked
/* * Find sieving primes */ bitmap_end = bitmap + (int64_t)sqrtl(max) + 1; for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; as[cnt] = a; for (i = 0; i < 8; i++) offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offs[cnt*8 + i]) |= 1 << i; bmps[cnt++] = (bmp - bitmap); } /* * Calculate blocks */ bitmap_end = bitmap + 32*1024; for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) { for (i = 0; i < cnt; i++) for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i]) for (j = 0; j < 8; j++) *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j; }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> #define MAX(A,B) ((A) > (B) ? (A) : (B)) const char diff[] = {6,4,2,4,2,4,6,2}; const char bval[] = {1,7,11,13,17,19,23,29}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; } static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end; uint64_t a, cur; uint32_t *as = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)), *bmps = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)), *offs = malloc(sizeof(uint32_t) * 8 * (sqrtl(max) / 4 + 1000)); int i, j, a_i, cnt = 0;/* * Find sieving primes */ bitmap_end = bitmap + (int64_t)sqrtl(max) + 1; for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; as[cnt] = a; for (i = 0; i < 8; i++) offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offs[cnt*8 + i]) |= 1 << i; bmps[cnt++] = (bmp - bitmap); } /* * Calculate blocks */ bitmap_end = bitmap + 32*1024; for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) { for (i = 0; i < cnt; i++) for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i]) for (j = 0; j < 8; j++) *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j; }/* First byte gets clobbered for the first block */ bitmap[0] = 0x1; free(as); free(bmps); free(offs); } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + 2*32*1024 + sqrtl(max) + 1, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
Also marking lower primes
This adds in the lower prime optimisations
/* Init lower primes */ for (i = 1; i < 4; i++) for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8]) pattern[i-1][r / 30] |= 1 << num2bit(r); /* * Find sieving primes */ bitmap_end = bitmap + (int64_t)sqrtl(max) + 1; for (i = 1; i < 4; i++) for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i]*8) for (j = 0; j < bval[i]; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j); for (a = 17, a_i = 4; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; as[cnt] = a; for (i = 0; i < 8; i++) offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offs[cnt*8 + i]) |= 1 << i; bmps[cnt++] = (bmp - bitmap); } /* * Calculate blocks */ bitmap_end = bitmap + 32*1024; for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) { for (i = 1; i < 4; i++) for (bmp = bitmap + cur/30/(8*bval[i])*(bval[i]*8); bmp < bitmap_end; bmp += bval[i]*8) for (j = 0; j < bval[i]; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j); for (i = 0; i < cnt; i++) for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i]) for (j = 0; j < 8; j++) *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j; }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> #define MAX(A,B) ((A) > (B) ? (A) : (B)) const char diff[] = {6,4,2,4,2,4,6,2}; const char bval[] = {1,7,11,13,17,19,23,29}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; } static void calc_primes(char *bitmap, uint64_t max) { char *bmp, *bitmap_end, pattern[3][16*8] = {{0}}; uint64_t a, cur; uint32_t *as = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)), *bmps = malloc(sizeof(uint32_t) * (sqrtl(max) / 4 + 1000)), *offs = malloc(sizeof(uint32_t) * 8 * (sqrtl(max) / 4 + 1000)); int i, j, r, a_i, cnt = 0;/* Init lower primes */ for (i = 1; i < 4; i++) for (r = bval[i], a_i = 0; r < bval[i] * 30 * 8; r += bval[i] * diff[a_i++ % 8]) pattern[i-1][r / 30] |= 1 << num2bit(r); /* * Find sieving primes */ bitmap_end = bitmap + (int64_t)sqrtl(max) + 1; for (i = 1; i < 4; i++) for (bmp = bitmap; bmp < bitmap_end; bmp += bval[i]*8) for (j = 0; j < bval[i]; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j); for (a = 17, a_i = 4; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; as[cnt] = a; for (i = 0; i < 8; i++) offs[cnt*8 + num2bit(a * bval[i])] = a * bval[i] / 30; for (bmp = bitmap + (uint64_t)a * (a / 30); bmp < bitmap_end; bmp += a) for (i = 0; i < 8; i++) *(bmp + offs[cnt*8 + i]) |= 1 << i; bmps[cnt++] = (bmp - bitmap); } /* * Calculate blocks */ bitmap_end = bitmap + 32*1024; for (cur = 0; cur <= max; cur += 32*1024*30, bitmap_end += 32*1024) { for (i = 1; i < 4; i++) for (bmp = bitmap + cur/30/(8*bval[i])*(bval[i]*8); bmp < bitmap_end; bmp += bval[i]*8) for (j = 0; j < bval[i]; j++) *((uint64_t *)bmp + j) |= *((uint64_t *)pattern[i-1] + j); for (i = 0; i < cnt; i++) for (; bmps[i] < cur/30 + 32*1024; bmps[i] += as[i]) for (j = 0; j < 8; j++) *(bitmap + bmps[i] + offs[i*8 + j]) |= 1 << j; }/* First byte gets clobbered for the first block */ bitmap[0] = 0x1; free(as); free(bmps); free(offs); } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + 2*32*1024 + sqrtl(max) + 1, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
The examples were all built with gcc (or gcc-5) on ubuntu using the following command:
gcc-5 -g -Wall -march=native -O3 test_xxx.c -lm -o test-xxx
The times given are for my machine i7-5820K at 4.4GHz
Ubuntu had trouble ramping the cpu up fast enough so I turned all cpus to 'performance' using cpu-freq
Things that didn't really fit elsewhere
This section is a scratchpad for things I think of that are interesting but that I can't fit elsewhere. It is not expected to make sense outside of the context of the main parts (read the parts first).Some short interesting points are:
If the sieving primes are grouped by powers of 2, the workload from each group is approximately the same. If the prime doubles, then half the amount of bits need to be marked off in the bitmap. However, the number of primes falling within that power of 2 will approximately be double.
Why do benchmarks against 10^9
The simple answer is that 10^8 is too quick, and 10^10 is too slow
Another answer is that the sqrt(10^9) is 31622, which is close to the block size used for the inner block. For sieving primes larger than this there is not a full repeat pattern within the one block, which means a different approach is needed.
As an aside, for the prime 2 the pattern repeats every byte, and not just every 2 bytes. This is because 2 is a prime factor of 8 (the only prime factor) and divides evenly. Every other prime (in fact every odd number) shares no prime factors with 8 and so the pattern will only repeat every a
bytes. This also applies to all other register widths that are used, as they are all powers of 2. So for example, the pattern when aligned to an int boundary will repeat every a
ints (and no less, except for the prime 2). This means there are no further shortcuts (ie smaller patterns) to reduce the work for the lower primes.
Comparing the assembly of test_004.c and test_005.c (and now test_013.c)
The assembly for test_004.c shows many registers being used in the inside loop:
400a50: 80 08 01 or BYTE PTR [rax],0x1 400a53: 0f b6 54 24 20 movzx edx,BYTE PTR [rsp+0x20] 400a58: 44 08 3c 38 or BYTE PTR [rax+rdi*1],r15b 400a5c: 46 08 34 18 or BYTE PTR [rax+r11*1],r14b 400a60: 08 14 30 or BYTE PTR [rax+rsi*1],dl 400a63: 48 8b 54 24 10 mov rdx,QWORD PTR [rsp+0x10] 400a68: 46 08 2c 10 or BYTE PTR [rax+r10*1],r13b 400a6c: 44 08 24 10 or BYTE PTR [rax+rdx*1],r12b 400a70: 42 08 2c 08 or BYTE PTR [rax+r9*1],bpl 400a74: 42 08 0c 00 or BYTE PTR [rax+r8*1],cl 400a78: 48 01 d8 add rax,rbx 400a7b: 48 39 44 24 08 cmp QWORD PTR [rsp+0x8],rax 400a80: 77 ce ja 400a50
By comparison, the assembly for test_005.c shows fewer registers being used in the inside loop as the offsets and bitmasks are encoded directly in the machine instructions. There are now 8 different inside loop sections. This example is for the case where (a % 8) == 1
:
400c90: 80 0a 01 or BYTE PTR [rdx],0x1 400c93: 80 09 02 or BYTE PTR [rcx],0x2 400c96: 80 0c 42 04 or BYTE PTR [rdx+rax*2],0x4 400c9a: 80 0c 41 08 or BYTE PTR [rcx+rax*2],0x8 400c9e: 80 0c 82 10 or BYTE PTR [rdx+rax*4],0x10 400ca2: 80 0c 81 20 or BYTE PTR [rcx+rax*4],0x20 400ca6: 4c 01 d1 add rcx,r10 400ca9: 80 0c 32 40 or BYTE PTR [rdx+rsi*1],0x40 400cad: 42 80 0c 02 80 or BYTE PTR [rdx+r8*1],0x80 400cb2: 4c 01 ca add rdx,r9 400cb5: 48 39 d5 cmp rbp,rdx 400cb8: 73 d6 jae 400c90
test_013.c (for the 8/30 wheel) shows the use of registers for the relative offsets but immediate bitmasks.
400aa0: 42 80 0c 18 01 or BYTE PTR [rax+r11*1],0x1 400aa5: 42 80 0c 10 02 or BYTE PTR [rax+r10*1],0x2 400aaa: 42 80 0c 08 04 or BYTE PTR [rax+r9*1],0x4 400aaf: 42 80 0c 00 08 or BYTE PTR [rax+r8*1],0x8 400ab4: 80 0c 38 10 or BYTE PTR [rax+rdi*1],0x10 400ab8: 80 0c 30 20 or BYTE PTR [rax+rsi*1],0x20 400abc: 80 0c 08 40 or BYTE PTR [rax+rcx*1],0x40 400ac0: 80 0c 10 80 or BYTE PTR [rax+rdx*1],0x80 400ac4: 48 01 d8 add rax,rbx 400ac7: 49 39 c5 cmp r13,rax 400aca: 77 d4 ja 400aa0
Another way of viewing the Modular Arithmetic
An understanding of the terms in the last line will help to see what kinds of optimisations are available when marking off multiples.
x = xb*8 + xi where xb = x / 8 (or ⌊x / 8⌋ ie int division) xi = x % 8 r = a * b = (ab*8 + ai) * (bb*8 + bi) = a * bb * 8 + ab * bi * 8 + ai * bi
The terms in the last line show:
bb
without changing bi
only affects the first term. This will move the result forward by a
bytes.
8
.
ab
and bb
- There are only 8 possible values each for ai
and bi
.
The cachegrind results of primesieve and hprime to 10^9 were interesting.
Firstly it can really be seen that primesieve concentrates on reducing cache misses. The D1 misses are just 2%. Numbers wise, hprime has nearly 3 times as many D1 cache misses. One thing though is that this is only for reads and hprime actually has fewer write misses. Also, hprime has fewer references over all.
With branch mispredictions, primesieve has nearly 3 times as many branches and over 5 times as many mispredictions.
-- primesieve -- I refs: 736,461,590 I1 misses: 2,775 LLi misses: 2,455 I1 miss rate: 0.00% LLi miss rate: 0.00% D refs: 403,559,909 (366,095,043 rd + 37,464,866 wr) D1 misses: 8,086,489 ( 7,491,671 rd + 594,818 wr) LLd misses: 16,180 ( 8,440 rd + 7,740 wr) D1 miss rate: 2.0% ( 2.0% + 1.6% ) LLd miss rate: 0.0% ( 0.0% + 0.0% ) LL refs: 8,089,264 ( 7,494,446 rd + 594,818 wr) LL misses: 18,635 ( 10,895 rd + 7,740 wr) LL miss rate: 0.0% ( 0.0% + 0.0% ) Branches: 101,180,663 ( 99,395,683 cond + 1,784,980 ind) Mispredicts: 5,737,876 ( 3,971,367 cond + 1,766,509 ind) Mispred rate: 5.7% ( 4.0% + 99.0% ) -- hprime -- I refs: 594,838,670 I1 misses: 2,055 LLi misses: 1,918 I1 miss rate: 0.00% LLi miss rate: 0.00% D refs: 322,949,133 (298,099,527 rd + 24,849,606 wr) D1 misses: 21,850,722 ( 21,640,269 rd + 210,453 wr) LLd misses: 4,368 ( 2,690 rd + 1,678 wr) D1 miss rate: 6.8% ( 7.3% + 0.8% ) LLd miss rate: 0.0% ( 0.0% + 0.0% ) LL refs: 21,852,777 ( 21,642,324 rd + 210,453 wr) LL misses: 6,286 ( 4,608 rd + 1,678 wr) LL miss rate: 0.0% ( 0.0% + 0.0% ) Branches: 37,871,700 ( 37,677,506 cond + 194,194 ind) Mispredicts: 993,423 ( 989,096 cond + 4,327 ind) Mispred rate: 2.6% ( 2.6% + 2.2% )
I decided these were too complex to make it in
There are 8 possible paths when marking every a
bits
After marking the first r
at a given byte and bit, the byte and bit of the next multiple is calculated by moving forward a
bits. This can be broken in to two parts, moving forward a / 8
bytes, followed by a % 8
bits. When adding the final bits this will either overflow in to the next byte ( + 1
), or remain the same. This possible overflow byte and the final bit position depends entirely on the a % 8
value and the previously bit offset r % 8
.
As shown above, after 8 multiples the same bit position will occur again, which completes a pattern. Each prime can be made to start marking multiples on the 0th bit by starting from r = a * (a / 8)
, and so can start from a known place. This means that each of the 8 possible a % 8
values has a set pattern of 8 bitmask and overflow byte results that can be used to set the next 8 multiples.
The code below chooses the path based on the value of a % 8
. This is done in a way that allows the compiler to separately optimise each path. Note that all of the i * j
terms in the mark_multiples
function will be resolved at compile time.
One main advantage of this example is that it can encode the overflows and bit masks within the machine instructions themselves, instead of storing the information in registers (see here).
Note: This now involves a choice/branch for every prime that is encountered. This choice can be reduced by processing primes with the same a % 8
value at once. This would also reduce the work to extract the prime from the bitmap. However that part of the optimisation is not shown here.
static inline void __attribute__((always_inline)) mark_multiples(char *bitmap, uint64_t max, uint64_t a, int j) { char *bmp; int i; for (bmp = bitmap + a * (a / 8); bmp <= bitmap + max/8; ) for (i = 0; i < 8; i++, bmp += a/8 + ((i * j / 8) - (i-1) * j / 8)) *bmp |= 1 << ((i * j) % 8); } ... for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; switch (a % 8) { case 0: mark_multiples(bitmap, max, a, 0); break; case 1: mark_multiples(bitmap, max, a, 1); break; case 2: mark_multiples(bitmap, max, a, 2); break; case 3: mark_multiples(bitmap, max, a, 3); break; case 4: mark_multiples(bitmap, max, a, 4); break; case 5: mark_multiples(bitmap, max, a, 5); break; case 6: mark_multiples(bitmap, max, a, 6); break; case 7: mark_multiples(bitmap, max, a, 7); break; } }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h>static inline void __attribute__((always_inline)) mark_multiples(char *bitmap, uint64_t max, uint64_t a, int j) { char *bmp; int i; for (bmp = bitmap + a * (a / 8); bmp <= bitmap + max/8; ) for (i = 0; i < 8; i++, bmp += a/8 + ((i * j / 8) - (i-1) * j / 8)) *bmp |= 1 << ((i * j) % 8); }static void calc_primes(char *bitmap, uint64_t max) { uint64_t a;for (a = 2; a <= sqrtl(max); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; switch (a % 8) { case 0: mark_multiples(bitmap, max, a, 0); break; case 1: mark_multiples(bitmap, max, a, 1); break; case 2: mark_multiples(bitmap, max, a, 2); break; case 3: mark_multiples(bitmap, max, a, 3); break; case 4: mark_multiples(bitmap, max, a, 4); break; case 5: mark_multiples(bitmap, max, a, 5); break; case 6: mark_multiples(bitmap, max, a, 6); break; case 7: mark_multiples(bitmap, max, a, 7); break; } }/* Reset the first byte as it is clobbered by the above algorithm*/ bitmap[0] = 0x53; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/8 + sqrtl(max) + 1, 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }
Choosing a path with the 8/30 wheel
This follows the example Choosing a path from the simple bitmap but for the 8/30 wheel.
In the simple example the next multiple was shown to be a / 8
bytes apart with a possible single byte adjustment.
In this example it is not the next multiple that needs to be marked but the multiple associated with the next possible prime, so the base increment is a / 30 * diff[i]
, and the number of overflow bytes can be up to diff[i]
bytes.
The assembly shows that this is not optimised as well as in the simple example, and potentially not as well as the example above. However it will later be shown that this mechanism is still useful moving forwards.
Once again note that the (iv*jv) terms can all be resolved at compile time and is what the compiler can use to optimise the different branches.
static inline void __attribute__((always_inline)) mark_multiples (char *bitmap, char *bitmap_end, uint64_t a, int jv) { char *bmp; int i, iv; bmp = bitmap + a * (a/30) + a/30; while (bmp < bitmap_end) { for (iv = 1, i = 0; i < 8; iv += diff[i++]) { *bmp |= 1 << num2bit(iv * jv); bmp += a/30 * diff[i] + (((iv + diff[i]) * jv) / 30 - (iv * jv) / 30); } } } ... for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; switch (a_i % 8) { case 0 : mark_multiples(bitmap, bitmap_end, a, 1); break; case 1 : mark_multiples(bitmap, bitmap_end, a, 7); break; case 2 : mark_multiples(bitmap, bitmap_end, a, 11); break; case 3 : mark_multiples(bitmap, bitmap_end, a, 13); break; case 4 : mark_multiples(bitmap, bitmap_end, a, 17); break; case 5 : mark_multiples(bitmap, bitmap_end, a, 19); break; case 6 : mark_multiples(bitmap, bitmap_end, a, 23); break; case 7 : mark_multiples(bitmap, bitmap_end, a, 29); break; } }
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <inttypes.h> const char diff[] = {6,4,2,4,2,4,6,2}; static inline int num2bit(uint64_t num) { return (num % 30) * 8 / 30; }static inline void __attribute__((always_inline)) mark_multiples (char *bitmap, char *bitmap_end, uint64_t a, int jv) { char *bmp; int i, iv; bmp = bitmap + a * (a/30) + a/30; while (bmp < bitmap_end) { for (iv = 1, i = 0; i < 8; iv += diff[i++]) { *bmp |= 1 << num2bit(iv * jv); bmp += a/30 * diff[i] + (((iv + diff[i]) * jv) / 30 - (iv * jv) / 30); } } }static void calc_primes(char *bitmap, uint64_t max) { char *bitmap_end = bitmap + max/30 + 1; uint64_t a; int a_i;for (a = 7, a_i = 1; a <= sqrtl(max); a += diff[a_i++%8]) { if (bitmap[a / 30] & 1 << (a_i%8)) continue; switch (a_i % 8) { case 0 : mark_multiples(bitmap, bitmap_end, a, 1); break; case 1 : mark_multiples(bitmap, bitmap_end, a, 7); break; case 2 : mark_multiples(bitmap, bitmap_end, a, 11); break; case 3 : mark_multiples(bitmap, bitmap_end, a, 13); break; case 4 : mark_multiples(bitmap, bitmap_end, a, 17); break; case 5 : mark_multiples(bitmap, bitmap_end, a, 19); break; case 6 : mark_multiples(bitmap, bitmap_end, a, 23); break; case 7 : mark_multiples(bitmap, bitmap_end, a, 29); break; } }/* Reset first byte clobbered above */ bitmap[0] = 0x1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = calloc(max/30 + sqrtl(max) + 1, 1); uint64_t adj_max; int a_i; /* adjust max to a multiple of 2,3,5 */ for (adj_max = (max - 1) / 30 * 30 + 1, a_i = 0; adj_max + diff[a_i%8] <= max; adj_max += diff[a_i++%8]) ; calc_primes(bitmap, adj_max); printf("%ld\n", count_primes(bitmap, adj_max / 30 * 8 + num2bit(adj_max)) + 3); free(bitmap); return EXIT_SUCCESS; }
Introducing abstractions and using the same block of memory each time
As explained in the Simple sieve of Eratosthenes section, there is a distinction between generating the primes and using the generated primes.
The block-at-a-time concept can also be applied to the "use the bitmap" function, which here is counting primes.
One reason to do this is that if the primes are both generated and consumed in blocks, there is no need to keep anything other than the current block in memory. The example below only allocates 32*1024
bytes for the bitmap.
To faciliate the caller counting primes in blocks some state is kept, which complicates the example. However, the idea is to present a clear concept of generating and then consuming a block at a time.
To reduce the changes and keep the general algorithm the same, re-using the block is achieved by moving the base pointer backwards so the new offsets now fall inside the same block. This keeps the algorithm the same, however is arguably not the best approach moving forwards.
for (i = 0; i < ctx->cnt; i++) for (; ctx->rs[i] < ctx->next; ctx->rs[i] += ctx->as[i]) bitmap[ctx->rs[i] / 8] |= 1 << (ctx->rs[i] % 8); for (a = MAX(ctx->cur,ctx->as[ctx->cnt-1]); a < MIN(sqrtl(ctx->max), ctx->cur + 32*1024*8); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; ctx->as[ctx->cnt] = a; ctx->rs[ctx->cnt] = a * a; ctx->cnt++; } ... init_ctx(&ctx, max); while (calc_block(&ctx)) num_primes += count_primes(ctx.bitmap, MIN(32*1024*8 - 1, max - ctx.cur));
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <inttypes.h> #include <math.h> #define MAX(A,B) ((A) > (B) ? (A) : (B)) #define MIN(A,B) ((A) < (B) ? (A) : (B)) #define CEIL_TO(A, B) (((A) + (B) - 1) / (B) * (B)) struct ctx { char *bitmap; uint64_t max; uint64_t cur; uint64_t next; uint32_t *as; uint64_t *rs; int cnt; }; static void init_ctx (struct ctx *ctx, uint64_t max) { uint32_t a; uint32_t r; ctx->bitmap = calloc(1, 32*1024 ); ctx->as = malloc(sizeof (uint32_t) * (sqrtl(max) / 4 + 1000)); ctx->rs = malloc(sizeof (uint64_t) * (sqrtl(max) / 4 + 1000)); ctx->max = max; ctx->cur = 0; ctx->next = 0; ctx->cnt = 0; /* Bootstrap primes for the first block */ for (a = 2; a <= sqrtl(32*1024*8); a++) { if (ctx->bitmap[a / 8] & 1 << (a % 8)) continue; for (r = a * a; r < sqrtl(32*1024*8); r += a) ctx->bitmap[r / 8] |= 1 << (r % 8); ctx->as[ctx->cnt] = a; ctx->rs[ctx->cnt] = r; ctx->cnt++; } ctx->bitmap[0] |= 0x3; } static int calc_block(struct ctx *ctx) { int i; char *bitmap; uint32_t a; ctx->cur = ctx->next; ctx->next += 32*1024*8; if (ctx->cur > ctx->max) return 0; if (ctx->cur != 0) bzero(ctx->bitmap, 32*1024); bitmap = ctx->bitmap - ctx->cur/8;for (i = 0; i < ctx->cnt; i++) for (; ctx->rs[i] < ctx->next; ctx->rs[i] += ctx->as[i]) bitmap[ctx->rs[i] / 8] |= 1 << (ctx->rs[i] % 8); for (a = MAX(ctx->cur,ctx->as[ctx->cnt-1]); a < MIN(sqrtl(ctx->max), ctx->cur + 32*1024*8); a++) { if (bitmap[a / 8] & 1 << (a % 8)) continue; ctx->as[ctx->cnt] = a; ctx->rs[ctx->cnt] = a * a; ctx->cnt++; }return 1; } static uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ul; uint64_t num_primes = 0ul; struct ctx ctx;init_ctx(&ctx, max); while (calc_block(&ctx)) num_primes += count_primes(ctx.bitmap, MIN(32*1024*8 - 1, max - ctx.cur));printf("%ld\n", num_primes); return EXIT_SUCCESS; }
From Applying patterns for smaller primes section, the idea is to apply a registers worth of bits at a time.
Even with very nice vectorisable code, gcc didn't seem to want to do what I expected until I forced it to using vectors.
Note that the peformance isn't really measurably for this example because of the memory bottleneck, and I don't actually know if what I imagined is better or if gcc's way was better for this example.
However, to show what I expected, the below example produces this assembly for 7
:
Note: I deleted single lines with : ff
presuming they were from the previous instruction??
400ba0: c5 c5 eb 00 vpor ymm0,ymm7,YMMWORD PTR [rax] 400ba4: 48 05 e0 00 00 00 add rax,0xe0 400baa: c5 fd 7f 80 20 ff ff vmovdqa YMMWORD PTR [rax-0xe0],ymm0 400bb2: c5 cd eb 80 40 ff ff vpor ymm0,ymm6,YMMWORD PTR [rax-0xc0] 400bba: c5 fd 7f 80 40 ff ff vmovdqa YMMWORD PTR [rax-0xc0],ymm0 400bc2: c5 d5 eb 80 60 ff ff vpor ymm0,ymm5,YMMWORD PTR [rax-0xa0] 400bca: c5 fd 7f 80 60 ff ff vmovdqa YMMWORD PTR [rax-0xa0],ymm0 400bd2: c5 dd eb 40 80 vpor ymm0,ymm4,YMMWORD PTR [rax-0x80] 400bd7: c5 fd 7f 40 80 vmovdqa YMMWORD PTR [rax-0x80],ymm0 400bdc: c5 e5 eb 40 a0 vpor ymm0,ymm3,YMMWORD PTR [rax-0x60] 400be1: c5 fd 7f 40 a0 vmovdqa YMMWORD PTR [rax-0x60],ymm0 400be6: c5 ed eb 40 c0 vpor ymm0,ymm2,YMMWORD PTR [rax-0x40] 400beb: c5 fd 7f 40 c0 vmovdqa YMMWORD PTR [rax-0x40],ymm0 400bf0: c5 f5 eb 40 e0 vpor ymm0,ymm1,YMMWORD PTR [rax-0x20] 400bf5: c5 fd 7f 40 e0 vmovdqa YMMWORD PTR [rax-0x20],ymm0 400bfa: 48 39 c2 cmp rdx,rax 400bfd: 75 a1 jne 400ba0
typedef uint64_t v32_4si __attribute__((vector_size(32))); static inline void mark_low_prime (char *bitmap, uint64_t max, int a) { int r, n; char *pattern = aligned_alloc(32, a*32); register v32_4si v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, va, vb, vc; v32_4si *out = (v32_4si *)bitmap; bzero(pattern, a*32); for (r = 0; r < a*32*8; r += a) pattern[r/8] |= 1ul << r%8; switch (a) { case 13: vc = *(v32_4si *)(pattern + 32*12); vb = *(v32_4si *)(pattern + 32*11); case 11: va = *(v32_4si *)(pattern + 32*10); v9 = *(v32_4si *)(pattern + 32*9); v8 = *(v32_4si *)(pattern + 32*8); v7 = *(v32_4si *)(pattern + 32*7); case 7: v6 = *(v32_4si *)(pattern + 32*6); v5 = *(v32_4si *)(pattern + 32*5); case 5: v4 = *(v32_4si *)(pattern + 32*4); v3 = *(v32_4si *)(pattern + 32*3); case 3: v2 = *(v32_4si *)(pattern + 32*2); case 2: v1 = *(v32_4si *)(pattern + 32*1); v0 = *(v32_4si *)(pattern + 32*0); } n = max/8/32/a + 1; while (n--) { *out++ |= v0; *out++ |= v1; if (a == 2) continue; *out++ |= v2; if (a == 3) continue; *out++ |= v3; *out++ |= v4; if (a == 5) continue; *out++ |= v5; *out++ |= v6; if (a == 7) continue; *out++ |= v7; *out++ |= v8; *out++ |= v9; *out++ |= va; if (a == 11) continue; *out++ |= vb; *out++ |= vc; } } ... mark_low_prime (bitmap, max, 2); mark_low_prime (bitmap, max, 3); mark_low_prime (bitmap, max, 5); mark_low_prime (bitmap, max, 7); mark_low_prime (bitmap, max, 11); mark_low_prime (bitmap, max, 13);
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <string.h> #include <inttypes.h>typedef uint64_t v32_4si __attribute__((vector_size(32))); static inline void mark_low_prime (char *bitmap, uint64_t max, int a) { int r, n; char *pattern = aligned_alloc(32, a*32); register v32_4si v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, va, vb, vc; v32_4si *out = (v32_4si *)bitmap; bzero(pattern, a*32); for (r = 0; r < a*32*8; r += a) pattern[r/8] |= 1ul << r%8; switch (a) { case 13: vc = *(v32_4si *)(pattern + 32*12); vb = *(v32_4si *)(pattern + 32*11); case 11: va = *(v32_4si *)(pattern + 32*10); v9 = *(v32_4si *)(pattern + 32*9); v8 = *(v32_4si *)(pattern + 32*8); v7 = *(v32_4si *)(pattern + 32*7); case 7: v6 = *(v32_4si *)(pattern + 32*6); v5 = *(v32_4si *)(pattern + 32*5); case 5: v4 = *(v32_4si *)(pattern + 32*4); v3 = *(v32_4si *)(pattern + 32*3); case 3: v2 = *(v32_4si *)(pattern + 32*2); case 2: v1 = *(v32_4si *)(pattern + 32*1); v0 = *(v32_4si *)(pattern + 32*0); } n = max/8/32/a + 1; while (n--) { *out++ |= v0; *out++ |= v1; if (a == 2) continue; *out++ |= v2; if (a == 3) continue; *out++ |= v3; *out++ |= v4; if (a == 5) continue; *out++ |= v5; *out++ |= v6; if (a == 7) continue; *out++ |= v7; *out++ |= v8; *out++ |= v9; *out++ |= va; if (a == 11) continue; *out++ |= vb; *out++ |= vc; } }void calc_primes(char *bitmap, uint64_t max) { uint64_t a, r;mark_low_prime (bitmap, max, 2); mark_low_prime (bitmap, max, 3); mark_low_prime (bitmap, max, 5); mark_low_prime (bitmap, max, 7); mark_low_prime (bitmap, max, 11); mark_low_prime (bitmap, max, 13);for (a = 17; a < sqrtl(max); a++) { if (bitmap[a/8] & 1 << (a%8)) continue; for (r = a * a; r <= max; r += a) bitmap[r/8] |= 1 << (r%8); } /* The first 16 bytes is ruined by the pattern setting */ bitmap[0] = 0x53; bitmap[1] = 0xD7; } uint64_t count_primes(char *bitmap, uint64_t max_bit) { uint64_t *p = (uint64_t *)bitmap, c = max_bit / 64, count = max_bit + 1; while (c--) count -= __builtin_popcountll(*p++); count -= __builtin_popcountll(*p & (~0ul >> (63 - (max_bit % 64)))); return count; } int main(int argc, char *argv[]) { uint64_t max = argc > 1 ? atol(argv[1]) : 1000000000ull; char *bitmap = aligned_alloc(32, max/8 + 32*16*2 + 1); bzero(bitmap, max/8 + 32*16*2 + 1); calc_primes(bitmap, max); printf("%ld\n", count_primes(bitmap, max)); free(bitmap); return EXIT_SUCCESS; }