sum2bits - variable-precision SWAR algorithm
This post is intend to understand a piece of code of the OSRM project, which also posted in Telenav/open-source-spec/sum2bits-swar.
Checkout Project-OSRM/OSRM-backend, Telenav/osrm-backend and Telenav/open-source-spec for more if you’re interested.
When read the .osrm.names processing related codes, I found a function named sum2bits() that looks a little strange to me. It’s used for summation of 16 2-bit values using SWAR according to comment, but what I can see in codes are a lot of bitwise operations with some magic numbers. It looks no sense to me.
What does it exactly do? Let’s try to understand it step-by-step.
/// Summation of 16 2-bit values using SWAR
inline std::uint32_t sum2bits(std::uint32_t value) const
{
value = (value >> 2 & 0x33333333) + (value & 0x33333333);
value = (value >> 4 & 0x0f0f0f0f) + (value & 0x0f0f0f0f);
value = (value >> 8 & 0x00ff00ff) + (value & 0x00ff00ff);
return (value >> 16 & 0x0000ffff) + (value & 0x0000ffff);
}
Background
When processing .osrm.names, OSRM internally uses a uint32_t
to store 16
seperate values for saving memory. In another word, every 2
-bits represents a independent value in the uint32_t
, and each independent value will be [0,3]
of course.
The 2
-bits value is used to record how many bytes will be sufficient to store a length of char
s array, and OSRM need to get the summarization of them for calcuating something like offset
later.
Here’s a example:
- If we have a
char
arrays
:char *s = "abcedfg"; // length: 7 bytes
; s
’s length is7
, so1
byte should be enough to store the length value7
;- If we use the last
2
-bits ofuint32_t d
to store how many bytes for the length value7
, the2
-bits will be0b01
(means1
byte); - There’ll be a lot of
char
arrays with very different lengths, some of them might bigger than255
(beyond1
byte’s range) but shorter than65536
, then at least2
bytes are necessary to store the length, and we could fill in the2
-bits as0b10
; same thing can be applied on array length range[65536, 2^24]
, the2
-bits have to be0b11
.- The
log256()
implements the idea. Thelog
is mathematics logarithm./// Returns ceiling(log_256(value + 1)) inline std::uint32_t log256(std::uint32_t value) const { BOOST_ASSERT(value < 0x1000000); return value == 0 ? 0 : value < 0x100 ? 1 : value < 0x10000 ? 2 : 3; }
- The
- Once the
uint32_t d
have stored16
independent values, OSRM wants to know totally how many bytes will be necessary for store these16
lengths, that’s why the sum2bits() comes.
What sum2bits achieves
Assume we have 16
arrays with lengths 1,10,20,30,40,50,100,200,250,300,500,1000,5000,10000,50000,100000
, we’ll need 1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3
bytes to store these lengths(refer to below table for the mapping).
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
array length | 1 | 10 | 20 | 30 | 40 | 50 | 100 | 200 | 250 | 300 | 500 | 1000 | 5000 | 10000 | 50000 | 100000 |
bytes to store length | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
If we use uint32_t d
to store the “bytes to store length”, we’ll get
d = 0b01010101010101010110101010101011 // 0x55556AAB
Totally “bytes to store length” of d
is 1+1+1+1+1+1+1+1+1+2+2+2+2+2+2+3=24
.
The sum2bits() is designed to calculate sum2bits(d) = 24
.
The unittests show more examples.
BOOST_CHECK_EQUAL(variable_group_block.sum2bits(0xe4), 6);
BOOST_CHECK_EQUAL(variable_group_block.sum2bits(0x11111111), 8);
BOOST_CHECK_EQUAL(variable_group_block.sum2bits(0x55555555), 16);
BOOST_CHECK_EQUAL(variable_group_block.sum2bits(0xffffffff), 48);
sum2bits_normal
Normally we may implement the sum2bits() as below. It requires loop 16
times to get result, then the performance may be not good enough due to the loops if we have many uint32_t
s to calculate.
The sum2bits() implementation has no loop, looks much more efficient.
std::uint32_t sum2bits_normal(std::uint32_t value)
{
std::uint32_t sum = 0;
for (int i = 0; i < 16; ++i) {
sum += (value >> (i*2)) & 0x3;
}
return sum;
}
How does sum2bits work
Now it’s time to understand the magic.
/// Summation of 16 2-bit values using SWAR
inline std::uint32_t sum2bits(std::uint32_t value) const
{
value = (value >> 2 & 0x33333333) + (value & 0x33333333); // (A)
value = (value >> 4 & 0x0f0f0f0f) + (value & 0x0f0f0f0f); // (B)
value = (value >> 8 & 0x00ff00ff) + (value & 0x00ff00ff); // (C)
return (value >> 16 & 0x0000ffff) + (value & 0x0000ffff); // (D)
}
Line (A)
There’re 4 lines in the sum2bits()
function. Let’s take a look on the line (A)
first.
First of all, the significance of the constant 0x33333333
is that, written using the Java
/GCC
style binary literal notation,
0x33333333 = 0b00110011001100110011001100110011
That is, 8 repeated 0011
block(per 4 bits). Well, let’s see what would happen if the value
is only 1 block(4 bits).
value_before | left(v >> 2 & 0x3 ) |
right(v & 0x3 ) |
value_after(left+right) |
---|---|---|---|
0001 |
0000 |
0001 |
0001 = 1 |
0010 |
0000 |
0010 |
0010 = 2 |
0011 |
0000 |
0011 |
0011 = 3 |
0100 |
0001 |
0000 |
0001 = 1 |
0101 |
0001 |
0001 |
0010 = 2 |
0110 |
0001 |
0010 |
0011 = 3 |
0111 |
0001 |
0011 |
0100 = 4 |
1000 |
0010 |
0000 |
0010 = 2 |
1001 |
0010 |
0001 |
0011 = 3 |
1010 |
0010 |
0010 |
0100 = 4 |
1011 |
0010 |
0011 |
0101 = 5 |
1100 |
0011 |
0000 |
0011 = 3 |
1101 |
0011 |
0001 |
0100 = 4 |
1110 |
0011 |
0010 |
0101 = 5 |
1111 |
0011 |
0011 |
1010 = 6 |
We exactly get sum of the two 2
-bits(e.g. 01+10=11=0011
)! And the result still in the original 4 bits!
Other 7 repeated 4-bits blocks will do the same thing! But please be aware that all the 8 repeated 4-bits blocks calculate at the same time, that’s why this algorithm also known as “parallel”. The algorithm calculates 8 blocks in parallel without any loop.
Line (B),(C),(D)
After Line (A) done, the problem becomes to sum 8 values(each one stores in 4
-bits) instead of orginal 16 values(each 2
-bits).
The significance of the other constants are
0x0f0f0f0f=0b00001111000011110000111100001111
0x00ff00ff=0b00000000111111110000000011111111
0x0000ffff=0b00000000000000001111111111111111
Very similar style, right?
The Line (B),(C),(D)
do exactly the same as the previous line (A)
, except they adds the adjacent 4/8/16
-bit bitcounts together to give the bitcounts of each 8/16/32
-bit block.
Conclusion
The variable-precision SWAR algorithm in this binary case is also knowns as population count, popcount, sideways sum, or bit summation. The concept comes from Hamming weight, and there’re a lot of CPUs provide built-in instructions to do it. This algorithm is the best solution if your processors lacking those features.