How to fill memory fast with a `int32_t` value?

Is there a function (SSEx intrinsics is OK) which will fill the memory with a specified int32_t value? For instance, when this value is equal to 0xAABBCC00 the result memory should look like:


I could use std::fill or simple for-loop, but it is not fast enough.

Resizing of a vector performed only once in the beginning of program, this is not an issue. The bottleneck is filling the memory.

Simplified code:

struct X
  typedef std::vector<int32_t> int_vec_t;
  int_vec_t buffer;

  X() : buffer( 5000000 ) { /* some more action */ }
  ~X() { /* some code here */ }

  // the following function is called 25 times per second
  const int_vec_t& process( int32_t background, const SOME_DATA& data );

const X::int_vec_t& X::process( int32_t background, const SOME_DATA& data )
    // the following one string takes 30% of total time of #process function
    std::fill( buffer.begin(), buffer.end(), background );

    // some processing
    // ...

    return buffer;


This is how I would do it (please excuse the Microsoft-ness of it):

VOID FillInt32(__out PLONG M, __in LONG Fill, __in ULONG Count)
    __m128i f;

    // Fix mis-alignment.
    if ((ULONG_PTR)M & 0xf)
        switch ((ULONG_PTR)M & 0xf)
            case 0x4: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0x8: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0xc: if (Count >= 1) { *M++ = Fill; Count--; }

    f.m128i_i32[0] = Fill;
    f.m128i_i32[1] = Fill;
    f.m128i_i32[2] = Fill;
    f.m128i_i32[3] = Fill;

    while (Count >= 4)
        _mm_store_si128((__m128i *)M, f);
        M += 4;
        Count -= 4;

    // Fill remaining LONGs.
    switch (Count & 0x3)
        case 0x3: *M++ = Fill;
        case 0x2: *M++ = Fill;
        case 0x1: *M++ = Fill;

I have to ask: Have you definitely profiled std::fill and shown it to be the performance bottleneck? I would guess it to be implemented in a pretty efficient manner, such that the compiler can automatically generate the appropriate instructions (for example -march on gcc).

If it is the bottleneck, it may still be possible to get better benefit from an algorithmic redesign (if possible) to avoid setting so much memory (apparently over and over) such that it doesn't matter anymore which fill mechanism you use.

Thanks to everyone for your answers. I've checked wj32's solution , but it shows very similar time as std::fill do. My current solution works 4 times faster (in Visual Studio 2008) than std::fill with help of the function memcpy:

 // fill the first quarter by the usual way
 std::fill(buffer.begin(), buffer.begin() + buffer.size()/4, background);
 // copy the first quarter to the second (very fast)
 memcpy(&buffer[buffer.size()/4], &buffer[0], buffer.size()/4*sizeof(background));
 // copy the first half to the second (very fast)
 memcpy(&buffer[buffer.size()/2], &buffer[0], buffer.size()/2*sizeof(background));

In the production code one needs to add check if buffer.size() is divisible by 4 and add appropriate handling for that.

Have you considered using

vector<int32_t> myVector;
myVector.reserve( sizeIWant );

and then use std::fill? Or perhaps the constructor of a std::vector which takes as an argument the number of items held and the value to initialize them at?

Assuming you have a limited amount of values in your background parameter (or even better, only on), maybe you should try to allocate a static vector, and simply use memcpy.

const int32_t sBackground = 1234;
static vector <int32_t> sInitalizedBuffer(n, sBackground);

    const X::int_vec_t& X::process( const SOME_DATA& data )
        // the following one string takes 30% of total time of #process function
        std::memcpy( (void*) data[0], (void*) sInitalizedBuffer[0], n * sizeof(sBackground));

        // some processing
        // ...

        return buffer;

I just tested std::fill with g++ with full optimizations (SSE etc.. enabled):

#include <algorithm>
#include <inttypes.h>

int32_t a[5000000];

int main(int argc,char *argv[])
    return a[3];

and the inner loop looked like:

    movdqa  %xmm0, -16(%eax)
    addl    $16, %eax
    cmpl    %edx, %eax
    jne L2

Looks like 0xAABBCC00 x 4 was loaded into xmm0 and is being moved 16-bytes at a time.

Not totally sure how you set 4 bytes in a row, but if you want to fill memory with just one byte over an over again, you can use memset.

void * memset ( void * ptr, int value, size_t num );

Fill block of memory

Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).

the vs2013 and vs2015 can optimize a plain for-loop to a rep stos instruction. It's the fastest way to fill a buffer. You can specify the std::fill for your type like this:

namespace std {
    inline void fill(vector<int>::iterator first, vector<int>::iterator last, int value){
        for (size_t i = 0; i < last - first; i++)
            first[i] = value;

BTW. To have the compiler do the optimization, the buffer must be accessed by the subscript operator.

It will not work on the gcc and clang. They both will compile the code to a conditional jump loop. It runs as slow as the original std::fill. And though the wchar_t is 32-bit, the wmemset does not have an assemble implement likes the memset. So you have to write assemble code to do the optimization.

It might be a bit non portable but you could use an overlapping memory copy. Fill the first four bytes with the pattern you want and use memcpy().

int32* p = (int32*) malloc( size );
*p = 1234;
memcpy( p + 4, p, size - 4 );

don't think you can get much faster

Need Your Help

Explain the use of a bit vector for determining if all characters are unique

java string bit-manipulation bitvector

I am confused about how a bit vector would work to do this (not too familiar with bit vectors). Here is the code given. Could someone please walk me through this?