Premature optimization is the root of all evil, but...

Started by Kerame Pxel Nume, June 07, 2010, 09:51:31 AM

Kerame Pxel Nume

When developing stuff, one should at least keep things like L1 code cache behaviour in mind:
http://x264dev.multimedia.cx/?p=201

Now take this small C++ snippet, which uses the STL for only two things (a vector and an algorithm), and compile it with
g++ -Os -c $SRCFILE


#include <vector>
#include <algorithm>
#include <iostream>
#include <cstdlib>   // for rand()

struct vertex;

void serialize_vertex(vertex const &);

struct vertex {
public:
        vertex(float x, float y, float z)
        {
                pos[0] = x;
                pos[1] = y;
                pos[2] = z;
        }

        vertex& operator=(vertex const &v)
        {
                for(int i=0; i<3; ++i)
                        pos[i] = v.pos[i];

                return *this;
        }

private:
        float pos[3];

        friend void serialize_vertex(vertex const &);
};

typedef std::vector<vertex> vertex_vector;

void serialize_vertex(vertex const &v)
{
        std::cout << v.pos[0] << ' ' << v.pos[1] << ' ' << v.pos[2] << std::endl;
}

void serialize_vertex_vector(vertex_vector &v)
{
        std::for_each(v.begin(), v.end(), serialize_vertex);
}

int main()
{
        vertex_vector V;
        for(int i=0; i < 50; ++i) {
                V.push_back(vertex(rand(), rand(), rand()));
        }

        serialize_vertex_vector(V);

        return 0;
}


And look at its size. If I compile it with gcc-4.3.4 for x86_64, it comes out at about 0.8k in the .text segment.
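
To check the figure yourself, compile and look at the object file with the size utility (the file name vertex.cpp here is just for illustration, and the exact number will vary with compiler version and flags):

g++ -Os -c vertex.cpp
size vertex.o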

Okay, that's not a problem yet, but now imagine you'd implement a real algorithm, like eliminating all duplicate vertices in a terrain grid (after LOD you may end up with exactly those). Aside from the fact that the STL algorithms are suboptimal for that (O(n²) in a naive program, and a lot of data gets copied around due to RAII), you'll easily go over that L1 code cache limit of 32K.
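
Just to make that concrete, a naive sketch of such a pass might look like the following (hypothetical code, not part of the snippet above; it assumes a pos_equal() comparison member that vertex doesn't currently have). The nested loop is where the O(n²) comes from, and every erase() copies the tail of the vector down by one element:

void remove_duplicate_vertices(vertex_vector &v)
{
        for(vertex_vector::size_type i = 0; i < v.size(); ++i) {
                // scan the rest of the vector for vertices at the same position as v[i]
                for(vertex_vector::size_type j = v.size(); j-- > i + 1; ) {
                        if(v[i].pos_equal(v[j]))
                                v.erase(v.begin() + j); // shifts everything behind j down by one
                }
        }
}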

Now let's look at the equivalent C program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct vertex {
        float pos[3];
} vertex;

typedef struct vertex_array {
        size_t len;
        vertex *v;
} vertex_array;

/* Append one vertex, growing the backing store by a single element per
 * push via realloc(); on allocation failure the array is left unchanged. */
void vertex_array_push_vertex(vertex_array * const VA, vertex const * const v)
{
        vertex * new = 0;
        if(!VA)
                return;

        new = realloc(VA->v, sizeof(vertex)*(VA->len + 1));
        if(!new)
                return;

        VA->v = new;
        memcpy(VA->v + VA->len, v, sizeof(vertex));
        VA->len++;
}

void serialize_vertex(vertex const * const v)
{
        printf("%f %f %f\n", v->pos[0], v->pos[1], v->pos[2]);
}

void serialize_vertex_array(vertex_array const * const V)
{
        int i;
        for(i=0; i < V->len; ++i)
                serialize_vertex(V->v + i);
}

int main()
{
        int i;
        vertex_array V = {0,0};

        for(i=0; i < 50; ++i) {
                vertex v = {{rand(), rand(), rand()}};
                vertex_array_push_vertex(&V, &v);
        }

        serialize_vertex_array(&V);

        return 0;
}


This does the very same thing as the C++ version, yet its .text segment size is only 0.3k. And since you keep working on the very same dataset (no RAII, no duplicates), as a side effect you also don't trash the L1 data cache and the L2 cache as a whole (which is shared between code and data). So if you've got a complex algorithm, doing it in pure C will give you the reduction in size you need to fit it into the L1 code cache, resulting in an order of magnitude better performance.
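
The same check works for the C version (again, the file name is just for illustration):

gcc -Os -c vertex.c
size vertex.o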

bommel

Sometimes I even break it down to asm level (for example when using SSE).
For all of you interested, I can recommend the Intel Software Optimization guide as a good reference. You can download it for free at Intel (don't have the link atm) or pm me.
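
For example, a minimal hypothetical sketch of what going down to the SSE level can look like when done with compiler intrinsics rather than hand-written assembly (the function add4 and its arguments are made up for the illustration):

#include <xmmintrin.h>

// add two blocks of four floats using one SSE load/add/store each,
// instead of hoping the compiler vectorizes an ordinary loop
void add4(float *dst, float const *a, float const *b)
{
        __m128 va = _mm_loadu_ps(a);    // unaligned 4-float load
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(dst, _mm_add_ps(va, vb));
}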