herumi's technical notes

2012-06-20

AA-sort with SSE4

AA-sort is an integer sorting algorithm, which exploits SIMD and multi-core. It is proposed by H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani at 2007(see "A high-performance sorting algorithm for multicore single-instruction multiple-data processors").

I tried to implement it on x86/x64 with SSE4(for only one processor) and verified that it is 2.8~4 times faster than std::sort(STL) for random data.
The source code is https://github.com/herumi/opti/blob/master/intsort.hpp and implementation detail is AA-sort with SSE4.1.

2012-04-12

stristr and findiStr with SSE4.2

I made a function stristr() same as strcasestr() and findiStr(), which is ranged version of stristr(). These functions use SSE4.2 string instructions and three times faster than strcasestr() of glibc. The souce code is https://github.com/herumi/opti/blob/master/str_util.hpp. Please see Quick search algorithm and strstr.

2012-04-04

Quick search algorithm and strstr

Quick Search algorithm(Qs) is a string search algorithm. Qs is a simplification of Boyer-Moore algorithm and faster than it.

cf. http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

So I measured the speed of Qs on Core 2 Duo and found it is faster than strstr for a key string with 4~6 bytes or longer length.

But latest Intel CPU such as Sandy Bridge(Core i7-2600K) or Xeon 5650 has SSE4.2 (string and test processing instructions). So I compared strstr_sse42 with Qs then found strstr_sse42 is 2~4 times faster than Qs for a key string with at least 32-bytes length.

The test code is https://github.com/herumi/opti/blob/master/str_util_test.cpp.

My implementation of strstr is https://github.com/herumi/opti/blob/master/str_util.hpp. It requires Xbyak.

The main code is

/*
  strstr(a, key);
  input key
  use save_a, save_key, c
*/
  movdqu(xm0, ptr [key]); // xm0 = *key
L(".lp");
  pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte]
  if (isSandyBridge) {
    lea(a, ptr [a + 16]);
    ja(".lp"); // if (CF == 0 and ZF = 0) goto .lp
  } else {
    jbe(".headCmp"); // if (CF == 1 or ZF == 1) goto .headCmp
    add(a, 16);
    jmp(".lp");
L(".headCmp");
  }
  jnc(".notFound");
  ...

It is slower to use lea with pcmpistri on Xeon 5650 but a little faster on Sandy Bridge, so I check the type of CPU and generate better code according to each CPU.

I make a slide for detail, so please see http://www.slideshare.net/herumi/x86opti3.

2011-10-03

How to profile JIT code with VTune

I wrote the way to profile JIT code with CodeAnalyst before. Here, I write how to profile it with VTune.

If you profile a program using JIT code, then VTune can't know about it.

The Function "[Unknown stack frame(s)]" in this screenshot shows a JIT code, but VTune can't go into details.

So, lets' tell VTune the information of JIT code by calling VTune's API.

At first, include "jitprofiling.h" in /Intel/VTune Amplifier XE/include and link some libraries.

#include "jitprofiling.h"
#pragma comment(lib, "libittnotify.lib")
#pragma comment(lib, "jitprofiling.lib")

Next, make an utility function to register a JIT code as following:

void SetJitCode(void *ptr, size_t size, const char *name)
{
  iJIT_Method_Load jmethod = {0};
  jmethod.method_id = iJIT_GetNewMethodID();
  jmethod.class_file_name = "";
  jmethod.source_file_name = __FILE__;

  jmethod.method_load_address = ptr;
  jmethod.method_size = size;
  jmethod.line_number_size = 0;

  jmethod.method_name = const_cast(name);
  int ret = iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, (void*)&jmethod);
  printf("iJIT_NotifyEvent ret=%d\n", ret);
}

Call SetJitCode with a pointer of JIT function, size of it and name in your code.

Finaly, call iJIT_NotifyEvent(iJVM_EVENT_TYPE_SHUTDOWN, NULL); to notify VTune about the end of profiling.

Profile the modified program with VTune, then we can get the following results:

The area of "[Unknown stack frame(s)]" splits into some different functions such as "Fp2Dbl::mod", "Fp2Dbl::mulOpt2" and so on.
Moreover, we can see the generated assembly code of these functions.

I write a sample code for JIT profiling.
Please see prof.cpp and mkprof.bat.

2011-09-20

How to profile JIT code with CodeAnalyst

CodeAnalyst is a good profiler tool.

It is easy to use, but it is difficult to analyze JIT code in standard way.
For example, let's try the following JIT sample code with Xbyak.

// a sample for JIT code
// cl t.cpp /EHsc /Zi /Ox
#include <xbyak/xbyak.h>
struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            sub(eax, 1);
        }
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

struct Code2 : public Xbyak::CodeGenerator {
    Code2()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            xorps(xm0, xm0);
        }
        sub(eax, 1);
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

int main()
{
    Code c;
    Code2 c2;
    int (*f)() = (int (*)())c.getCode();
    int (*g)() = (int (*)())c2.getCode();
    double sum = 0;
    for (int i = 0; i < 20000; i++) {
        sum += s1(i);
        sum += s2(i);
    }
    printf("sum=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += f();
    }
    printf("f=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += g();
    }
    printf("g=%f\n", sum);
}

I get the result of profile by CodeAnalyst.

The module t.exe is target program and in addtion, there is an unknown module pid(4972). We can't get more information about it if we click it.

The profiler can't analyse JIT code because there is no symbol information about it.

AMD provides an API to deal with JIT code, so let's use it. To my regret, there is no documents for API, so I fumble with it. The version of CodeAnalyst I used is 3.2.962.731.

At first, include CAJITNTFLib.h in the <installed directory>/CodeAnalyst/API/include. Three functions are defined in the header.

Initialization
CAJIT_Initialize();
Register
CAJIT_LogJITCode(); Specify the address and the size of JIT code.
Termination
CAJIT_CompleteJITLog();

Here is a sample code with these APIs.

#include <stdio.h>
#include <math.h>
#include <xbyak/xbyak.h>

#ifdef _WIN64
#define AMD64
#include "CAJITNTFLib.h"
#pragma comment(lib, "CAJitNtfyLib.lib")

struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            sub(eax, 1);
        }
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

struct Code2 : public Xbyak::CodeGenerator {
    Code2()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            xorps(xm0, xm0);
        }
        sub(eax, 1);
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

double s1(int n)
{
    double r = 0;
    for (int i = 0; i < n; i++) {
        r += 1.0 / (i + 1);
    }
    return r;
}

double s2(int n)
{
    double r = 0;
    for (int i = 0; i < n; i++) {
        r += 1.0 / (i * i + 1) + 2.0 / (i + 3);
    }
    return r;
}

int main()
{
    Code c;
    Code2 c2;
    int (*f)() = (int (*)())c.getCode();
    int (*g)() = (int (*)())c2.getCode();
    printf("f:%p, %d\n", f, c.getSize());
    printf("g:%p, %d\n", g, c2.getSize());

    CAJIT_Initialize();
    CAJIT_LogJITCode((size_t)f, c.getSize(), L"f");
    CAJIT_LogJITCode((size_t)g, c2.getSize(), L"g");

    double sum = 0;
    for (int i = 0; i < 20000; i++) {
        sum += s1(i);
        sum += s2(i);
    }
    printf("sum=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += f();
    }
    printf("f=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += g();
    }
    printf("g=%f\n", sum);
    CAJIT_CompleteJITLog();
    puts("end");
}

Build this code with correct path of include and lib of API, then I get the following result.

"JIT Code" Process Name appeared in stead of unknown module pid and I get the name "f" and "g" symbols.

We can see the generated JIT code if clicking the symbols!
(The snapshot is for Code2)

2011-08-27

fast double precision exponential function with SSE

I make a fast double precision exponential function using SSE2.

fmath.hpp (https://github.com/herumi/fmath, fast approximate float function fmath)

benchmark of fmath::expd
CPU	OS	compiler	std::exp	fmath::expd	one element for fmath::expd_v(array version)
Xeon X5650 2.67GHz	64-bit Linux	gcc 4.6.0	128.89	27.38	17.84
i7-2600 3.4GHz	64-bit Linux	gcc 4.4.5	69.11	12.10	8.25
i7-2600 3.4GHz	64-bit Windows 7	VC10	36.33	14.37	7.08

The function double fmath::expd(double) defined in fmath.hpp is about five time faster than std::exp of gcc-4.6 on 64-bit Linux and about two point five faster than that of Visual Studio 2010 on 64-bit Windows.

The error of rms (Root Mean Square) for 1000000 points generated from standard normal distribution is about 1.117645e-16.

The source code for benchmark is fastexp.cpp, which requires Xbyak.

I write some results for various environments in the comment of the header of fastexp.cpp.

Moreover, fmath.hpp provies fmath::exp(float) and fmath::log(float).
These functions are also 2~5 times faster than those of standard library.

Let's try it if you want speed.