2011-10-03

How to profile JIT code with VTune

I previously wrote about profiling JIT code with CodeAnalyst. This time I describe how to do it with VTune.

If you profile a program that runs JIT code, VTune knows nothing about that code.


The function "[Unknown stack frame(s)]" in this screenshot corresponds to the JIT code, but VTune can't show any details about it.

So let's give VTune the information about the JIT code by calling its API.

First, include "jitprofiling.h" from /Intel/VTune Amplifier XE/include and link the following libraries.


#include "jitprofiling.h"
#pragma comment(lib, "libittnotify.lib")
#pragma comment(lib, "jitprofiling.lib")

Next, write a utility function that registers a piece of JIT code as follows:

void SetJitCode(void *ptr, size_t size, const char *name)
{
  iJIT_Method_Load jmethod = {0};
  jmethod.method_id = iJIT_GetNewMethodID();
  jmethod.class_file_name = "";
  jmethod.source_file_name = __FILE__;

  jmethod.method_load_address = ptr;
  jmethod.method_size = size;
  jmethod.line_number_size = 0;

  jmethod.method_name = const_cast<char*>(name);
  int ret = iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, (void*)&jmethod);
  printf("iJIT_NotifyEvent ret=%d\n", ret);
}

In your code, call SetJitCode with a pointer to the JIT function, its size, and its name.

Finally, call iJIT_NotifyEvent(iJVM_EVENT_TYPE_SHUTDOWN, NULL); to tell VTune that profiling has ended.
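
To see why the address, the size and the name are all needed, here is a minimal sketch of what a profiler can do with them: keep a table of registered ranges and map each sampled instruction pointer to a name. This is not VTune's implementation. JitSymbolTable, add, resolve and the "[Unknown]" label are hypothetical names used only for illustration, and overlapping ranges are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical table of registered JIT regions, keyed by start address.
class JitSymbolTable {
public:
    // Record that [ptr, ptr + size) holds the JIT function `name`.
    void add(const void *ptr, std::size_t size, const std::string& name)
    {
        assert(size > 0);
        const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(ptr);
        regions_[begin] = Region{begin + size, name};
    }
    // Map a sampled instruction pointer to a registered name.
    // Addresses outside every registered range stay unknown.
    std::string resolve(const void *ip) const
    {
        const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ip);
        auto it = regions_.upper_bound(addr); // first region starting after addr
        if (it == regions_.begin()) return "[Unknown]";
        --it; // last region starting at or before addr
        if (addr < it->second.end) return it->second.name;
        return "[Unknown]";
    }
private:
    struct Region {
        std::uintptr_t end; // one past the last byte of the region
        std::string name;
    };
    std::map<std::uintptr_t, Region> regions_;
};
```

Without a registered range, every sample that lands in JIT memory falls into the unknown bucket, which matches what the first screenshot shows.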


Profile the modified program with VTune, and we get the following results:



The "[Unknown stack frame(s)]" area is now split into separate functions such as "Fp2Dbl::mod", "Fp2Dbl::mulOpt2" and so on.
Moreover, we can see the generated assembly code of these functions.

 

I wrote sample code for JIT profiling.
Please see prof.cpp and mkprof.bat.

2011-09-20

How to profile JIT code with CodeAnalyst

CodeAnalyst is a good profiler tool.

It is easy to use, but analyzing JIT code with it in the standard way is difficult.
For example, let's try the following JIT sample code with Xbyak.

// a sample for JIT code
// cl t.cpp /EHsc /Zi /Ox
#include <stdio.h>
#include <xbyak/xbyak.h>
struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            sub(eax, 1);
        }
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

struct Code2 : public Xbyak::CodeGenerator {
    Code2()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            xorps(xm0, xm0);
        }
        sub(eax, 1);
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

// ordinary (non-JIT) functions called from main
double s1(int n)
{
    double r = 0;
    for (int i = 0; i < n; i++) {
        r += 1.0 / (i + 1);
    }
    return r;
}

double s2(int n)
{
    double r = 0;
    for (int i = 0; i < n; i++) {
        r += 1.0 / (i * i + 1) + 2.0 / (i + 3);
    }
    return r;
}

int main()
{
    Code c;
    Code2 c2;
    int (*f)() = (int (*)())c.getCode();
    int (*g)() = (int (*)())c2.getCode();
    double sum = 0;
    for (int i = 0; i < 20000; i++) {
        sum += s1(i);
        sum += s2(i);
    }
    printf("sum=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += f();
    }
    printf("f=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += g();
    }
    printf("g=%f\n", sum);
}

Here is the profile that CodeAnalyst produces.


The module t.exe is the target program; in addition, there is an unknown module, pid(4972). Clicking it gives no further information.



The profiler can't analyse JIT code because there is no symbol information about it.

AMD provides an API for dealing with JIT code, so let's use it. Unfortunately, there is no documentation for the API, so I had to work it out by trial and error. The version of CodeAnalyst I used is 3.2.962.731.

First, include CAJITNTFLib.h from <installed directory>/CodeAnalyst/API/include. The header defines three functions.
  1. Initialization
    CAJIT_Initialize();
  2. Register
    CAJIT_LogJITCode(); Specify the address, the size and the name of the JIT code.
  3. Termination
    CAJIT_CompleteJITLog();
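
Because the termination call has to be paired with the initialization call, one option is to wrap the pair in a small scope guard. The sketch below is my own assumption and not part of the CodeAnalyst API: JitLogSession is a hypothetical name, and the two steps are passed in as callbacks so that the block compiles without the SDK.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard: runs `init` on construction and `complete`
// on destruction, so termination is not skipped on an early return.
class JitLogSession {
public:
    JitLogSession(const std::function<void()>& init, std::function<void()> complete)
        : complete_(std::move(complete))
    {
        assert(init && complete_); // both callbacks must be set
        init();
    }
    ~JitLogSession() { complete_(); }
    // Non-copyable, so `complete` runs exactly once.
    JitLogSession(const JitLogSession&) = delete;
    JitLogSession& operator=(const JitLogSession&) = delete;
private:
    std::function<void()> complete_;
};
```

In a real program the first callback would call CAJIT_Initialize and the second CAJIT_CompleteJITLog, with the CAJIT_LogJITCode calls made while the guard object is alive.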

Here is sample code that uses these APIs.

 
#include <stdio.h>
#include <math.h>
#include <xbyak/xbyak.h>

#ifdef _WIN64
#define AMD64
#endif
#include "CAJITNTFLib.h"
#pragma comment(lib, "CAJitNtfyLib.lib")

struct Code : public Xbyak::CodeGenerator {
    Code()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            sub(eax, 1);
        }
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

struct Code2 : public Xbyak::CodeGenerator {
    Code2()
    {
        mov(eax, 1000000);
    L("@@");
        for (int i = 0; i < 10; i++) {
            xorps(xm0, xm0);
        }
        sub(eax, 1);
        jg("@b");
        mov(eax, 1);
        ret();
    }
};

double s1(int n)
{
    double r = 0;
    for (int i = 0; i < n; i++) {
        r += 1.0 / (i + 1);
    }
    return r;
}

double s2(int n)
{
    double r = 0;
    for (int i = 0; i < n; i++) {
        r += 1.0 / (i * i + 1) + 2.0 / (i + 3);
    }
    return r;
}

int main()
{
    Code c;
    Code2 c2;
    int (*f)() = (int (*)())c.getCode();
    int (*g)() = (int (*)())c2.getCode();
    // cast so the arguments match the %p and %d format specifiers
    printf("f:%p, %d\n", (void*)f, (int)c.getSize());
    printf("g:%p, %d\n", (void*)g, (int)c2.getSize());

    CAJIT_Initialize();
    CAJIT_LogJITCode((size_t)f, c.getSize(), L"f");
    CAJIT_LogJITCode((size_t)g, c2.getSize(), L"g");

    double sum = 0;
    for (int i = 0; i < 20000; i++) {
        sum += s1(i);
        sum += s2(i);
    }
    printf("sum=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += f();
    }
    printf("f=%f\n", sum);
    for (int i = 0; i < 2000; i++) {
        sum += g();
    }
    printf("g=%f\n", sum);
    CAJIT_CompleteJITLog();
    puts("end");
}

Build this code with the correct include and library paths for the API, and I get the following result.






"JIT Code" now appears as the process name instead of the unknown module pid, and the symbols are shown with the names "f" and "g".


We can see the generated JIT code by clicking the symbols!
(The snapshot is for Code2)

2011-08-27

fast double precision exponential function with SSE

I made a fast double-precision exponential function using SSE2.

fmath.hpp (https://github.com/herumi/fmath, fast approximate float function fmath)





benchmark of fmath::expd
CPU                  OS                compiler   std::exp  fmath::expd  one element for fmath::expd_v (array version)
Xeon X5650 2.67GHz   64-bit Linux      gcc 4.6.0  128.89    27.38        17.84
i7-2600 3.4GHz       64-bit Linux      gcc 4.4.5  69.11     12.10        8.25
i7-2600 3.4GHz       64-bit Windows 7  VC10       36.33     14.37        7.08

The function double fmath::expd(double) defined in fmath.hpp is about five times faster than std::exp of gcc 4.6 on 64-bit Linux and about 2.5 times faster than that of Visual Studio 2010 on 64-bit Windows.

The RMS (root mean square) error over 1000000 points generated from the standard normal distribution is about 1.117645e-16.
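
A measurement of this kind can be sketched as follows. This is not the code in fastexp.cpp: rmsRelativeError is a hypothetical helper, I assume here that the error means the relative error against std::exp, and the seed is arbitrary. taylorExp4 is only a crude stand-in so that the block runs on its own; it is not fmath's algorithm.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical helper: RMS of the relative error of f(x) against std::exp(x)
// for n points drawn from the standard normal distribution.
template<class F>
double rmsRelativeError(F f, std::size_t n, unsigned int seed = 1)
{
    assert(n > 0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> dist(0.0, 1.0);
    double sumSq = 0;
    for (std::size_t i = 0; i < n; i++) {
        const double x = dist(rng);
        const double expected = std::exp(x);
        const double rel = (f(x) - expected) / expected;
        sumSq += rel * rel;
    }
    return std::sqrt(sumSq / n);
}

// Stand-in approximation for demonstration only:
// the degree-4 Taylor polynomial of exp around 0, in Horner form.
inline double taylorExp4(double x)
{
    return 1 + x * (1 + x * (0.5 + x * (1.0 / 6 + x * (1.0 / 24))));
}
```

To check a real implementation, pass the function under test as f, for example rmsRelativeError(fmath::expd, 1000000) after including fmath.hpp.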

The source code for benchmark is fastexp.cpp, which requires Xbyak.

I wrote down results for various environments in the header comment of fastexp.cpp.

Moreover, fmath.hpp provides fmath::exp(float) and fmath::log(float).
These functions are also 2 to 5 times faster than those of the standard library.

Let's try it if you want speed.