2012-06-20

AA-sort with SSE4

AA-sort is an integer sorting algorithm, which exploits SIMD and multi-core. It is proposed by H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani at 2007(see "A high-performance sorting algorithm for multicore single-instruction multiple-data processors").

I tried to implement it on x86/x64 with SSE4(for only one processor) and verified that it is 2.8~4 times faster than std::sort(STL) for random data.
The source code is https://github.com/herumi/opti/blob/master/intsort.hpp and implementation detail is AA-sort with SSE4.1.

2012-04-12

stristr and findiStr with SSE4.2

I made a function stristr() same as strcasestr() and findiStr(), which is ranged version of stristr(). These functions use SSE4.2 string instructions and three times faster than strcasestr() of glibc. The souce code is https://github.com/herumi/opti/blob/master/str_util.hpp. Please see Quick search algorithm and strstr.

2012-04-04

Quick search algorithm and strstr

Quick Search algorithm(Qs) is a string search algorithm. Qs is a simplification of Boyer-Moore algorithm and faster than it.

cf. http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

So I measured the speed of Qs on Core 2 Duo and found it is faster than strstr for a key string with 4~6 bytes or longer length.

But latest Intel CPU such as Sandy Bridge(Core i7-2600K) or Xeon 5650 has SSE4.2 (string and test processing instructions). So I compared strstr_sse42 with Qs then found strstr_sse42 is 2~4 times faster than Qs for a key string with at least 32-bytes length.

The test code is https://github.com/herumi/opti/blob/master/str_util_test.cpp.

My implementation of strstr is https://github.com/herumi/opti/blob/master/str_util.hpp. It requires Xbyak.

The main code is


/*
  strstr(a, key);
  input key
  use save_a, save_key, c
*/
  movdqu(xm0, ptr [key]); // xm0 = *key
L(".lp");
  pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte]
  if (isSandyBridge) {
    lea(a, ptr [a + 16]);
    ja(".lp"); // if (CF == 0 and ZF = 0) goto .lp
  } else {
    jbe(".headCmp"); // if (CF == 1 or ZF == 1) goto .headCmp
    add(a, 16);
    jmp(".lp");
L(".headCmp");
  }
  jnc(".notFound");
  ...

It is slower to use lea with pcmpistri on Xeon 5650 but a little faster on Sandy Bridge, so I check the type of CPU and generate better code according to each CPU.

I make a slide for detail, so please see http://www.slideshare.net/herumi/x86opti3.