This is a continuation of the previous post

Thanks to the commenters who suggested using other datastructures rather than an unordered array. The reason I'm going for linear search is that it allows the searching of an index sorted for other purposes. The fewer indexes I use, the less data has to be in memory and the larger datasets I can work with. In short, fast-enough linear searching means more data in memory.

I'm pretty sure I'm not going to get near the 100ms at 800mhz I was aiming for yesterday. I wrote the assembler code this evening and did some preliminary tests.

The idea is that I load each element in the array, mod the value to the size of the bloom filter and then see if the corresponding bit is set in the filter and escape out if so. X86 assembler has a BT instruction that tests if an individual bit in memory is set. That's got to be faster than doing all the loading ANDing and bitshifting myself.

Unfortunately the factor X86 assembler doesn't support the BT instruction since it's just designed for generating machinecode for the compiler. A quick look at the intel manuals gave me the opcodes I needed to add it though:


: BT ( operand operand -- ) { HEX: 0F HEX: A3 } 2-operand ;

The assembler loop that does the work looks like this in factor's X86 DSL:


: %idx-matching-filter ( -- )
  "loop" define-label "end" define-label

  "filter" get dup %unbox-byte-array               ! unbox input params
  "ptr" get dup %unbox-alien
  "i" operand %untag-fixnum
  "val" operand %untag-fixnum        ! val contains length at this point

  ds-reg [] "val" operand MOV        ! free up val register for tmp use

"loop" resolve-label                      
  "val" operand "ptr" operand "i" operand [+] MOV  
  "val" operand 1023 AND   ! val mod 1024 
  "val" operand "filter" operand [] BT
  "end" get JB

  "i" operand 4 ADD
  "i" operand ds-reg [] CMP
  "loop" get JNE

"end" resolve-label
   "i" operand %tag-fixnum
;

Notice that it's full of '"foo" operand' clauses - factor maps these to registers (e.g. EAX, EBX etc..) according to function parameters. This is really handy and in this case saves doing the stack->register mapping manually. For completeness, here's the gas dissassembly which includes unpacking registers from the stack and unboxing integers:


( scratchpad ) \ idx-matching-filter disassemble
[Thread debugging using libthread_db enabled]
[New Thread 0xb79dc6c0 (LWP 26068)]
[New Thread 0xb12f1b90 (LWP 26069)]
0xb7f18410 in __kernel_vsyscall ()
Dump of assembler code from 0xb17028e0 to 0xb1702927:
0xb17028e0: mov    $0xb17028e0,%ebx
0xb17028e5: mov    -0xc(%esi),%eax
0xb17028e8: mov    -0x8(%esi),%ecx
0xb17028eb: mov    -0x4(%esi),%edx
0xb17028ee: mov    (%esi),%ebp
0xb17028f0: lea    0x5(%eax),%eax
0xb17028f3: mov    0x9(%ecx),%ecx
0xb17028f6: sar    $0x3,%edx
0xb17028f9: sar    $0x3,%ebp
0xb17028fc: mov    %ebp,(%esi)
0xb17028fe: mov    (%ecx,%edx,1),%ebp
0xb1702901: and    $0x3ff,%ebp
0xb1702907: bt     %ebp,(%eax)
0xb170290a: jb     0xb170291b
0xb1702910: add    $0x4,%edx
0xb1702913: cmp    (%esi),%edx
0xb1702915: jne    0xb17028fe
0xb170291b: shl    $0x3,%edx
0xb170291e: mov    %edx,-0xc(%esi)
0xb1702921: sub    $0xc,%esi
0xb1702924: ret    

I've got to go to bed soon so I just tried the best case of having an empty filter (and thus no matches): I was able to get 171ms for the 48MB array on a processor pegged at 800mhz, which isn't the sub-100ms I was hoping for but is still pretty good. I'll maybe experiment with real data and filter sizes tomorrow if I get any spare time in the evening.