Dec. 19th, 2013

Recently I have been looking into some performance issues on large multi-processor systems like my dual Opteron 6344 server. I last wrote about this in May. Arasan does not scale well on these systems, especially in the endgame.

I have tried various tools to identify performance bottlenecks, including AMD's CodeXL, TAU, and oprofile. In the end, I found oprofile the most useful. It can use AMD's instruction-based sampling (IBS) feature, which uses hardware counters on the chip to measure activity precisely enough that it can be traced to particular source lines (Intel chips have similar features, but my big server box is AMD).

What I have found is that there is non-local memory access in several places in the program. One source of this type of access is the global hash table. But there are at least two others: the evaluation cache, which is currently also global, and the caches maintained by the Scoring module, which are per-thread.

You would think allocating memory on a per-thread basis would alleviate this issue, and to some extent it does. I have recently modified the code to apply the "first touch" rule, so the memory used by a thread is allocated and initialized by that thread when it starts up. This causes the system to place that memory on the node where the thread is running (AMD actually has two NUMA nodes per physical processor, so there are four nodes on my system).
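For illustration, here is a minimal sketch of the first-touch idea (the EvalCache and searchThread names here are hypothetical, not Arasan's actual code): the cache is allocated and zeroed inside the thread function rather than in the main thread, so under Linux's default page-placement policy its pages land on the node where the worker runs.

```cpp
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical per-thread cache, large enough to span many pages.
struct EvalCache {
    static const std::size_t SIZE = 1 << 20;
    uint64_t entries[SIZE];
};

void searchThread() {
    // Allocate *and* write the cache from within the worker thread.
    // Under Linux's default first-touch policy, the pages end up on
    // the NUMA node where this thread is running when it writes them.
    EvalCache *cache = new EvalCache;
    std::memset(cache->entries, 0, sizeof(cache->entries));

    // ... search using 'cache' ...

    delete cache;
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(searchThread);
    for (auto &t : threads)
        t.join();
    return 0;
}
```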

However, in my code the thread will eventually enter an idle state controlled by a condition variable/mutex. When that lock is released and the thread becomes active again, it may be scheduled on a different node. In the current state of things, Linux does not move the thread's memory to that other node, so the thread will start doing non-local memory access. (There are proposals to add memory migration to Linux, but as far as I know this is not implemented yet.)
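One way to keep a thread from wandering away from its memory is to bind it to a node before it first touches its cache. Here is a sketch using libnuma; I am not claiming this is what Arasan does, just illustrating the mechanism:

```cpp
#include <numa.h>   // libnuma; link with -lnuma
#include <cstdio>

// Bind the calling thread to one NUMA node. Done before first-touch
// initialization, this keeps the thread on the same node as the
// memory it initializes, even across idle/wakeup cycles.
bool bindToNode(int node) {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return false;
    }
    // Restrict this thread's execution to CPUs on 'node' ...
    if (numa_run_on_node(node) != 0)
        return false;
    // ... and prefer that node for its future page allocations.
    numa_set_preferred(node);
    return true;
}
```

The tradeoff is that binding takes scheduling flexibility away from the kernel, which can hurt if the chosen node's CPUs are busy.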

I have some experimental code that adds NUMA awareness, so that caches are assigned to nodes and a thread always accesses the one on its local node. It is too early to tell how much this will help, but it will probably improve things. I have found and fixed a few other hotspots, too.
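As a sketch of what that might look like (the ScoreCache, initNodeCaches, and localCache names are hypothetical): allocate one cache per node with libnuma, and have each probe ask which node the thread is on right now.

```cpp
#include <numa.h>    // libnuma; link with -lnuma
#include <sched.h>   // sched_getcpu()
#include <cstdint>
#include <new>
#include <vector>

struct ScoreCache {
    uint64_t entries[1 << 16];
    // ... other per-node cache contents ...
};

static std::vector<ScoreCache *> nodeCaches;

// Called once at startup (after checking numa_available()).
void initNodeCaches() {
    int nodes = numa_max_node() + 1;
    nodeCaches.resize(nodes);
    for (int n = 0; n < nodes; ++n) {
        // numa_alloc_onnode places the pages on a specific node.
        void *mem = numa_alloc_onnode(sizeof(ScoreCache), n);
        nodeCaches[n] = new (mem) ScoreCache;
    }
}

// Use the cache that lives on whatever node the thread is running on
// right now, so accesses stay local even after the scheduler has
// moved the thread.
ScoreCache *localCache() {
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0) node = 0;  // fall back if the lookup fails
    return nodeCaches[node];
}
```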

Dec. 11th, 2013

Arasan 16.3 was released earlier this month. It has some bug fixes and search improvements.

Recently I have been looking into some issues in the global hashtable implementation. Coincidentally, there is a long discussion on some of these issues on talkchess.com.

Arasan, like many other programs, uses the "lockless hashing" method first described and implemented by Bob Hyatt. This method does not use a mutex or other exclusion mechanism to keep multiple threads from writing to the hashtable. Instead it relies on an XOR operation that combines the hash code for an entry with that entry's stored value: when storing, the XOR'ed value is put into the table as the entry's key; when retrieving, the XOR of the stored key and the stored value must match the hash code of the position being probed, or the entry is rejected. This does not eliminate the collisions that can occur when one thread is writing to the table while another is reading it, but it does greatly reduce their frequency.
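In outline the scheme looks something like this (a simplified sketch, not Arasan's actual structures):

```cpp
#include <cstdint>

struct HashEntry {
    uint64_t key;    // position hash XOR'ed with the data word
    uint64_t data;   // packed score, depth, best move, flags, etc.
};

// Lock-free store: because the key field is hash ^ data, an entry
// whose two words were written by different threads (a "torn" entry)
// will fail the XOR test on retrieval.
void hashStore(HashEntry *table, uint64_t mask,
               uint64_t hash, uint64_t data) {
    HashEntry &e = table[hash & mask];
    e.key = hash ^ data;
    e.data = data;
}
```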

In looking at how this was done in Arasan, I determined that the existing code was faulty: it was not copying the hash entry before doing the equivalency test. So I fixed that. But even so, it appears the lockless hashing implementation is not bulletproof. There are two crucial pieces of information stored in a hashtable entry: one is the best move for the current position (if there is one), and the other is the check status (whether or not the side to move is in check). It is essential that this data be valid. Fetching and executing an illegal move from the hashtable can be fatal (leading to a program crash), and so can mistakenly setting the check status to an incorrect value.
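Continuing the sketch above, the probe has to copy both words of the entry into locals before the equivalency test, and hand back only the verified copy:

```cpp
// Lock-free probe: copy the entry first, then verify. Re-reading the
// table slot after verification would open a window in which another
// thread could overwrite it with data the XOR test never saw.
bool hashProbe(const HashEntry *table, uint64_t mask,
               uint64_t hash, uint64_t &dataOut) {
    const HashEntry &slot = table[hash & mask];
    uint64_t key = slot.key;    // copy both words ...
    uint64_t data = slot.data;  // ... before the equivalency test
    if ((key ^ data) != hash)
        return false;           // miss, collision, or torn write
    dataOut = data;             // hand back only the verified copy
    return true;
}
```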

Arasan 16.3 does try to validate moves from the hashtable (without doing a strict legality check). After fixing the lockless hashing, I tried removing this validation to save time, but I have concluded that is unsafe. I have also recently taken out the check status flag, because relying on it instead of computing the check status directly is also unsafe. I did not see problems frequently, but over the course of a long time-control game it was possible for the program to fetch invalid data and crash. For some reason this was more frequent on Linux than on Windows.
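A validation of that kind might look like the following sketch; the Board, Move, and Piece types here are minimal stand-ins for illustration, not Arasan's real interfaces:

```cpp
// Hypothetical, minimal board/move types for illustration only.
enum Color { White, Black };
struct Piece { bool empty; Color color; };
struct Move { int from, to; bool null; };

struct Board {
    Piece squares[64];
    Color stm;
    const Piece &pieceAt(int sq) const { return squares[sq]; }
    Color sideToMove() const { return stm; }
};

// Cheap sanity checks on a hash move before executing it. This is not
// a full legality test (it does not verify the king is left out of
// check), but it rejects the garbage produced by a hash collision.
bool looksValid(const Board &board, const Move &m) {
    if (m.null || m.from < 0 || m.from > 63 || m.to < 0 || m.to > 63)
        return false;
    const Piece &p = board.pieceAt(m.from);
    // The moving piece must exist and belong to the side to move.
    if (p.empty || p.color != board.sideToMove())
        return false;
    // The destination may not hold one of our own pieces.
    const Piece &d = board.pieceAt(m.to);
    if (!d.empty && d.color == board.sideToMove())
        return false;
    return true;
}
```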

Anyway, the current GitHub source has both the lockless hashing fix and a more conservative approach to using data from the hash table, just in case the lockless hashing fails to work properly and a collision occurs.