The Bug 🐛
I ran into a very weird bug recently. I have to split a disk into multiple equally sized blocks and generate a hash for each block. I wrote this module three months ago, and all of its unit tests had always passed.
One fine morning, all of a sudden, the unit test started failing. Even weirder, when I ran it again, it passed. The test was failing roughly once in every 20 runs.
To generate the block hashes quickly, I was running parallel jobs. Multiple threads generate hashes for different blocks and store them in a HashMap<Integer, byte[]>. Once all the threads finish, the HashMaps are merged, an assertion checks that all the blocks are accounted for, and one large HashMap with the hashes of all the blocks is returned.
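The merge-and-assert step described above could look roughly like the sketch below. The class name BlockHashMerger, the method name merge, and the assumption that blocks are numbered 0 to blockCount - 1 are all hypothetical; the original module may be organised quite differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges the per-thread maps into one result map.
class BlockHashMerger {
    static Map<Integer, byte[]> merge(List<Map<Integer, byte[]>> partials, int blockCount) {
        Map<Integer, byte[]> merged = new HashMap<>();
        for (Map<Integer, byte[]> partial : partials) {
            for (Map.Entry<Integer, byte[]> entry : partial.entrySet()) {
                // put() returns the previous value, so non-null means a duplicate block.
                if (merged.put(entry.getKey(), entry.getValue()) != null) {
                    throw new IllegalStateException("block hashed twice: " + entry.getKey());
                }
            }
        }
        // Assert that every block index (assumed 0..blockCount-1) is accounted for.
        for (int i = 0; i < blockCount; i++) {
            if (!merged.containsKey(i)) {
                throw new IllegalStateException("missing block: " + i);
            }
        }
        return merged;
    }
}
```

Note that a check like this only confirms that every block has an entry. It cannot tell whether the hash stored for a block was actually computed from that block's data.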
Zeroing in on the culprit
After running the tests over and over and failing to spot any pattern, I fell back on the age-old process of elimination. One thing I did notice was that the incorrect hashes were not random: they were copies of the hashes of other blocks. So hashes were being repeated and shuffled around. And again, this happened only about once in twenty runs. The rest of the time, everything worked fine.
Running the hashing with a single thread seemed to magically solve the problem. There were no more randomly failing test cases. But this was troubling. The parallel jobs are essential for the job to run properly, so I could not eliminate them. On the other hand, I was not sharing anything between the threads (or so I thought) that they could be messing with.
My initial hypothesis was that the storage was getting messed up after the hashes were generated, because that was the only object I was knowingly sharing between the threads. I pass a reference to the storage object, and all the threads use its put() method to store the block identifier and the corresponding hash. I had taken care to use a Hashtable so that writes are synchronised and no race conditions are possible there.
Race Conditions while Reading
I had this notion that, since I am not writing anything to disk and all the threads are only reading data at the same time, race conditions should not be possible, right?
Race conditions are very much possible. It depends on how the threads are scheduled. I realized that I was unintentionally sharing the file descriptor (actually a RandomAccessFile instance) for the raw device file among all the threads (naively telling myself that I was only reading data).
And then it hit me. All the threads were competing with each other, each trying to skip bytes and read bytes through the same file pointer. The problem was not in the storage of the hashes: because of the shared file descriptor, the threads were generating hashes for the wrong locations on the disk.
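To make the failure concrete, here is a minimal sketch that replays one unlucky interleaving on a single thread, so that it is deterministic. It assumes Java's java.io.RandomAccessFile, which keeps a single file pointer per instance. The temporary file, the 4-byte block size, and the class name SharedPointerDemo are made up for illustration and stand in for the raw device file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SharedPointerDemo {
    static final int BLOCK_SIZE = 4; // illustrative block size

    // Simulates "thread A" being interrupted by "thread B" between seek and read.
    static String readBlockZeroAfterInterleaving() {
        try {
            Path tmp = Files.createTempFile("disk", ".img");
            try {
                // Block 0 is "AAAA", block 1 is "BBBB".
                Files.write(tmp, "AAAABBBB".getBytes(StandardCharsets.US_ASCII));
                try (RandomAccessFile shared = new RandomAccessFile(tmp.toFile(), "r")) {
                    shared.seek(0);           // "thread A" positions at block 0
                    shared.seek(BLOCK_SIZE);  // "thread B" moves the same pointer to block 1
                    byte[] buf = new byte[BLOCK_SIZE];
                    shared.readFully(buf);    // "thread A" reads, expecting block 0
                    return new String(buf, StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readBlockZeroAfterInterleaving()); // prints BBBB
    }
}
```

"Thread A" ends up with the contents of block 1 while believing it holds block 0, which matches the symptom of hashes from other blocks showing up in the wrong place. With real threads the interleaving only happens occasionally, which would explain a failure rate of roughly one run in twenty.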
Instead of sharing the RandomAccessFile instance, I let each thread open the disk for reading and have its own file descriptor. The unit tests started passing again, with no more random failures.
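A sketch of that fix, under several assumptions: the hash algorithm (SHA-256 here), the class and method names, and the use of an ExecutorService are all illustrative choices, not details taken from the original module. For simplicity this version opens a fresh RandomAccessFile per block rather than one per thread. The point is only that no file pointer is ever shared between concurrent tasks.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskHasher {
    // Hashes one block through a file handle owned by this call alone.
    static byte[] hashBlock(Path disk, int index, int blockSize) {
        try (RandomAccessFile own = new RandomAccessFile(disk.toFile(), "r")) {
            byte[] buf = new byte[blockSize];
            own.seek((long) index * blockSize); // private file pointer, nobody else moves it
            own.readFully(buf);
            return MessageDigest.getInstance("SHA-256").digest(buf); // assumed algorithm
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hashes all blocks in parallel and collects the results by block index.
    static Map<Integer, byte[]> hashAll(Path disk, int blockCount, int blockSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (int i = 0; i < blockCount; i++) {
                final int index = i;
                Callable<byte[]> task = () -> hashBlock(disk, index, blockSize);
                futures.add(pool.submit(task));
            }
            Map<Integer, byte[]> hashes = new HashMap<>();
            for (int i = 0; i < blockCount; i++) {
                hashes.put(i, futures.get(i).get()); // only the calling thread writes the map
            }
            return hashes;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the results come back through Future objects and are collected by the calling thread, this sketch also avoids sharing the result map between worker threads.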
The Devil is in the Details. Here, it is in the internal structure of