Diagnosing a multithreaded programming issue in Flink’s unit tests

Unfortunately, my contribution to resolve FLINK-11568 (Exception in Kinesis ShardConsumer hidden by InterruptedException) introduced a race condition in the unit test code, which led to the flaky unit tests reported in FLINK-12595. Once it was brought to my attention, I made a point of fixing the issue quickly. Race conditions can be difficult to reproduce experimentally and to explain, so to pinpoint the problem I wanted a full understanding of the multithreaded test and the surrounding code. I drew a large ad hoc diagram (mainly inspired by UML sequence diagrams) on a whiteboard, which I thought would be fun to post here.

Each color represents a different thread: blue is the unit test (left), black is the consumer thread (middle), and green is the KinesisShardConsumer thread (right). Ordering guarantees exist only within each lifeline (vertical line); multiple lifelines can run in parallel on the same object, driven by different threads. By drawing the diagram carefully, paying particular attention to blocking method calls and ordering guarantees, I was able to analyze the logic well enough to form a hypothesis about the likely cause of the issue (circled in red in the picture) and then confirm that hypothesis in the code. After that, fixing it was the easy part! The whole exercise took a couple of hours, but I enjoyed it and was happy to provide a solution quickly.
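
To make the point about ordering guarantees concrete, here is a minimal Java sketch. It is not the Flink test code or the actual fix; the class, method, and thread names are hypothetical and chosen only for illustration. A worker thread records an error, and the calling thread must not read it until something establishes an ordering between the two lifelines. Here a CountDownLatch plays that role.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; none of these names come from Flink.
public class LatchOrderingSketch {

    /** Starts a worker that fails, waits for it to finish, and returns the recorded error. */
    static Throwable runWorkerAndAwaitError() throws InterruptedException {
        AtomicReference<Throwable> error = new AtomicReference<>();
        CountDownLatch finished = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            try {
                // Stand-in for a failure inside a consumer thread.
                throw new IllegalStateException("simulated consumer failure");
            } catch (Throwable t) {
                error.set(t); // recorded on the worker's lifeline
            } finally {
                finished.countDown(); // signal the other lifeline that the write is done
            }
        }, "hypothetical-consumer");
        worker.start();

        // Without this wait, the read below could run before the worker's write,
        // because ordering is only guaranteed within a single thread.
        if (!finished.await(5, TimeUnit.SECONDS)) {
            throw new AssertionError("worker did not finish in time");
        }
        return error.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Throwable t = runWorkerAndAwaitError();
        System.out.println("observed: " + t.getMessage());
    }
}
```

If the await call is removed, the method may return null on some runs and the exception on others, which is the kind of flakiness that is hard to reproduce on demand but easy to spot once the lifelines are drawn side by side.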