Monday 18 November 2013

Hunting bugs in the jungle.

Problem: Some bugs are like ghosts. They are rarely seen, but they can give you quite a scare.
Some bugs are elusive and (almost) impossible to recreate, but that does not mean they are not there. Such bugs are often written off as ‘not reproducible’, yet they lurk under the surface, just waiting to ruin your day, or the release you have been hacking on for the past weeks.



Solution: Team up and gather all the information you can get, then eliminate the possibilities one by one.

We have recently been puzzled by a bug, one of those that require a lot of investigation to reproduce and document. The root cause was really simple, but pinpointing the origin and reproducing it was hard.

The problem was caused by the faulty population of a dropdown that determined which data the user was working on in the system. The selection made by the user was supposed to be kept in the session and saved in the database, to ensure that the selection was retained even if the user quit and re-entered the application.

While executing the tests we discovered that the selection changed, apparently for no reason. It happened a couple of times, so we started discussing the observation and agreed that this was indeed a bug. It was a bug that we could not recreate, nor could we point to a single plausible root cause.

We did two things to start the hunt for the bug. First, we recorded everything we knew about it in our bug-tracking tool. Then we teamed up to form a bug-hunting crew. We asked the developer who wrote the code to assist with two things: to participate in the discussion of root cause theories, and to enable all logging and stand by for the next sighting of the bug.
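As an illustration of what "enable all logging" can look like in practice, here is a minimal sketch in Python using the standard `logging` module. It is not the configuration from the project described here. The logger name `app.selection` and the helper `log_selection_change` are hypothetical, chosen only to show the kind of detail (user, server instance, old and new value) that is useful to have on record when the bug next appears.

```python
import logging


def enable_debug_logging(logger_name="app.selection"):
    """Switch a (hypothetical) application logger to DEBUG and return it."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach a handler only once, so repeated calls do not duplicate output.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_selection_change(logger, user_id, instance_id, old, new):
    """Record who changed what, and on which server instance it happened."""
    logger.debug(
        "selection changed user=%s instance=%s old=%r new=%r",
        user_id, instance_id, old, new,
    )
```

Logging the instance alongside the value is the design choice that matters in this sketch: it lets a later reader correlate an unexpected change with where the request was served.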

We used the theories to guide the testing, performing the scenarios we believed would reproduce the bug. One by one, the theories were dismissed as the root cause, until we had the bug cornered.

When we finally encountered the problem again, we had plenty of logs to look through and fewer possible root causes to check. That made it much easier to find the bugger and recreate it based on the information we had available.

It turned out that the session handling was not set up correctly, so the dropdown was populated before the saved value had been fetched from the database. It only happened on the rare occasion when a user was transferred from one instance in the cloud to another. This was the reason for the elusive nature of the bug.
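The ordering problem described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual code from the system. `FakeDatabase`, `DEFAULT_SELECTION`, and both `populate_dropdown_*` functions are hypothetical names, and the session is modelled as a plain dictionary that starts out empty when a user lands on a new instance.

```python
DEFAULT_SELECTION = "first-item"  # hypothetical fallback value


class FakeDatabase:
    """In-memory stand-in for the database that persists the selection."""

    def __init__(self):
        self._saved = {}

    def save_selection(self, user_id, value):
        self._saved[user_id] = value

    def load_selection(self, user_id):
        return self._saved.get(user_id)


def populate_dropdown_buggy(session, db, user_id):
    # Bug: the dropdown reads the session BEFORE the saved value is loaded.
    # On a warm session this works; on a fresh instance the session is empty
    # and the user silently gets the default.
    selected = session.get("selection", DEFAULT_SELECTION)
    session["selection"] = db.load_selection(user_id)  # fetched too late
    return selected


def populate_dropdown_fixed(session, db, user_id):
    # Fix: make sure the session holds the saved value before reading it.
    if "selection" not in session:
        saved = db.load_selection(user_id)
        session["selection"] = saved if saved is not None else DEFAULT_SELECTION
    return session["selection"]
```

With a saved selection in the database, the buggy version returns the right value as long as the session is already populated, which is why the problem stays hidden most of the time. It only returns the default when the session starts out empty, as it would after a move to another instance.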

Conclusion: Use root cause guessing/analysis as a guide, and ask your peers for help, when hunting those elusive bugs.

Happy bug hunting!
/Nicolai
