This time, I thought I'd take a different angle. Instead of reasoning about software I'll just write about specific acts of programming in specific programs. More precisely, I want to write about specific bugs that I had the (dis)pleasure of tracking down.
I took my first full-time programming job ten years ago, almost to the day. During this period I introduced and solved many bugs (hopefully the difference between #introduced and #solved is not very large). My criterion for choosing which bugs make it into this list is memorability: I chose the bugs that I remember most clearly. When discussing bugs, memorability is a key factor. The more painful the bug, the greater the mark it leaves on your memory. Thus, in some sense, I am about to list my most painful bugs from the last decade.
To make things interesting, this post will disclose (mostly) the descriptions of the bugs. The cause/solution will be disclosed in the next post. This will give your programmer's mind something to chew on for a few days. For the same reason I also (occasionally) omitted pieces of information whenever I felt that their inclusion in the bug description would make the solution too obvious.
#5: Only guests are welcome - 2008, JavaScript
So my partner and I are about to demo a web app that we quickly prototyped in a couple of weeks. It is an on-line discussion system. Buzzwords used: database, web-server, Ajax, CSS. We also had a decent suite of tests.
It is 8:45am. The demo will start in 15 minutes. I am doing one last quick run of the demo. I log in as an anonymous guest. I browse through the pages and everything is fine. I try to log in as a user. Name: "Noga" (that's my dog's name). Password: "hithere". Oops. No response. Not even a "login failed" message.
I know that login is doing some Ajax-ing so I quickly try to bypass the Ajax mechanism by manually forming a login URL containing my credentials. It works. So I now know that the defect is Ajax related but I don't have the time to rewrite the login page such that it will not use Ajax, let alone test it.
8:55. The Anxiety-meter is screaming. We come up with a cunning solution. We start the demo with an already logged-in user (by secretly entering the login URL before the demo starts). We walk through all the pages. Then we log out and show that everything works for a guest user. Luckily, we are not asked to go back to a non-guest user. We are saved.
#4: The unimaginably slow load - 2005, Java
I am working on a Java mass analysis tool. It is a program that digests jar files and classifies the classes therein based on a set of rules. The program is a console app which supports operations such as these:
- import: import a new jar file into the system and give it the specified name
- load: load a previously imported jar file into memory
- classify: run the classification rules on all currently loaded jars and save the results into the specified file (.csv)
The import command scans the jar file and transforms it into my own data structure, which is optimized for the type of analysis I need to do. The data structure is then saved to the file system. The load command reads the saved file back into memory.
For a few days I am working on adding new classification rules. When developing new rules I tend to work with a set of small Jars. This minimizes the classification time, thus shortening the feedback cycle. From time to time I find myself also changing the implementation of the underlying data structure (fixing bugs, adding new operations which are needed by the rules, etc.).
At some point I decide that my new rules are complete and I put them to work on several real inputs: large zip files with > 100K classes. I notice that both the import and load commands run much slower than before. In fact, they run so slowly that they wipe out all the benefits of having my own optimized data structures.
#3: Modal smiles, Modeless cries - 2001, C
A Win32 content-editing app. The user can add, edit, search for, browse through, and delete records. There is also a Record Checker mechanism that scans all records and detects all sorts of irregularities such as broken cross-record links, duplicate names, etc.
The output of the checker is a list of error messages. The messages are displayed in a new window. Clicking on a message opens the corresponding record in the main window.
Originally the error window was a modal window: it disabled the main window. Clicking on an error closed the output window and reactivated the main window. We then decided that it makes much more sense to make this a modeless one. I implemented this change and was astonished to see that the click functionality stopped working. Clicking on an error (in the modeless window) created an access violation and a total eclipse of the app. Remember, we are speaking about the C programming language, so expecting something as fancy as a stacktrace is out of the question.
#2: The out-of-nowhere crash - 2002, C++
A C++ Win32 app. I am initiating a shotgun surgery that takes a few days to complete. Back then I was unaware of refactoring/unit testing (the whole company relied on manual testing) so instead of taking baby steps, I did a massive cross-code revolution.
Having battled numerous compilation and linking errors I am finally in a position where I can run the code. The thing immediately crashes. It does not even show the window. Again, the tools that we used didn't support stacktracing. I had no idea where to start.
For those unfamiliar with the Win32 technology, here's some background. In a Win32 app all GUI events of a window arrive at a single callback function which takes four parameters: hwnd - a handle to the originating widget, msg - an int specifying event type, wparam & lparam - two ints providing event specific data.
Typically, the body of such callback functions was a long switch (on the msg parameter) with a case for each of the events that the program was interested in. In this particular program the message-handling switch block was particularly long. The GUI was quite complicated and there were numerous events (~ 200) that had to be listened to. The callback function was more than a thousand lines long.
First, I tried to apply reasoning. I made educated guesses regarding which events were likely to be the ones causing the crash. After several hours of unsuccessful guesses I switched to a more brutal approach: I commented out the whole switch block. This made the crash disappear but eradicated every piece of functionality that the program had. Then I uncommented half of the cases inside this switch block. The crash didn't appear and some functionality came back. This meant that the crash was due to the code that was still commented out.
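As a rough illustration of this halving strategy (not code from that project), here is a minimal Java sketch. The class `Bisect`, the method `findCulprit` and the `crashesWith` predicate are all hypothetical names; the predicate stands in for "re-enable only these cases, rebuild, run, and see whether it crashes". The sketch assumes exactly one case is responsible and that the crash reproduces reliably.

```java
import java.util.List;
import java.util.function.Predicate;

final class Bisect {
    /**
     * Returns the index of the single culprit in items.
     * Assumption: exactly one item causes the crash, and
     * crashesWith.test(subset) is true iff subset contains it.
     */
    static <T> int findCulprit(List<T> items, Predicate<List<T>> crashesWith) {
        int lo = 0;
        int hi = items.size();          // the culprit is always in [lo, hi)
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            // Enable only the first half of the remaining candidates.
            if (crashesWith.test(items.subList(lo, mid))) {
                hi = mid;               // crash seen: culprit is in the first half
            } else {
                lo = mid;               // no crash: culprit is in the second half
            }
        }
        return lo;
    }
}
```

With about 200 cases, each run halves the candidates, so roughly eight runs are enough to isolate one case.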
I continued the comment/uncomment game using a binary-search strategy. Quite quickly I zeroed in on the problematic message. I placed a breakpoint and started stepping through/into the instructions. This particular case invoked code in other functions. One of them looked like this:
    bool b = false;
    if(...) {
        // many lines
        b = true;
    }
I started debugging this code. When I stepped over the b = true statement the program crashed. This puzzled me. b is a local variable. It is stack allocated. How can an assignment to it fail?
#1: The memory monster - 2004, C#
I joined a small team working on a C# GUI app that was due to be released soon. We had a customer already using an early access version of the product in return for doing beta testing. The #1 item on our todo list was a report from this customer saying that the program becomes non-responsive after running for several hours. This is a serious defect, a real show stopper. As you can imagine, we never managed to reproduce the problem on our machines.
The release date got nearer and we still had no clue regarding the cause of this mysterious defect. As we had nothing better to do, we kept working on other items from our todo list, which was quite pathetic as we knew we would not be able to release the software with this defect.
At some point I decided to start fresh. I made the assumption that the defect was some sort of a leak.
Side note: Programmers often believe that in a garbage-collected environment memory leaks cannot occur. That's not true. A garbage collector (GC) will find all unreachable objects and will reclaim as many of them as possible. This does not mean that it will reclaim all unreachable objects. Many GC algorithms leave some of the garbage floating around for the next collection cycle. Moreover, a GC will consider something as garbage only if it is no longer reachable from your code. Thus, if your program maintains references to objects that are no longer needed, these objects will be considered, by the GC, as non-garbage. This will turn the program into a memory-consuming monster.
Such a leak often happens if you have some (software) cache in your code. The cache will keep references to objects - thereby preventing them from being collected - even if the application code no longer references them. Thus, if you implement a cache you must always implement some cleanup strategy.
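To make the "cleanup strategy" point concrete, here is a minimal sketch in Java (chosen only for illustration; the app in this story was C#). `BoundedCache` is a hypothetical class, and least-recently-used eviction is just one possible strategy. It relies on the standard `LinkedHashMap.removeEldestEntry` hook, which `put` consults after each insertion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache: drops the least recently used entry once full. */
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true: iteration order goes from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Returning true removes the eldest entry, so the cache stops
        // referencing it and the GC is free to reclaim the value.
        return size() > maxEntries;
    }
}
```

A cache created with `new BoundedCache<String, byte[]>(1000)` never holds more than 1000 entries, no matter how long the program runs. Without such a bound, every entry ever inserted stays reachable and therefore is never collected.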
I left the program running on my machine over the weekend hoping it would help me spot the leak. Sadly, when I came back to check on it, it was running smoothly. Disappointed, I sat down with the customer's contact person trying to understand how the program was being used. This conversation made me realize that the #1 thing that they (beta users) are doing much more than us (developers) is - wait for it - scrolling.
Ctrl+Alt+Delete -> Task Manager. I fired up the app, opened a data file, grabbed the scrollbar knob and started dragging it up and down. Looking at the Task Manager window I could see the Mem Usage value climbing. Slowly, but steadily. After a few minutes memory usage exceeded the main memory, the operating system started swapping and the program practically halted. This was awesome. I managed to reproduce the bug.
I opened the code that handled scrolling events (this was a custom widget with a custom data model that we developed). My eyes zeroed in on this loop:
    for(int i = 0; i < rows.Length; ++i)
        if(rows[i].isOutOfView)
            rows.Remove(i);
Got it? Great.
Otherwise, wait for the next post...
(To be concluded)