Code search in action

Three common cases where proper search will help

Programming is hard -- always has been, always will be. No matter how experienced you are, or what kind of tools you use, there are always problems that make you gnash your teeth and think about getting an MBA or becoming a manager.

Modern psychology says we shouldn't dwell on negatives, but I'm going to intentionally bring up three gnarly problems that we've all had to deal with. Why? It's not that I'm masochistic. I think there's a search-oriented approach that will reduce the pain, without any medical side effects.

So what are these three problems? Leading off, we've got:

I already fixed this bug once

Don't you just hate that? You fixed the bug, and then it pops up again. Same mistake, same fix, but in a different area of your code.

Or even worse, you fixed it, and there it is, still alive and creating problems in another branch. Heck, sometimes it's not even a branch -- Joe Schmoe decided that it would be a good idea to fork your code for some other project, and you only find out about it when somebody figures out you wrote the original and assigns you the bug report.

But often, just like in cheesy horror movies, there's foreshadowing. The premonition that something bad is going to happen. As you fix a bug, there's the little red light blinking in the back of your head, trying to tell you that this bug exists elsewhere. Could be in your code, could be in somebody else's in your group, or in a pile of code you don't even know exists.

You can ignore the light. Stand up, grab a cup of coffee, check e-mail, start an IM chat ... and eventually the feeling goes away. Heck, maybe the CTO will decide to rewrite everything in Ruby before the bug bites back. Kind of like ignoring the odd noise coming from the front-right fender. Most programmers are guys, and guys by definition are masters at ignoring chores.

And spending some extra time to figure out where else the bug might exist is a chore. No question about it. Then, if you find the bug, you have to fix it, or tell somebody else that it needs to be fixed. Who wants to make extra work?

But taking that extra step, going the extra mile, giving 108% -- that's what separates real programmers from code monkeys. And it's a great way to build up some good karma points, which always come in handy when you break the build. Plus you can redeem them for valuable tchotchkes like bouncy balls with flashing lights inside, like the one I got at ApacheCon from Iona.

So now what, you ask? Well, if propagating bug fixes is going to suck, let's make it suck less. The easiest way to do this is to find clones -- exact or almost exact copies of the file that you just modified. Level Zero is exact matches, which works for pure clones. Krugle provides a step up from this (call it Level 1), by removing comments and stripping out formatting before calculating an MD5 hash, so at least changing tabs into spaces isn't going to make it look like a different file.
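
To make the comment-stripping hash idea concrete, here's a minimal sketch. It is not Krugle's actual implementation; the function names are made up for illustration, and it assumes C-style comments (`/* ... */` and `// ...`). It will also get confused by comment markers inside string literals, so treat it as a starting point.

```python
import hashlib
import re
from typing import Dict, List

# Assumption: the code uses C-style comments. Other languages need other patterns.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")
_WHITESPACE = re.compile(r"\s+")


def normalized_fingerprint(source: str) -> str:
    """Return the MD5 hex digest of the source with comments and whitespace removed."""
    text = _BLOCK_COMMENT.sub(" ", source)
    text = _LINE_COMMENT.sub(" ", text)
    text = _WHITESPACE.sub("", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_clones(files: Dict[str, str]) -> List[List[str]]:
    """Group file names whose normalized fingerprints are identical."""
    groups: Dict[str, List[str]] = {}
    for name, source in files.items():
        groups.setdefault(normalized_fingerprint(source), []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Pass `find_clones` a mapping of file name to file contents and it returns the groups of files that hash the same once comments and formatting are out of the picture. Change a single token and the match is gone, which is exactly the limitation the next level tries to address.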

Level 2 would be to use something like a winnowing algorithm to match files that had some minor edits, things like a few new or modified lines. We're not there yet, but getting closer. Level 3, a.k.a. Bruce Lee on steroids, would be to match up code at a function level, so that if Joe was really being a bad boy and copied code at the function level, you'd still be able to find it. Interesting techniques for doing this, but they're all pretty darn theoretical. And they wind up being computationally expensive. Or in other words, they're really, really slow.
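
For the curious, here's roughly what a winnowing-style match looks like. This is a simplified sketch under my own assumptions (5-character grams, a window of 4, CRC32 as the hash, Jaccard overlap as the score), not a tuned or production version. All names and parameter values are illustrative.

```python
import re
import zlib
from typing import List, Set


def kgram_hashes(text: str, k: int) -> List[int]:
    """Hash every k-character substring of the whitespace-free, lowercased text."""
    normalized = re.sub(r"\s+", "", text).lower()
    return [zlib.crc32(normalized[i:i + k].encode("utf-8"))
            for i in range(len(normalized) - k + 1)]


def winnow(text: str, k: int = 5, window: int = 4) -> Set[int]:
    """Keep the smallest hash from each sliding window of k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[start:start + window])
            for start in range(len(hashes) - window + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two fingerprint sets, from 0.0 to 1.0."""
    fa, fb = winnow(a), winnow(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because each window contributes its minimum hash, any stretch of text that two files share and that is at least `window + k - 1` characters long is guaranteed to produce a common fingerprint. That's what lets a file with a few edited lines still score as a near match.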

Back to reality. What if it's not a clone? How do you find likely suspects for bug replication? Bugs come in many flavors, so here's a short list of techniques for some of the cases I've run into:

1. Assuming the bug involved the wanton misuse of an API, you could search for other places where that same function was called. If you're lucky, and the API uses named constants, then you might be able to easily find examples of the same misuse. Or you want to quickly eyeball all cases where strcat is called, since your company previously instituted a search-and-destroy approach to unbounded copy APIs like this one, but a few have snuck back in.

2. A depressingly common bug comes from missing cases in a switch statement. Bob adds a new value to an enum but forgets to update every place where there's a switch statement using that enum to trigger actions. Bob should have read the section below ("Who Am I Going to Break?"), but now it's up to you to clean up his mess. Here you can get a quick overview of likely problems by searching for +older-enum-name -new-enum-name, so that you find places in the code with one and not the other.

3. Often just plain old +/- type searches are all you've got to work with. For example, I fixed a bug where a servlet was returning content without an explicit "UTF-8" charset specified. In this situation I'd search for calls to setContentType that include "text/xml" but not "utf-8". That generates a list of likely hot spots for the same bug.
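
If you don't have a code search engine handy, the +/- idea from the list above is easy to approximate. The sketch below assumes the files are already loaded into memory and does plain case-insensitive substring matching, where a real engine would tokenize the code. The function name and the sample file names are hypothetical.

```python
from typing import Dict, Iterable, List


def plus_minus_search(files: Dict[str, str],
                      required: Iterable[str],
                      excluded: Iterable[str]) -> List[str]:
    """Return names of files that contain every required term and no excluded term."""
    required = [term.lower() for term in required]
    excluded = [term.lower() for term in excluded]
    hits = []
    for name, source in files.items():
        lowered = source.lower()
        has_all = all(term in lowered for term in required)
        has_none = not any(term in lowered for term in excluded)
        if has_all and has_none:
            hits.append(name)
    return sorted(hits)


# Hypothetical servlet snippets, standing in for a real code base.
servlets = {
    "FeedServlet.java": 'response.setContentType("text/xml");',
    "PageServlet.java": 'response.setContentType("text/xml; charset=UTF-8");',
    "ImageServlet.java": 'response.setContentType("image/png");',
}
suspects = plus_minus_search(servlets, ["setContentType", "text/xml"], ["utf-8"])
```

Here `suspects` comes back with only FeedServlet.java, the one file that sets an XML content type without naming a charset. The same function handles the enum case: require the older enum name and exclude the new one.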

Who am I going to break?

Not break as in "cause physical pain", but whose code just might not be the same after I make this tiny tweak?

We're talking about finding the inbound links here. Who depends on me? With this brave new world of loosely coupled components, you've got no idea who might be calling your service. Zip. Nada. And if the code you write is good, then more people will be using it, and you're more likely to mess things up when you change something. Congratulations.

The only way to get this nailed down is to have all the code. And I mean all the code. So you can grep it on your hard disk. Great, let's get this party started. First find every active branch from every project in every group in your company. Don't forget that separate SVN-based repo hiding off in the corner. It's being used by the programmers from the company you just ate, I mean acquired, and they'll stop using it when you pry their cold, dead fingers off the keyboard.

And to get access to all this code, you'll need to defeat the IT department's necromonger. Then you can have the information you need to install the SCM clients, configure them, and start downloading. Might want to get this running before your vacation break. When you get back, fire up grep on the megabytes or gigabytes of code, then take another vacation. You'll want to be well rested before you start working your way through the search results. Most of them will be bogus, since you don't have the advanced degree in regex that you need to limit the search to only calls to function foo that occur in files that import package bar but don't import package whatever-comes-after-bar.

Hmm, I wonder why most programmers don't spend a lot of time doing this type of dependency analysis before they make a change? I guess they're just lazy and unmotivated.

But if you make it easy enough, even slobs will try at least one search. Especially if it's fast. And the results wind up saving their bacon a few times.

So what's needed here is a way of finding other stuff that uses my stuff -- a way that's fast, easy, accurate and complete. Sounds like a job for Krugle Man, I mean specialized code search.

Building on something I mentioned in the first use case above, if you extend an enum (whether it's a real enum type, or a list of values), then you'd want to check places where the code is checking for the different values and thus might need to be extended when you extend the set of possible enum values.

For example, Lucene's field can include term vector information. Right now, there are five possible values (no, yes, with positions, with offsets, with positions and offsets). Let's say a new option was to be added, maybe a term vector without frequency information -- I'm just making this up as I go, so ignore whether this would actually be useful or not.

The question then becomes who might break because of this change. Doing a search for +WITH_POSITIONS_OFFSETS +WITH_OFFSETS -project:"Lucene Search Engine" gives me 20 hits in publicly available source code, all of which are good candidates for a code review before I go ahead and make this change.

Why doesn't this lame routine work?

Yes, the documentation says that if you pass true for the 'alwaysUnlock' parameter, it will always unlock. Except that it isn't. Unlocking. And that's just not right. You think about maybe passing false, just in case it's a total swing and a miss. But now you're in thrash mode. Might as well let a monkey type in some code for you.

Stop. Take a deep breath. Now try searching for other code in your company that calls the same routine. It's more than likely that somebody has figured out the secret incantation required to get it to work. So draft behind them, and get some extra mileage out of their hard work.

For example, in Lucene you use an IndexReader (somewhat confusingly) to delete documents from an index. So you diligently call the IndexReader's delete() method, but your document is still there. I know, let's call it twice, to show Lucene that we really, really want to get rid of it. Same result.

Since Lucene is open source, and there's a good amount of info available on the Web, you could use general search to find information about modifying an index. Or you could buy the excellent, though slightly out-of-date (let's go, Erik and Otis -- time for an update) book Lucene In Action.

For code inside your company, it's kind of unlikely that there's been a book written about it. And most programmers are too busy planning how to kill the product manager to write good documentation, even if you did have a search engine running inside your company so you could search for this non-existent information.

Which means you're kind of stuck, except you, being an enlightened programmer, can do a search on code that uses the IndexReader class and calls the delete() method. With a few quick queries, you find Bob's code that makes use of IndexReader to update the index, and you see that he religiously calls the close() method after doing a delete(). You add that line in your code, between the delete and the code that verifies the document is gone, and things now work as expected.
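
You can even automate a bit of the "what does Bob do that I don't?" step. The sketch below is my own illustration, not a feature of any particular search tool. It scans files that mention a class, finds each call to a given method, and tallies which other method calls show up in the next few lines. The regex matching is crude, and the names and the five-line lookahead are assumptions.

```python
import re
from collections import Counter
from typing import Dict

# Matches ".name(" and captures the method name. Crude, but fine for a first look.
_METHOD_CALL = re.compile(r"\.(\w+)\s*\(")


def calls_following(files: Dict[str, str], class_name: str, method: str,
                    lookahead: int = 5) -> Counter:
    """Count method calls seen within `lookahead` lines after each call to `method`."""
    counts: Counter = Counter()
    marker = re.compile(r"\." + re.escape(method) + r"\s*\(")
    for source in files.values():
        if class_name not in source:
            continue  # only look at files that mention the class at all
        lines = source.splitlines()
        for i, line in enumerate(lines):
            if marker.search(line):
                following = "\n".join(lines[i + 1:i + 1 + lookahead])
                # Count each follow-up method once per call site.
                counts.update(set(_METHOD_CALL.findall(following)))
    return counts
```

Run `calls_following(files, "IndexReader", "delete")` over your company's code, and if close() sits at the top of the tally, that's a strong hint about the step you're missing.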


The three examples above are common use cases that we see both internally and when talking to customers. By properly using search, you can spend less time being frustrated and more time being a happy, productive member of society.

This story, "Code search in action" was originally published by LinuxWorld-(US).

Copyright © 2007 IDG Communications, Inc.
