# Bayesian Logic And Filters


Some say that if you can't measure something, you're not doing science. Bayesian logic offers a way to put numbers on things that previously seemed unmeasurable: it starts with an initial estimate of how likely something is and revises that estimate as evidence arrives. That lets us test hypotheses and predictions and thereby refine our conclusions and decisions. Today, Bayesian filtering is a hot topic in spam control.


Basic probability is simple to calculate, because you're dealing with a limited number of factors and possibilities. Let's consider a horse race with 10 horses entered. If that's the only information we have on which to base a wager, then we could pick any horse on the basis that its chance of winning is 1 in 10, or 0.10. Take that kind of math to the track, however, and you'll quickly be separated from the contents of your wallet. The real world is far more complicated, and here's where Bayesian logic comes into the picture.

In fact, each of the 10 horses has already run at least a few races and therefore has a history. If Lightning has won every race he has entered, and Thunder has lost every one he has entered, then we've got a real evidential basis on which to bet on Lightning instead of on Thunder.

And there's a lot more information available about every horse in the race. We know or can easily find out the following:

Lineage: Is this horse the offspring of a champion? How have his brothers and sisters performed?

Performance under different weather conditions: If it rains in the morning and the track is soft, how does that affect his speed?

Position on the track: Is our horse next to the rail or on the outside? And how does the horse react when he's in that position?

Length of time since last race: If the horse ran a long, hard race yesterday, how well is he likely to run today?

Distance of today's race: How has the horse fared at this distance in the past?

Other people's betting patterns also come into play. They don't affect how well a horse will perform, but they have a clear impact on the size of the payoff if he does win.

All of this information can help us make a better estimate of our horse's chance of winning than the simplistic 1 out of 10. Starting from that initial estimate and revising it as each new piece of evidence comes in is a Bayesian process.
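To make the idea concrete, here is a minimal sketch of a single Bayesian update for one horse. The starting point is the article's 1-in-10 chance; the two likelihood figures (how often the track was soft in races Lightning won versus races he lost) are invented purely for illustration.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(hypothesis | evidence) using Bayes' theorem."""
    numerator = p_evidence_if_true * prior
    # Total probability of seeing the evidence, whether or not the hypothesis holds.
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence

# Prior: with 10 horses and no other information, Lightning wins 1 time in 10.
prior = 0.10

# Hypothetical figures, not real racing data: the track was soft in 60% of the
# races Lightning won, but in only 20% of the races he lost.
posterior = bayes_update(prior, p_evidence_if_true=0.60, p_evidence_if_false=0.20)

print(round(posterior, 2))  # 0.25
```

Under these made-up numbers, learning that the track is soft lifts the estimate from 0.10 to 0.25. Each further factor, such as lineage or post position, could be folded in the same way, with the previous result serving as the new prior.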

Similar things are happening in the world of Major League Baseball -- ever the province of voluminous statistical records. Team owners and general managers are using Bayesian analysis when they study the way players perform under various conditions and in specific situations and factor that information into their decisions about the players they want to draft or seek in trades.

## Bayesian Antispam

The application of Bayesian logic to the spam problem was popularized by Paul Graham's 2002 paper "A Plan for Spam" (www.paulgraham.com/spam.html), and the approach was soon adopted by numerous developers. Bayesian spam filtering is based on the notion that certain words tend to indicate spam, while other words tend to mark a message as legitimate. It shares that premise with other content-based scoring filters, but with the added advantage that a Bayesian filter builds its own list of telltale words and characteristics rather than working from a list created manually.

A Bayesian filter starts by examining one set of e-mails known to be spam and another set known to be legitimate (the prior knowledge). It compares the contents for both sets -- not just the message body, but also header information and metadata, word pairs and phrases, and even HTML code for information such as the use of specific colors. From this, it builds a database of words, or tokens, with which it can usefully identify future e-mails as spam or not.
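The training step might look something like the following sketch. The `subject:` prefix, the word-pair tokens and the four sample messages are illustrative assumptions, not a description of any particular product; real filters tokenize far more of the message, including other headers and HTML.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9$']+")

def extract_tokens(subject, body):
    """Return the set of tokens found in one message.

    Subject words get a 'subject:' prefix (a hypothetical convention) so that
    'free' in a subject line is tracked separately from 'free' in the body.
    Adjacent body words are also recorded as word pairs.
    """
    subject_words = WORD.findall(subject.lower())
    body_words = WORD.findall(body.lower())
    tokens = {f"subject:{w}" for w in subject_words}
    tokens.update(body_words)
    tokens.update(f"{a} {b}" for a, b in zip(body_words, body_words[1:]))
    return tokens

def build_database(spam, ham):
    """Count, per token, how many spam and how many legitimate messages contain it."""
    db = {"spam": Counter(), "ham": Counter()}
    for label, messages in (("spam", spam), ("ham", ham)):
        for subject, body in messages:
            db[label].update(extract_tokens(subject, body))
    return db

# Tiny made-up training sets: (subject, body) pairs already sorted by a person.
spam = [("Free offer", "click here now"), ("You won", "click here to claim")]
ham = [("Lunch", "are you free at noon"), ("Agenda", "notes from the meeting")]

db = build_database(spam, ham)
print(db["spam"]["click here"], db["ham"]["click here"])  # 2 0
print(db["spam"]["subject:free"], db["ham"]["free"])      # 1 1
```

Even in this toy example the pair "click here" shows up only in spam, while "free" appears on both sides depending on where it occurs, which is the kind of distinction the database is meant to capture.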

Bayesian filters take into account the whole context of a message. For example, many spam messages contain the word "free" in the subject line, but so do some legitimate messages. A Bayesian filter notes this word but also weighs the other tokens in the message before deciding, because falsely identifying a real message as spam (called a false positive) causes more problems than letting some spam through as legitimate.
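A rough sketch of how several tokens can be weighed together follows. It uses a common naive Bayes-style combining rule; the per-token probabilities, the neutral 0.5 value for unknown words and the 0.9 cutoff are all assumptions chosen for illustration, with the high cutoff reflecting the preference for avoiding false positives.

```python
import math

# Hypothetical per-token spam probabilities, as a trained filter might store them.
TOKEN_SPAM_PROB = {
    "free": 0.60,
    "winner": 0.95,
    "click": 0.90,
    "meeting": 0.10,
    "agenda": 0.05,
}
UNKNOWN = 0.5         # assumed neutral score for words the filter has never seen
SPAM_THRESHOLD = 0.9  # assumed cutoff, set high to keep false positives rare

def spam_score(text):
    """Combine the per-token probabilities into one score between 0 and 1."""
    probs = [TOKEN_SPAM_PROB.get(word, UNKNOWN) for word in text.lower().split()]
    spam_product = math.prod(probs)
    ham_product = math.prod(1.0 - p for p in probs)
    return spam_product / (spam_product + ham_product)

legit = spam_score("free for the meeting agenda")
junk = spam_score("free prize winner click")

print(legit < SPAM_THRESHOLD, junk > SPAM_THRESHOLD)  # True True
```

Both messages contain "free," yet the first scores under 0.01 and the second above 0.99, because the surrounding tokens pull the result in opposite directions.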

According to proponents, fewer than 1% of the messages identified as spam by Bayesian filters are false positives.

The Bayesian spam filter's real power, however, lies in its ability to learn: As the user tags new messages, the filter updates its database to identify new patterns of spam.
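The learning loop can be sketched in a few lines. The class name, the add-one smoothing and the sample messages below are illustrative assumptions rather than the workings of any specific filter.

```python
from collections import Counter

class LearningFilter:
    """Keeps per-token message counts that grow as the user tags mail."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Record one message that the user has tagged as spam or legitimate."""
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_spam_probability(self, token):
        """Estimate how strongly a token points to spam (0.5 means no evidence).

        Add-one smoothing (an assumption here) keeps unseen tokens neutral and
        assumes spam and legitimate mail are equally likely beforehand.
        """
        p_spam = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

f = LearningFilter()
before = f.token_spam_probability("crypto")

# The user tags two new messages as spam; both mention a word not seen before.
f.learn("crypto giveaway today", is_spam=True)
f.learn("double your crypto fast", is_spam=True)
after = f.token_spam_probability("crypto")

print(before, round(after, 2))  # 0.5 0.6
```

No rules were rewritten by hand here: two tagged messages were enough to start shifting the filter's view of a word it had never encountered.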

Kay is a Computerworld contributing writer in Worcester, Mass. Reach him at russkay@charter.net.
