Dmitri Alperovitch talks about reputation-based spam protection

ITworld.com –

With about 90 percent of all emails today being spam, it's hard for even the best anti-spam program to keep up. Worse, spammers are constantly developing new techniques, such as image-based spam, to sidestep the filters. What if you could determine ahead of time the intentions of everyone who sends you an email? Wouldn't it be wonderful to know, without a doubt, who the bad guys are? What if there were a central authority that knew the reputations of everyone who has ever sent an email? As it turns out, you don't have to be a mind-reader. Reputation-based security is very similar to what the financial services industry has created with credit agencies. Every person who has ever paid a bill or used a credit card has a credit score, or a credit reputation, if you will. When you want to buy a new car, the finance company looks at your reputation, and then decides whether or not to let you past the gates and give you some money. We now have the same thing for people who send emails.

Dmitri Alperovitch, Chief Research Scientist at Secure Computing and developer of reputation-based security, talks about the evolution of spam, the next big thing in spam prevention, and how to identify the culprits before they bombard your email server.

Where did reputation-based security come from?

It was an invention of CipherTrust, and since then, a variety of other companies have applied it to the email security area as well. When we were working on spam detection at CipherTrust, which was bought by Secure Computing, we realized early on that it made sense to aggregate information on a global level, to collect data from all of the customers that we had deployed at that time, and to apply behavioral techniques not on an individual box, but in the cloud, where you have a much broader view into email traffic.

You're the point man on reputation system development. What led to your development efforts in this direction? Was there an "aha" moment when you knew that you had to create a reputation system?

Spam was starting to take off in the early 2000s and we were developing antispam technology fairly early on. We realized that a lot of this analysis that we were trying to put on the spam gateway itself would really work much better if we had a view into more traffic. The only way to do that is to put it in the cloud, and have all of these devices talk to the cloud, report what they are seeing, and get information back from the cloud. So that was the "eureka" moment that we had, where we said, "hey, let's try to get as much data as possible." And the only way to do that is through this centralized authority. It's very much akin to a credit agency. If you're a store and someone comes in to apply for credit, you can look at the local history that you may have on that person, but that can only be so effective if this is a new customer or a customer that's purchased only one or two things from you. But if you aggregate together with all the stores in the nation, which is what credit agencies do, you can build a much more accurate profile. That was the approach we took with this system.

And at that point nobody else had ever done that yet.

Exactly right. The blacklists were out there but they were not really doing the analysis globally; they were distributing the information globally, and that was the difference.

How does reputation technology differ from a real-time blacklist?

It differs in a couple of ways. On the most basic level, a blacklist is just that: a list of malicious hosts, so it has no view into legitimate traffic. Blacklists usually have a pretty high false positive rate. I'll give you an example. Hotmail is one of the top spam senders out there. They send out a lot of mail and a decent percentage of it is malicious. Spammers relay frequently through Hotmail accounts that they're able to register automatically. Of course you don't want to block Hotmail, because you know that there's a lot of legitimate content that is originating from it. Without manual intervention, a blacklist would have no way of knowing that Hotmail is legitimate. A reputation system, through the analysis of both legitimate and malicious traffic, would know that, and would know that it needs to assign a neutral reputation to a host like Hotmail. The second difference is how a blacklist is generated. Most of the blacklists out there work in a very simple fashion: people get spam, they submit it manually to a blacklist operator, and the operator puts the sender of that particular spam message on the list. The reaction time is fairly slow, so it's not real-time analysis of the traffic. Most blacklists also suffer from the problem of delisting. Once you list a spam sender on the blacklist, how do you know when to delist them? Again, you have no view into the sender's legitimate traffic, so you don't know when they actually stop sending the spam. If it's a compromised machine, maybe it's already been cleaned up and it's now sending legitimate mail. A reputation system would know that, because it has seen that traffic. A blacklist has no concept of that.
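The contrast described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SenderHistory record, the function names, and the 0.9 and 0.1 thresholds are invented for the example and are not taken from any real blacklist or from the CipherTrust system.

```python
from dataclasses import dataclass


@dataclass
class SenderHistory:
    """Hypothetical record: message counts observed from one sender."""
    legitimate: int = 0
    malicious: int = 0


def blacklist_verdict(sender: str, blacklist: set) -> str:
    # A blacklist only knows "listed" or "not listed"; it sees no good traffic.
    return "block" if sender in blacklist else "allow"


def reputation_verdict(history: SenderHistory,
                       block_above: float = 0.9,
                       trust_below: float = 0.1) -> str:
    # Assumed thresholds on the fraction of observed traffic that was malicious.
    total = history.legitimate + history.malicious
    if total == 0:
        return "unknown"
    bad_fraction = history.malicious / total
    if bad_fraction >= block_above:
        return "block"
    if bad_fraction <= trust_below:
        return "trusted"
    # Mixed senders (for example a large webmail provider) stay neutral.
    return "neutral"


# A big webmail provider: plenty of spam, but mostly legitimate mail.
webmail = SenderHistory(legitimate=700, malicious=300)
# A machine that only ever sends spam.
bot = SenderHistory(legitimate=0, malicious=50)
```

With these made-up numbers, a blacklist that had the webmail provider on its list would block it outright, while the reputation check returns "neutral" for the provider and "block" for the bot. A neutral verdict would then leave the decision to other filters.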

What were some of the biggest challenges in the early stages of development that you came across in creating the reputation system?

The challenges were really scale and real-time analysis. You're processing billions and billions of messages daily, responding to those queries in real time and doing all of this very intensive analysis on them. Some of the problems that we've had to solve are how to do that in a redundant fashion, so that we had 100% uptime for the system, which I'm happy to say that we've had, and how to store all this data. A lot of the storage providers out there are very happy to have us as customers because we do spend quite a bit of time and money on lots and lots of storage, and on software to analyze all those records in real time.

Is reputation-based security the next big thing in spam prevention?

I think it is the next big thing in security in general. If you look at how spam detection has evolved, reputation technology is certainly one of the breakthroughs in that field, in that it allows you to very quickly and effectively identify most of the spam outbreaks due to what is known as the network effect. This is the ability to not just deploy technology on individual endpoints, but to aggregate that information on a global level, have a view into billions and billions of transactions that are happening on the Internet on a real-time basis, analyze those transactions and develop a reputation for all of the mail senders that are out in the wild. You can also apply the same principles to other types of network traffic. We will be announcing a new launch of the Sidewinder firewall next month that will include it on a network level. So we will now be assigning reputations to all of the machines out on the Internet that are trying to connect to our customers, or that our customers are trying to connect to, on a variety of different protocols.

How granular does it get? For example, is there a database floating around out there somewhere that says, if you get an email from Dan Blacharski it's okay to look at?

Yes. Initially, when we developed this, it was based on IP addresses that sent mail. Since then we've expanded it to a variety of different identities. So for example we are now assigning it based on the message content itself, which has a reputation. We're doing it on email addresses, so your email address will have a certain reputation. We're doing it on domains, we're doing it on URLs in the context of web security, and we're doing it on IP addresses in combination with protocols for the network security we're doing at the firewall level. So an IP address may have one reputation as it's trying to send mail, it may have another reputation as it's trying to host a web site, and it may have a totally different reputation as it's trying to run a DNS server or an FTP server. The context in which the reputation has been calculated now matters a great deal. For example, a malicious host may be part of a botnet that's used for spamming but nothing else. And if you are the owner of that particular machine that's been compromised, you still want to go to web sites, and you still want to be allowed to go to Google or Amazon, even though you have this malware present on your machine that's doing these nefarious things. So we want to make sure we block those bad things that are emanating from your machine, but still allow you to do other legitimate tasks out on the Internet.
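One way to picture context-dependent reputation is a table keyed by both the identity and the context in which it is acting. The Python sketch below is a minimal illustration, assuming a hypothetical ReputationTable class, a risk scale from 0.0 (good) to 1.0 (bad), a neutral default of 0.5, and a blocking threshold of 0.8; none of these details come from the actual product.

```python
from enum import Enum


class Context(Enum):
    MAIL = "mail"
    WEB = "web"
    DNS = "dns"
    FTP = "ftp"


class ReputationTable:
    """Hypothetical in-memory table: one risk score per (identity, context)."""

    def __init__(self, default_risk: float = 0.5):
        self._scores = {}
        self._default = default_risk  # assumed neutral score for unknown pairs

    def set_risk(self, identity: str, context: Context, risk: float) -> None:
        self._scores[(identity, context)] = risk

    def risk(self, identity: str, context: Context) -> float:
        return self._scores.get((identity, context), self._default)


def allow(table: ReputationTable, identity: str, context: Context,
          max_risk: float = 0.8) -> bool:
    # Block only the activity whose own context-specific score is too risky.
    return table.risk(identity, context) < max_risk


table = ReputationTable()
# A compromised home machine: it sends spam, but its web traffic is ordinary.
table.set_risk("192.0.2.10", Context.MAIL, 0.95)
table.set_risk("192.0.2.10", Context.WEB, 0.10)
```

Here allow(table, "192.0.2.10", Context.MAIL) is False while allow(table, "192.0.2.10", Context.WEB) is True, which mirrors the botnet example: the spam is stopped, but the owner can still browse.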

How does the reputation system go about determining the reputation of so many millions of different senders and entities?

It's really based on real-time analysis of the traffic. Any time one of our devices receives a connection or generates a connection to an Internet host, it sends a query to the TrustedSource database. This database is distributed around the world. We have eight data centers around the world that are hosting the service and are synchronized with each other, so when you query one you'll get the same answer as when you query any other. All these queries essentially tell the system the activity of various hosts that are out on the Internet. So for example, if you send an email to me, a query will be performed on TrustedSource, and TrustedSource will immediately know that you are a mail sender. It has access to the historical database, going back to the beginning of the system, of how you send mail, who you send mail to, and whether you send other types of network traffic. Based on that historical data and the real-time information it's getting, it is essentially calculating a risk score, a profile for you of whether we can expect malicious or legitimate content to originate from you.
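The interview does not say how TrustedSource combines history with real-time data, so the following Python sketch is purely an assumed illustration. It blends a sender's long-run malicious rate with the rate over its most recent observations; the ReputationService class, the 0.7 weighting, the window size, and the neutral 0.5 score for unknown senders are all hypothetical choices.

```python
from collections import defaultdict


class ReputationService:
    """Toy stand-in for a central reputation database (assumed design)."""

    def __init__(self, recent_weight: float = 0.7, window: int = 100):
        self.recent_weight = recent_weight  # assumed weight on recent behavior
        self.window = window                # how many observations count as recent
        # Per-sender observations: True means the message looked malicious.
        self._observations = defaultdict(list)

    def report(self, sender: str, malicious: bool) -> None:
        # Each gateway query doubles as a report of what that gateway saw.
        self._observations[sender].append(malicious)

    def risk_score(self, sender: str) -> float:
        seen = self._observations.get(sender)
        if not seen:
            return 0.5  # no history: assumed neutral score
        recent = seen[-self.window:]
        historical_rate = sum(seen) / len(seen)
        recent_rate = sum(recent) / len(recent)
        return (self.recent_weight * recent_rate
                + (1 - self.recent_weight) * historical_rate)


service = ReputationService(recent_weight=0.7, window=10)
# A long-standing legitimate sender that has just started sending spam.
for _ in range(90):
    service.report("203.0.113.5", malicious=False)
for _ in range(10):
    service.report("203.0.113.5", malicious=True)
score = service.risk_score("203.0.113.5")
```

In this example the long-run malicious rate is only 10 percent, but the last ten messages were all bad, so the blended score rises to roughly 0.73. Weighting recent traffic this way is one simple means of reacting quickly to a newly compromised machine while still remembering its clean history.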

So you have a huge database and collection points all around the world. Are some entities still slipping through the cracks?

Absolutely. No system is 100 percent effective, so certainly it will not prevent all spam from coming through, but we do pride ourselves on extremely high effectiveness levels. Our average effectiveness across the customer base is 99.8 or 99.9 percent, so very, very little gets through. And one of the advantages of the system is not just the high level of effectiveness that it can provide, but also the fact that you can reject a lot of this content at the connection level, so you can save the resources of your email gateway by rejecting those connections without having to accept the mail.
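Connection-level rejection can be sketched as a decision made when a client first connects, before any message data is transferred. In the Python sketch below, the connection_banner function, the lookup_risk callback, the 0.9 cutoff, and the host name mail.example.com are assumptions for illustration. The reply codes are standard SMTP: 220 opens a session, and 554 may be sent in place of the greeting to refuse one.

```python
from typing import Callable


def connection_banner(client_ip: str,
                      lookup_risk: Callable[[str], float],
                      reject_at: float = 0.9) -> str:
    """Return the first SMTP reply line for a newly connected client.

    lookup_risk is a hypothetical callback that returns a risk score
    between 0.0 (good) and 1.0 (bad) for the connecting IP address.
    """
    risk = lookup_risk(client_ip)
    if risk >= reject_at:
        # Refuse at connect time: no message body is ever received or scanned.
        return "554 Connection refused due to poor sender reputation"
    return "220 mail.example.com ESMTP ready"


# Made-up scores standing in for answers from a reputation service.
known_risk = {"192.0.2.10": 0.97, "198.51.100.7": 0.05}


def lookup(ip: str) -> float:
    return known_risk.get(ip, 0.5)  # unknown senders get an assumed neutral score
```

With these sample scores, connection_banner("192.0.2.10", lookup) returns the 554 refusal, while the other address and any unknown address receive the normal 220 greeting and proceed to content filtering.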

What about the possibility of false positives?

One of the unique things about a reputation system is not just its ability to identify the malicious content, but also its ability to identify the good senders and the good website hosts. That can dramatically reduce your false positive rate because we know, for example, if you, Dan, have sent legitimate traffic previously. That basically lowers your risk profile because we know that we can expect to see legitimate content from you in the future. And because that behavioral analysis can be applied to both malicious and legitimate entities that are out on the Internet, we can provide extremely high levels of accuracy and greatly reduce the false positive rate that most antispam systems out there suffer from.

How has spam changed between those early days, when it first started to pop up, and today?

It has changed in a couple of ways. In the early days we didn't really have to worry about botnets, for example, and now they're a major plague on the Internet. It used to be that spammers were renting servers and using them continuously to send spam, and you didn't have to spend a lot of time and effort to detect those servers; it was fairly obvious what they were. Nowadays they're infecting on the order of 250,000 machines every single day around the world, using them for very short periods of time to send spam, and then allowing those machines to stay dormant for months before starting to use them again. So you have to react much more quickly, and you have to worry about the fact that there are all those compromised machines out there that may be sending legitimate mail as well, so you have to treat false positives in a much more careful fashion than you used to. And of course the content has changed a great deal as well. It used to be that spam was just text messages, and now we're seeing images, videos, and audio files being sent as attachments, so it's not just the propagation method that has changed drastically, but the delivery mechanism itself.

What is image-based spam, and does it present any special challenges to the reputation system?
