Cyber-crooks are automated; you need to be, too

It’s time to automate security response, says the CSO of a $1.6 billion company, who swears by a new tool he has deployed

Golan Ben-Oni, CSO and SVP of Network Architecture at IDT Corp., is responsible for protecting the infrastructure of the diversified company’s telecommunications, payment, energy and oil businesses, which employ 1,700 people and include 12,000 endpoints worldwide. Automation is key, he says, because the attackers have upped their game. Ben-Oni shared the story with Network World Editor in Chief John Dix.

Let’s start with a thumbnail description of your company.

IDT and its affiliated firms are involved in telecom, payment services, oil exploration, energy supply and entertainment. We started in telecom, and that’s what IDT is known for, but we have since entered the vertical markets of energy and oil, and banking and finance. We’re doing shale oil in Israel specifically, and on the finance side we own banks in Europe, so we see our share of state-sponsored threats. The energy and oil business was recently spun off as a separate company, but it’s in our building and some of us have shared responsibilities. I’m responsible for the architecture of their security environment, although we’re growing and the companies are getting bigger independently.

So with energy exploration and banking concerns you’re a rich target?

Yes, that’s correct. And the reality is our adversaries are numerous and quite good, which brought us to the need for speed and for automation because I look at it in the context of a battlefield. In the beginning, adversaries were lazy because pretty much anything they did worked. They didn’t have to try very hard, and most of the time they weren’t even noticed. They could live inside an organization for weeks, months or years and just collect more and more intelligence.

Our first order of business was just gaining visibility. What that meant was we had to find best-of-breed vendors, and sometimes two or three that do exactly the same thing. I’ll give you an example. Years ago we brought in FireEye, which told us a bunch of stuff, but of course the adversaries started figuring out ways of defeating specific implementations, so we took a cue from the NSA, which believes you should deploy three of everything, and that’s what we did.

We deployed FireEye, Palo Alto Networks and Fidelis Network Systems, just on the network piece alone. Then we did the same thing on other components of the environment, like the endpoint and the user analytics space.

But we ended up with a lot of product and very little interoperability, so we inserted people in between to deal with the alerts and events and try to glue it all together into a cohesive story. That took a lot of time from an incident response perspective, but we did gain visibility. Too much visibility, though, can harm you, especially if it’s repetitive.

One of the things we had to work on was getting everything into one place. Traditionally people use SIEM tools for this and we’ve tried many. We started with RSA, moved to Nitro and now we’re very heavily focused on Splunk. One of the key differentiators with Splunk is it’s fast and digests all kinds of information. You don’t have to spend a lot of time in professional services getting it to digest data.

In 2013 my number one pain point was, “How do we gather all this data and do something about it without having to get a person involved?” Many of the alerts were clear: This machine is infected. There’s not a lot of thinking we need to do. We have to investigate the system, pull off forensic data, move it into a remediation network where it won’t harm other components of the environment, then wipe the system and get it back to the user because they’ve got business to do.

So in 2013 we did a very simple use case with Splunk with some help from our friends at Palo Alto, and it was the beginning of what became our automated incident response methodology. It was still in its infancy, but we got a lot of positive feedback from other big organizations that really wanted to do it.
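
As a rough illustration of the remediation cycle Ben-Oni describes for the clear-cut “this machine is infected” case, a playbook of that kind might look something like the sketch below. Every helper function is a hypothetical stub standing in for a real integration (network isolation, forensic collection, reimaging), not any vendor’s API.

```python
# Hypothetical sketch of the "machine is clearly infected" playbook described above.
# Every helper is a stub; none of this is a specific product's API.

def isolate_host(host_id: str) -> None:
    # In practice: push a quarantine policy so the host can only reach the remediation network.
    print(f"[{host_id}] moved to remediation network")

def collect_forensics(host_id: str) -> dict:
    # In practice: pull memory, disk artifacts and volatile state before wiping anything.
    return {"host": host_id, "artifacts": []}

def reimage_and_restore(host_id: str) -> None:
    # In practice: wipe the system, rebuild from a known-good image, hand it back to the user.
    print(f"[{host_id}] reimaged and returned to service")

def handle_confirmed_infection(host_id: str) -> None:
    isolate_host(host_id)
    evidence = collect_forensics(host_id)   # in a real pipeline this is archived for later analysis
    reimage_and_restore(host_id)

if __name__ == "__main__":
    handle_confirmed_infection("laptop-0423")
```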

But security is not our core business. We’d rather share our efforts and designs and strategy with a vendor that can go implement it. We know what to tell vendors to do, but we would rather they go and produce something that’s supportable and generally available and so on.

Had you implemented your own tools?

Yes, in the beginning we even had to integrate tools that didn’t have APIs. We were doing crazy things like logging into databases and inserting data in ways that the product was never meant to do, just to get things working. We did it with AccessData, we did it with Mandiant, and Mandiant in particular was saying, “It’s interesting, but our customers aren’t interested in automation.”

I am always on the lookout for vendors that get the vision and understand its importance, irrespective of what their other customers are saying, because we’re pretty sure where things are headed and it goes back to this battle we’re dealing with. Although attackers were initially lazy, they have since started using automated tools. So if you’ve got adversaries with automated tools on the one side and we’re running around on the other side with sneakers on our feet, that’s just not going to work. It’s not a fair battle. It’s very hard to deal with an army of automated robots.

Is there a way of putting the size of the problem in perspective?

We’re looking at everything that goes on in the network, everything that happens at the OS level, any kind of changes that happen in the file system, if there are new files that get dropped or files that get loaded, if there are mobility events. We essentially stream all of this in real time into Splunk, so it very rapidly becomes a big data problem. I think we’re indexing about 500GB a day, but we’re scheduled to go up to 5TB once I get to the level of logging I’d like.

Five terabytes per day?

Yeah. Keep in mind we’re logging literally everything that’s going on. We need complete visibility, and the Palo Alto firewalls alone contribute 200GB a day. So indicators of compromise get fed in from lots of places, and that’s not even including the third-party IOC feeds we get about things happening in other people’s networks.
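
For readers who want a concrete picture of what streaming events into Splunk in near real time can look like, the sketch below uses Splunk’s HTTP Event Collector. The URL, token and event fields are placeholders for illustration, not a description of IDT’s actual pipeline.

```python
# Illustrative only: streaming one endpoint event into Splunk over the HTTP Event
# Collector (HEC). The URL, token and fields are placeholders.
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"   # placeholder token

def send_event(event: dict, sourcetype: str = "endpoint:telemetry") -> None:
    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json={"event": event, "sourcetype": sourcetype},
        verify=False,   # sketch only; verify certificates in production
        timeout=5,
    )
    resp.raise_for_status()

send_event({"host": "laptop-0423", "action": "file_dropped", "path": r"C:\Temp\a.exe"})
```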

All of this comes in from the network side, from the endpoint side, from the user analytics side, from the threat side, so the question is, how do you get all this stuff to work together? In the beginning we had to integrate the tools ourselves, and I realized this is not the kind of work we want to do. I would rather present the problem to an organization that can run with it.

Hexadite emerged, and the interesting thing about Hexadite is they understand the kinds of issues we’re dealing with because they come out of the highly targeted state of Israel, with people who used to work in intelligence. They took this thing by the horns and said, “What are the use cases?”

We had already solved the basic use case of what happens when an active threat comes in that we know is bad: We go through this remediation cycle -- we get it off the network, pull the forensic data, wipe the machine and restore it. The harder problem is what happens when you’ve got an indicator and don’t know what it means, or maybe it’s a weak indicator.

Maybe you get an indication that something bad flowed into your organization but you don’t know if the endpoint executed it or not. Maybe you saw a credential being misused or that someone tried to log in from a system they don’t normally log in from, but it was very low activity, maybe just once or twice. That’s not the kind of thing a SIEM is necessarily going to bubble up, especially if there are millions of events happening.

When we first turned some of these systems on we were getting 15,000 events a day. That’s not something a human can deal with, so we had to tune it. The point is that everything needs to be investigated, absolutely everything, even the things that may only happen once or twice. They may be your most important indicators but, because you’re trying to do things with people, you’ll never get around to them or won’t even notice them in a rash of events.

It’s common, after all, for attackers to run interference. There may be 15 people working on capturing an organization and 10 of them just generating noise to distract your SOC from what they really need to be paying attention to. So this is where the need arises. You have to investigate everything and, if you’re going to use people, you’re never going to get it done.

And Hexadite helps how?

So what Hexadite will do for us is sense something has been triggered. Say we get something from WildFire that says a malicious binary floated into the organization, so an automated investigation is kicked off: Go look at that machine quickly. Find out whether it executed. Find out whether or not there are other things on the machine that shouldn’t be there. If we determine that it was adware or not as malicious as we thought, then we just clean off the system and return it to service.
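
A minimal sketch of that triage decision might look like the following. The helpers are hypothetical stubs rather than a Hexadite or WildFire API; the point is only the shape of the logic.

```python
# Hypothetical sketch of the automated triage described above: a sandbox verdict says
# a malicious binary entered the network, so look at the endpoint and decide.
# Every helper here is a stub, not a Hexadite or WildFire API.

def did_execute(host_id: str, file_hash: str) -> bool:
    return True                  # stub: query endpoint telemetry for process-creation events

def related_artifacts(host_id: str) -> list:
    return ["adware_toolbar"]    # stub: look for persistence, dropped files, odd outbound traffic

def classify(artifacts: list) -> str:
    return "low" if all("adware" in a for a in artifacts) else "high"   # stub severity call

def triage_binary_alert(host_id: str, file_hash: str) -> str:
    if not did_execute(host_id, file_hash):
        return "log_only"                 # binary landed but never ran
    severity = classify(related_artifacts(host_id))
    if severity == "low":                 # e.g. adware rather than a targeted implant
        return "clean_and_return"         # clean off the system and return it to service
    return "quarantine_and_escalate"      # full remediation cycle plus human review

print(triage_binary_alert("laptop-0423", "abc123"))
```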

That whole process takes about a minute. In the traditional incident response mode it would probably take 10 to 15 minutes for the correlation rules in our SIEM to kick in, then another eight or nine minutes for an operator to see the alert and try to understand some contextual information before picking up the phone to call the network team or the systems team or whomever to start to deal with isolation or investigation.

As you step through the manual process you go from minutes to hours, days even. A standard investigation done with a person on just one machine, that’s going to take hours. What happens when there are 50 machines in your organization that just got targeted? What are you going to do when the malware is polymorphic and it looks different every time? These are real live challenges that we were faced with and we realized we couldn’t throw people at this problem. That’s not possible; hence, the strong argument for automation.

Sometimes the indicators aren’t clear and they’re just a hint of something, so we’ve got to go into the systems and collect more contextual information about what happened. Maybe the alert isn’t a big deal so we’ll just shift that system into temporary remediation, go do an investigation and return it to service.

But what if it was serious? We’ll start to see things about the way that machine has acted. Maybe it started communicating to things it doesn’t normally communicate with and we’ll need to pull those IPs and go investigate the secondary systems. Or maybe we’ll see a user ID that’s starting to be used unnecessarily or in a way that isn’t normal, so we’ll need to investigate the machines that user ID may have touched.
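
That kind of pivoting is essentially a breadth-first walk over related hosts. The sketch below illustrates the idea with hypothetical stub functions; it is not how Hexadite or any other product actually implements it.

```python
# Hypothetical sketch of the pivoting described above: starting from one suspect host,
# follow the unusual IPs it talked to and the accounts seen on it, and queue the
# related systems for their own automated investigations. Helpers are stubs.
from collections import deque

def unusual_peers(host: str) -> set:
    return set()   # stub: hosts this machine talked to that it normally doesn't

def suspicious_accounts_on(host: str) -> set:
    return set()   # stub: user IDs seen on the host in an abnormal way

def hosts_touched_by(accounts: set) -> set:
    return set()   # stub: other machines those user IDs have logged into

def pivot_investigation(start_host: str, max_hosts: int = 50) -> set:
    seen, queue = {start_host}, deque([start_host])
    while queue and len(seen) < max_hosts:
        host = queue.popleft()
        for nxt in (unusual_peers(host) | hosts_touched_by(suspicious_accounts_on(host))) - seen:
            seen.add(nxt)
            queue.append(nxt)   # each of these gets its own automated investigation
    return seen

print(pivot_investigation("laptop-0423"))
```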

This is all possible through automation, whereas in the past we were doing this with people, and people are people. You may have a good guy on staff, but he may be too busy to get to everything. You may have a new guy on staff, so it’s not consistent either. People simply can’t be as consistent as automation, and they certainly can’t investigate everything.

The basic idea is to automate what you can, to enlist the services of CPUs that can handle billions of operations per second, and free up the people with the neurons. Then you end up with an operations center that is really world class. That’s the goal in all of this.

Where does Hexadite plug in?

Hexadite is software that can be run on an appliance -- we do everything in virtual systems -- and it’s reading data out of Splunk, looking for specific things. So Splunk can receive an alert from one of our security tools and initiate an automated investigation based on that alert. The alert may come from WildFire, so Splunk and WildFire will be combined on that, then Hexadite will come in on the containment side because we’ll initiate a change policy on the firewall to say “These IPs are implicated, they talk to no one.”

Hexadite then starts pivoting, looking for additional data that may be correlated to that event, looking for additional hosts that need to be investigated. It goes out and runs and starts to generate information based on what it finds. For example, it can install a micro applet on an endpoint if it needs to analyze that system. So whereas you’re talking hours or days traditionally, now within a minute or a minute and a half we’ve accomplished something.
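
As an illustration of the Splunk-to-firewall containment loop Ben-Oni mentions, the sketch below pulls implicated IPs out of a Splunk alert search and hands them to a firewall call. Splunk’s search/jobs/export REST endpoint is real, but the query, credentials and the firewall function are hypothetical placeholders, not the actual integration.

```python
# Illustrative only: pull implicated IPs from a Splunk alert search and hand them to
# the firewall for containment. block_ip_on_firewall() is a hypothetical stand-in
# for the real policy change on the firewall.
import json
import requests

SPLUNK = "https://splunk.example.com:8089"
AUTH = ("svc_automation", "example-password")   # placeholder credentials

def implicated_ips(alert_query: str) -> set:
    resp = requests.post(
        f"{SPLUNK}/services/search/jobs/export",
        auth=AUTH,
        data={"search": alert_query, "output_mode": "json"},
        verify=False,   # sketch only; verify certificates in production
        stream=True,
    )
    ips = set()
    for line in resp.iter_lines():
        if line:
            result = json.loads(line).get("result", {})
            if result.get("dest_ip"):
                ips.add(result["dest_ip"])
    return ips

def block_ip_on_firewall(ip: str) -> None:
    print(f"would push a 'talks to no one' policy for {ip}")   # stub for the real firewall API call

query = 'search index=alerts sourcetype=wildfire verdict=malicious | table dest_ip'
for ip in implicated_ips(query):
    block_ip_on_firewall(ip)
```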

What types of things will you allow it to do on its own versus stuff that you wouldn’t allow it to do?

That’s really a policy decision. The key is to develop an asset database and then you can have different policies for different assets. For example, if it’s a laptop that belongs to a casual user, you’re going to have a set of policies about that. You may have a separate set of policies about a file server. If there’s only one file server and you take that out, you may affect 600 users, but if there are two and one backs up the other, then you can feel confident about remediating that system.

Here it’s important to have a contextual database. The way we initially structured it, at least for Windows systems within the Active Directory infrastructure, is to classify servers and hosts by group, and then policies can be deployed differently for each group.
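
A toy version of that group-based policy lookup might look like the following. The group names and policy flags are illustrative, not IDT’s actual classification scheme.

```python
# Hypothetical sketch of the per-asset-class policy idea described above: hosts are
# classified by Active Directory group, and the group decides how aggressive the
# automated response is allowed to be. Names and policies are illustrative.
REMEDIATION_POLICY = {
    "Workstations":       {"auto_isolate": True,  "auto_reimage": True},
    "FileServers-HA":     {"auto_isolate": True,  "auto_reimage": True},   # redundant pair, safe to take one out
    "FileServers-Single": {"auto_isolate": False, "auto_reimage": False},  # 600 users depend on it; humans decide
}

def policy_for(ad_groups: list) -> dict:
    for group in ad_groups:
        if group in REMEDIATION_POLICY:
            return REMEDIATION_POLICY[group]
    return {"auto_isolate": False, "auto_reimage": False}   # unknown assets default to manual handling

print(policy_for(["Domain Computers", "Workstations"]))
```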
