How much are you using now?
Identify the limiting resources for each service. Your monitoring system is likely already collecting resource use data for CPU, RAM, storage and bandwidth. Typically it collects this data at a higher frequency than required for capacity planning. A summarization or statistical sample may be sufficient for planning purposes and will generally simplify calculations. Combining this data with the data from the inventory system will show how much spare capacity you currently have.
Tracking everything in the inventory database and using a limited set of standard hardware configurations also makes it easy to specify how much space, power, cooling and other data center resources are used per device. With all of that data entered into the inventory system, you can automatically generate the data-center utilization rate.
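As a rough sketch of how these data sources combine, the following Python fragment joins hypothetical monitoring figures (current usage) with hypothetical inventory figures (installed capacity) to report utilization and spare capacity per resource; all names and numbers are illustrative, not drawn from any particular system.

    # Illustrative only: join monitoring data (current usage) with
    # inventory data (installed capacity) to compute spare capacity.
    monitoring = {"cpu_cores": 620, "ram_gb": 3100, "storage_tb": 180, "net_gbps": 14}
    inventory  = {"cpu_cores": 800, "ram_gb": 4096, "storage_tb": 250, "net_gbps": 20}

    for resource, capacity in inventory.items():
        used = monitoring[resource]
        print(f"{resource}: {used / capacity:.0%} used, {capacity - used} spare")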
Normal growth
The monitoring system directly provides data on current usage and current capacity. It can also supply the normal growth rate for the preceding years. Look for any noticeable step changes in usage, and see if these correspond to a particular event, such as the roll-out of a new product or a special marketing drive. If the offset due to that event persists for the rest of the year, calculate the change and subtract it from subsequent data to avoid including this event-driven change in the normal growth calculation. Plot the data from as many years as possible on a graph, to determine if the normal growth rate is linear or follows some other trend.
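To make the procedure concrete, here is a small sketch, assuming hypothetical monthly usage samples, of subtracting a persistent event-driven step and then fitting a linear trend with NumPy:

    # Illustrative only: remove a persistent step change, then fit a
    # linear trend to estimate the normal growth rate.
    import numpy as np

    months = np.arange(12)
    usage = np.array([100, 104, 108, 112, 140, 144, 148,
                      152, 156, 160, 164, 168], dtype=float)

    step_month, step_size = 4, 24     # e.g., a product launch added ~24 units
    usage[step_month:] -= step_size   # subtract the offset from later samples

    slope, intercept = np.polyfit(months, usage, 1)
    print(f"normal growth: ~{slope:.1f} units/month")

If the residuals of a linear fit are large, replotting the data on a log scale is a quick check for exponential rather than linear growth.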
Planned growth
The second step is estimating additional growth due to marketing and business events, such as new product launches or new features. For example, the marketing department may be planning a major campaign in May that it predicts will increase the customer base by 20 to 25 percent. Or perhaps a new product is scheduled to launch in August that relies on three existing services and is expected to increase the load on each of those by 10 percent at launch, increasing to 30 percent by the end of the year. Use the data from any changes detected in the first step to validate the assumptions about expected growth.
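One way to sketch this arithmetic, using the percentages from the examples above on top of a hypothetical baseline, is to layer the planned events onto the normal growth trend:

    # Illustrative only: combine normal growth with planned-growth
    # events. The August ramp from 10% to 30% is simplified to its
    # launch-time impact.
    baseline_qps = 10_000          # current load (hypothetical)
    normal_growth = 0.02           # 2% per month, from trend analysis

    events = {5: 0.225,            # May campaign: midpoint of 20-25%
              8: 0.10}             # August launch: +10% on this service

    load = baseline_qps
    for month in range(1, 13):
        load *= 1 + normal_growth            # normal growth
        load *= 1 + events.get(month, 0.0)   # planned-growth events
        print(f"month {month:2d}: ~{load:,.0f} QPS")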
Headroom
Headroom is the amount of excess capacity that is considered routine. Any service will occasionally experience usage spikes or edge conditions that require extended resource usage. To prevent these edge conditions from triggering outages, spare resources must be routinely available. How much headroom a given service needs is a business decision. Since excess capacity is largely unused capacity, by its very nature it represents potentially wasted investment. Thus a financially responsible company wants to balance the risk of service interruption against the desire to conserve financial resources.
Your monitoring data should be picking up these resource spikes and providing hard statistical data on when, where and how often they occur. Data on outages and postmortem reports are also key in determining reasonable headroom.
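As an illustration of the kind of analysis involved, this sketch (on synthetic data) compares a high percentile of utilization against the mean to estimate how much headroom routine spikes consume:

    # Illustrative only: size spike headroom from monitoring samples
    # by comparing the 99th percentile against the mean.
    import random

    random.seed(1)
    samples = sorted(random.gauss(60, 8) for _ in range(10_000))  # % utilization

    mean = sum(samples) / len(samples)
    p99 = samples[int(0.99 * len(samples))]
    print(f"mean {mean:.1f}%, p99 {p99:.1f}%, spike headroom ~{p99 - mean:.1f} points")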
Another component in determining how much headroom is needed is the lead time for new capacity: how long it takes to get additional resources deployed into production, measured from the moment someone realizes they are required. If it takes three months to make new resources available, you need more headroom than if it takes two weeks or one month. At a minimum, you need sufficient headroom to absorb the expected growth over that lead time.
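A minimal sketch of that lower bound, assuming a hypothetical weekly growth rate, simply multiplies the growth rate by the delivery lead time:

    # Illustrative only: minimum headroom must cover expected growth
    # over the resource delivery lead time.
    growth_per_week = 0.015                  # 1.5% of capacity per week

    for lead_time_weeks in (2, 4, 13):       # two weeks, a month, a quarter
        min_headroom = growth_per_week * lead_time_weeks
        print(f"{lead_time_weeks:2d}-week lead time -> keep at least {min_headroom:.0%} headroom")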
Resiliency
Reliable services also need additional capacity to meet their SLAs. The additional capacity allows for some components to fail, without the end users experiencing an outage or service degradation. The additional capacity needs to be in a different failure domain; otherwise, a single outage could take down both the primary machines and the spare capacity that should be available to take over the load.
Failure domains should also be considered at a larger scale, typically at the data-center level. For example, facility-wide maintenance work on the power systems may require the entire building to be shut down. If an entire data center is offline, the service must be able to run smoothly from the other data centers with no capacity problems. Spreading the service capacity across many failure domains reduces the additional capacity required for resiliency, making this the most cost-effective way to provide the extra capacity. For example, if a service runs in one data center, a second data center of equal size is required to provide the spare capacity, so fully half (50 percent) of the deployed capacity sits in reserve. If a service runs in nine data centers, a tenth provides the spare capacity, and only 10 percent of the deployed capacity sits in reserve.
The gold standard is to provide enough capacity for two data centers to be down at the same time. This permits one to be down for planned maintenance while the organization remains prepared for another data center going down unexpectedly.
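The arithmetic behind these figures can be sketched as follows; the function name is mine, for illustration only:

    # Illustrative only: fraction of total deployed capacity that is
    # spare when load is spread across n active data centers plus a
    # number of spare data centers of equal size.
    def spare_fraction(n_active: int, spares: int) -> float:
        return spares / (n_active + spares)

    for n in (1, 9):
        print(f"{n} active DC(s): N+1 -> {spare_fraction(n, 1):.0%} spare, "
              f"N+2 -> {spare_fraction(n, 2):.0%} spare")
    # 1 active: 50% / 67% spare; 9 active: 10% / 18% spare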
Timetable
Most companies plan their budgets annually, with expenditures split into quarters. Based on your expected normal growth and planned growth bursts, you can map out when you need the resources to be available. Working backward from that date, you need to figure out how long it takes from "go" until the resources are available.
How long does it take for purchase orders to be approved and sent to the vendor? How long does it take from receipt of a purchase order until the vendor has delivered the goods? How long does it take from delivery until the resources are available? Are there specific tests that need to be performed before the equipment can be installed? Are there specific change windows that you need to aim for to turn on the extra capacity? Once the additional capacity is turned on, how long does it take to reconfigure the services to make use of it? Using this information, you can provide an expenditures timetable.
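Putting those questions together, a simple sketch (with hypothetical stage durations) works backward from the date the capacity must be live to the date the purchase process must start:

    # Illustrative only: work backward from the date capacity must be
    # live, through each stage's lead time, to find the "go" date.
    from datetime import date, timedelta

    needed_by = date(2025, 5, 1)
    stage_days = {
        "purchase-order approval": 10,
        "vendor delivery": 30,
        "testing and installation": 14,
        "change window and service reconfiguration": 7,
    }

    go_date = needed_by - timedelta(days=sum(stage_days.values()))
    print(f"start the purchase process no later than {go_date}")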
Physical services generally have a longer lead time than virtual services. Part of the popularity of IaaS and PaaS offerings such as Amazon's EC2 and Elastic Storage is that newly requested resources have virtually instant delivery times.
It is always cost-effective to reduce resource delivery time, because shorter delivery times mean paying for less excess capacity to bridge the gap. This is an area where automation that prepares newly acquired resources for use has immediate value.
Advanced capacity planning
Large, high-growth environments such as popular Internet services require a different approach to capacity planning. Standard enterprise-style capacity planning techniques are often insufficient because the customer base may change rapidly in ways that are hard to predict. Detecting significant changes in usage trends quickly requires deeper and more frequent statistical analysis of the service monitoring data. This kind of capacity planning also demands deeper technical knowledge: capacity planners need to be familiar with concepts such as QPS (queries per second), active users, engagement, primary resources, capacity limit and core drivers.
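As a taste of what such a core-driver model looks like, this sketch (all figures hypothetical) forecasts server count from active users rather than by extrapolating raw usage:

    # Illustrative only: a core-driver model derives the primary
    # resource (servers) from a driver metric (active users).
    import math

    active_users = 2_000_000
    qps_per_active_user = 0.004     # measured engagement
    qps_per_server = 450            # measured per-server capacity limit
    headroom = 0.30                 # business-chosen buffer

    peak_qps = active_users * qps_per_active_user
    servers = math.ceil(peak_qps * (1 + headroom) / qps_per_server)
    print(f"peak ~{peak_qps:,.0f} QPS -> {servers} servers")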
This excerpt is from the book The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2, by Thomas A. Limoncelli, Strata R. Chalup and Christina J. Hogan, published by Pearson/Addison-Wesley Professional. Reprinted with permission of the authors and publisher.