Case Study: Blog: conquering the storage bottleneck

25 August 2003
After recovering from our disk problem, I've had to get back to the daily routine of delivering storage to new products. It's amazing what a time-consuming process this is, as we have to be 100% correct on our cabling and allocations. We allocate and cable during the day on our production environment. With the rate of change we have, if we didn't do this, we'd never get any work through. Documentation and audit trail work consumes the most time, followed by liaison with other teams to co-ordinate our work. We've implemented a number of processes to simplify the interaction with other teams.

The DMX install is definitely happening. As part of ECC5 we will lose the Volume Logix commands and use the new symmacl command to administer LUN Masking. I'll write my views on that once I've got my hands on an active installation.

18 August 2003
One of our disk subsystems suffered a major problem this week. Due to a microcode bug, we lost some disks on a backend fibre loop. It's interesting to see how these problems develop and how they get resolved. Once we'd managed to isolate the real failing disk, we were able to re-create the falsely lost disks and re-establish the subsystem. My primary aim in these instances is to ensure we have no data loss. We were lucky this time, as the failed disks were part of a number of RAID-5 groups and the disk loops span RAID groups. However, data loss is the worst thing that can happen and that has to be avoided at all costs.

A problem like this brings the process of microcode upgrades into the discussion. Should we upgrade regularly or perhaps only when we are likely to suffer a known problem? It's a tricky choice and I'd probably fall on the side of regular updates. We run multiple subsystems that aren't connected so we can upgrade serially and that way detect any problems before the code is applied to all disk arrays.
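The serial approach can be sketched roughly as below. This is only an illustration of the idea: the array names and the two stand-in functions are hypothetical, and a real upgrade would involve a soak period and vendor checks between arrays.

```python
def upgrade_serially(arrays, apply_upgrade, is_healthy):
    """Upgrade arrays one at a time; stop as soon as one looks unhealthy.

    Returns (upgraded, failed), where failed is None if every array passed.
    """
    upgraded = []
    for array in arrays:
        apply_upgrade(array)
        if not is_healthy(array):
            # Stop here so the remaining arrays stay on the old code.
            return upgraded, array
        upgraded.append(array)
    return upgraded, None


# HYPOTHETICAL stand-ins for the real upgrade and health-check procedures.
applied = []


def fake_upgrade(array):
    applied.append(array)


def fake_health_check(array):
    return array != "array-b"  # pretend the new code misbehaves on array-b


result = upgrade_serially(
    ["array-a", "array-b", "array-c"], fake_upgrade, fake_health_check
)
print(result)
print(applied)
```

The point is that the third array never receives the suspect code, because the rollout stops at the first array that fails its check.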

11 August 2003
The last few weeks have been spent looking at a number of problems. First, we have had minor issues with our SAN. Some problems have been extremely difficult to resolve as we can't determine easily whether the problem is a GBIC issue, a cabling issue or related to the host or disk subsystem. Effectively the process is trial and error and this can be very time consuming.

Although we resolved our problem with ESN Manager and 2 directors that couldn't be added to the configuration, we still have a single director which crashes the configuration. This is a complex problem as the GBIC, port, cabling and even fibre adaptor into the disk subsystem have all been replaced, yet the problem remains. Putting a fibre channel analyser on this port is the next stage in resolving it.

This highlights one very interesting scenario and that is, as the SAN grows we have an increased number of problems and so more issues to resolve. Where those problems affect a large number of hosts, for instance with a core switch, the impact to servers is substantial. This leads me to think there should be a cutoff point on the number of hosts on a single fabric infrastructure and therefore a tradeoff between management ease and problem impact.

Storage demand continues to rise. We are now looking at installing a couple of EMC DMX frames as part of the drive to meet demand. This will be interesting as it requires us to install ECC Version 5 to support the new subsystems. Anyway, ESN Manager, our current management tool, can't hack the number of switches we're throwing at it and an upgrade is inevitable (and probably fun too).

14 July 2003
We're about to expand our SAN again! The decision has been made and we're going for expanding our single fabric pair. Increasing demand has required a radical increase in the port capacity of our main disk SAN. Currently we have 16 McData 6264 directors in 2 fabrics. We intend to double this to 32 per fabric, with a twin-core switch design. We need to do this as the main core switch we have (we use core/edge design) can't take any more disk connections. We had 2 options: migrate to a 140 port director or install an additional core switch. The preferred route was to use twin core directors. This takes the infrastructure to just over 2000 ports. Having a twin core design wastes another set of ISLs from each edge switch. It is more wasteful than the mammoth core switch concept; however, we would have the ability to partition the SAN at a future date if required.

Increasing the infrastructure will be a significant piece of work. Most challenging will be maintaining the configuration. So far we've managed this process with 100 percent success. The first piece of work will be installing the new switches and bringing them into the fabric, testing alerting and connectivity. Timescales are tight - it all has to be done by last week......

1 July 2003
I can't believe we are already half way through the year. Workload seems as great as ever and capacity and demand increase at a tremendous rate. For example, we're now managing nearly 600 hosts compared to 400 eight months ago. These hosts are using nearly 70TB of storage and another 8TB is being ordered this month. As our SAN has grown we are starting to see some difficult-to-solve problems which are appearing more regularly. I say difficult to solve, but they're not particularly difficult; however, in an environment that is 24x7 and mission critical, taking resources down to swap components is not something you can schedule for prime time on a Monday morning. Instead we are looking at late evenings and weekend slots, with a step by step resolution plan that is taking a number of weeks to complete. As the environment grows, this is only to be expected; however, it means less time is being spent on projects and more on problem resolution.
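To put those figures in perspective, here is a quick back-of-the-envelope sketch. It uses only the rough numbers quoted above (400 hosts growing to about 600 in 8 months, roughly 70TB in use). The steady compound growth is an assumption made purely for illustration.

```python
# Rough figures from the entry above; the "nearly" values are rounded up.
hosts_then = 400   # hosts managed 8 months ago
hosts_now = 600    # hosts managed today (approximate)
months = 8
storage_tb = 70    # storage in use today (approximate)

# Overall growth over the period.
total_growth = hosts_now / hosts_then - 1

# Equivalent compound monthly rate, ASSUMING steady growth (an assumption;
# real growth is likely lumpier than this).
monthly_growth = (hosts_now / hosts_then) ** (1 / months) - 1

# Average storage per host in GB (using 1TB = 1000GB for simplicity).
gb_per_host = storage_tb * 1000 / hosts_now

print(f"Host growth over {months} months: {total_growth:.0%}")
print(f"Equivalent monthly growth: {monthly_growth:.1%}")
print(f"Average storage per host: {gb_per_host:.0f}GB")
```

On these rounded numbers, the host count has grown by half in eight months, which works out at a little over 5% a month.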

So, as the SAN reaches the size of a 200-pound gorilla, do we let it expand to the size of an African elephant or should we give birth to a new SAN child and let them sit side by side together? At the moment I'm not sure. If we expand the current SAN, then we need to introduce additional core directors, which means ports lost to extra ISLs. If we split into two SANs, we create more manageability problems, especially for deciding how to connect our disk subsystems to each infrastructure.

The jury's still out. I think I have some more deliberating on the pros and cons before a final decision can be made. Whatever the outcome, the management challenge will remain.

20 June 2003
The last couple of weeks seem to have been dedicated to resolving performance issues on our McData switches. Quality of cable seems to be the main issue. On a number of ports we receive transmission errors of varying types, mostly CRC errors. The McData switches monitor these errors and raise alerts if the error rate exceeds a pre-defined threshold. For a number of ports that serve Notes data on Win2K servers we see server freezes, and at the moment this seems directly related to the errors received by those ports connected to our disk subsystem. Moving the affected hosts to another connection to the disk subsystem seems to resolve the problem, confirming the view that cabling is the issue.

That has led to discussions on how we can best locate and remedy the errors before they cause system impact. Obviously we are happy to accept a certain number of transmission errors; however, it seems that those ports which currently display any transmission errors are likely to raise alerts in the future, simply due to the increased traffic those ports will receive as we increase our infrastructure.

So, we are looking to use the telnet CLI to obtain error details. This has a number of benefits. First, we can set the output to be in comma delimited format, making it easier to import into a database. Second, we can reset port stats, so once we are happy we have collected the latest details, we can reset and collect the next day. Regular collection allows us to relate issues to a particular day, or a generic trend we see with a specific port.
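A minimal sketch of the collect-and-store idea might look like the following. The column names, sample values and dates are hypothetical, not the real CLI output, and the step of actually fetching the output over telnet is left out.

```python
import csv
import io
import sqlite3

# HYPOTHETICAL sample of comma-delimited port statistics. The real CLI output
# will have different columns; adjust the field names to match.
SAMPLE_OUTPUT = """port,crc_errors,link_failures
4,0,0
17,12,1
23,3,0
"""


def store_port_errors(conn, day, text):
    """Parse one day's comma-delimited output and store it, one row per port."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS port_errors ("
        "day TEXT, port INTEGER, crc_errors INTEGER, link_failures INTEGER, "
        "PRIMARY KEY (day, port))"
    )
    rows = [
        (day, int(r["port"]), int(r["crc_errors"]), int(r["link_failures"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT OR REPLACE INTO port_errors VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def ports_with_errors(conn, min_days=2):
    """Return ports that reported CRC errors on at least min_days separate days."""
    cur = conn.execute(
        "SELECT port, COUNT(*) AS days, SUM(crc_errors) AS total "
        "FROM port_errors WHERE crc_errors > 0 "
        "GROUP BY port HAVING COUNT(*) >= ? ORDER BY total DESC",
        (min_days,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
# Counters are reset after each collection, so each day's figures stand alone.
store_port_errors(conn, "2003-06-19", SAMPLE_OUTPUT)
store_port_errors(conn, "2003-06-20", SAMPLE_OUTPUT)
print(ports_with_errors(conn))
```

Because the port stats are reset after each collection, each stored row is that day's count rather than a running total, which keeps the per-day and per-port queries simple.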

All of this means more scripting....

6 June 2003
This week has been all about getting our Netapp filers to work and understanding features such as vfilers, SnapMirror and clustering. Although the features on their own are fairly straightforward to implement, the problem is integrating these features together and coming up with an applicable set of standards.
For example, resources such as volumes can be assigned to vfilers; however, SnapMirroring (asynchronous data replication between filers) is performed at the physical filer level.
The discovery on SnapMirroring has led us to rethink how we assign our network interfaces to production and management uses. Should we dedicate an interface for SnapMirroring? It's looking like we should. SnapMirroring is certainly fast, even across a 100Mb/s link, so we don't want it impacting production data access. The final test of our configuration will be to fail over a clustered filer to the backup filer and replicate from either one to a third filer. That's the challenge for next week.

30 May 2003
I've spent most of this week on support issues and disk allocations. Today alone, I allocated nearly a terabyte of space for new and existing hosts. I've also started to configure the Netapp Filers. I had one interesting issue this week. We'd powered up one of the new machines for testing but didn't have enough power in the installed rack to keep it running, so it was closed down and left for about 10 days. The NVRAM battery had reached a critically low status and when we eventually brought the box back up, the filer shut down until the NVRAM battery was recharged! It appears that the battery is OK and was probably just flat. The vfiler configuration is taking some thought. Each vfiler requires a "/etc" directory and this has to be on a disk device or qtree that will not be destroyed during the life of the vfiler. I think we'll dedicate the root volume for the physical filer as the place for the qtrees and have one per vfiler on that volume.

23 May 2003
We haven't solved our switch issue. EMC aren't confident that resetting the switches will clear the blockage. They think the problem may be elsewhere on the fabric and is not a simple ESN Manager issue. Even if we wanted to, we're being recommended against power-cycling the switches due to another bug that's been brought to our attention. This causes loss of connectivity to the switch due to some time counter related problem. Flipping between CTPs (the NIC interface) is supposed to resolve it.

Another interesting problem raised its head this week. A Solaris host which is booted from EMC SCSI disks crashed and we couldn't get it back online despite rebuilding the O/S. After numerous combinations of local and EMC disk and cabling swapping, we determined the problem to be an internal SCSI CD-ROM drive, which when replaced, allowed us to boot from the EMC or local O/S. The conclusion so far is that this somehow affected the SCSI bus, although any diagnostics we did showed no hardware problems. At the moment we are waiting to see if we have a recurrence of the problem.

19 May 2003
Now our Netapp filers are installed, it's time to start configuring and investigating how V-filers will work. Virtual filers operate above the physical filer level using a product called Multistore (which is separately licensed and, unsurprisingly, not free). I'm still not clear how all this virtualisation is going to work - there are virtual volumes (qtrees), VIFs (Virtual Network Interfaces) and now virtual filers. Additionally, we have to ensure clusters fail over to their cluster spare. We have a couple of weeks of testing (playing) to allow us to discover all of the pitfalls of bad configuration and I'm sure we'll discover them all!

16 May 2003
I'm still progressing the problem with ESN Manager and the two problem directors which have changed their IP address. It transpires that there is a procedure to ensure that the World Wide Name of a director remains the same after replacement - but that wasn't done. Deletion and addition of the directors to ESN Manager hasn't resolved the problem. I think we've ended up with a configuration where the two replaced switches have zone sets that don't match the rest of the fabric, but the zone set names on the switches are the same. Consequently, ESN Manager incorrectly believes them to exist on two separate fabrics. I'm hoping we can resolve this problem during the coming week.
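The suspected situation (same zone set name, different contents) can be sketched as a simple comparison. Everything in the example is hypothetical: the switch names, zone names and WWNs are invented, and extracting the zoning from each switch is left out entirely.

```python
# HYPOTHETICAL data: active zone set per switch as (zone set name, zones),
# where zones maps a zone name to the set of member WWNs.
switch_zonesets = {
    "director-01": ("PROD_ZS", {"zone_a": {"wwn1", "wwn2"}, "zone_b": {"wwn3", "wwn4"}}),
    "director-02": ("PROD_ZS", {"zone_a": {"wwn1", "wwn2"}, "zone_b": {"wwn3", "wwn4"}}),
    "director-03": ("PROD_ZS", {"zone_a": {"wwn1", "wwn2"}}),  # zone_b missing
}


def canonical(zones):
    """Turn a zones dict into a hashable, order-independent form for comparison."""
    return frozenset((name, frozenset(members)) for name, members in zones.items())


def find_mismatches(zonesets):
    """Group switches by zone set name and report names with differing contents."""
    by_name = {}
    for switch, (name, zones) in zonesets.items():
        by_name.setdefault(name, {}).setdefault(canonical(zones), []).append(switch)
    # A name is suspect if it maps to more than one distinct set of contents.
    return {
        name: sorted(sorted(group) for group in variants.values())
        for name, variants in by_name.items()
        if len(variants) > 1
    }


print(find_mismatches(switch_zonesets))
```

A check that compares only the zone set name would treat all three switches as consistent, which is why the contents need comparing as well.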

12 May 2003
I moved some disks today between our "old" and "new" SANs. The old infrastructure is Brocade based and we're looking to move all hosts to our new McData strategic SAN, which has much more scalability and performance. We also have a problem with our management software, ESN Manager, which is refusing to view two switches which had hardware replacements and have changed World Wide Names. Consequently we can broadcast new zoning information as the interswitch links (ISLs) ensure any changes are propagated across the fabrics; however, ESN Manager won't make any Volume Logix assignments to the switches which were discovered incorrectly. That led me to use the fpath command to manually set the Volume Logix assignments. It was a lot simpler than I thought and made me think this may be a better solution than performing a discovery of the environment (which currently takes 20 minutes).
