I have until recently been a UNIX person with some exposure to HP NonStop, and I very rarely had any issues with the SANs I managed. Each system was a "standalone" system, and I include the SunFire Midrange and High End servers in that. Some might not class them as standalone because of their domain abilities, but for this exercise I do.
However, since I started using HP Blades in B-Class enclosures, I have seen many more issues than on the standalone systems mentioned above. A report from Brocade suggested that a lot of systems were having issues, so I looked into it. One rack in particular had a number of blade enclosures that kept reporting CRC errors. As I moved around the enclosures, a pattern of issues started to appear. I just don't see these on standalone servers. The funny thing is that the blades and the standalone servers share the same Fobots, SAN switches and storage.
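For anyone wanting to do a similar sweep, a small script can pick the ports with non-zero CRC counts out of captured Brocade `porterrshow` output. This is only a sketch: the exact column layout of `porterrshow` varies by Fabric OS release, so the CRC column index is passed in rather than assumed, and the sample output below is made up for illustration.

```python
# Sketch: flag switch ports with non-zero CRC counts in captured
# `porterrshow` output. The column order differs across Fabric OS
# versions, so the caller supplies the zero-based index of the crc_err
# column among the numeric fields that follow the port number.

SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

def parse_count(field):
    """Brocade abbreviates large counters (e.g. '1.2m'); expand them."""
    field = field.lower()
    if field and field[-1] in SUFFIX:
        return int(float(field[:-1]) * SUFFIX[field[-1]])
    return int(field)

def ports_with_crc_errors(lines, crc_col, threshold=0):
    """Return [(port, crc_count)] for every data line whose CRC counter
    exceeds threshold; header and blank lines are skipped."""
    flagged = []
    for line in lines:
        line = line.strip()
        port, sep, rest = line.partition(":")
        if not sep or not port.strip().isdigit():
            continue  # not a "N: counters..." data line
        fields = rest.split()
        try:
            crc = parse_count(fields[crc_col])
        except (IndexError, ValueError):
            continue
        if crc > threshold:
            flagged.append((int(port), crc))
    return flagged

# Invented sample in the general shape of porterrshow data lines:
SAMPLE = """\
        frames      enc  crc  ...
  0:   1.2m  3.4m   0    0    0   0   0
  1:   9.9m  8.7m   0    214  0   12  0
""".splitlines()
```

Running `ports_with_crc_errors(SAMPLE, crc_col=3)` on that sample flags port 1 with 214 CRC errors; on a real switch you would paste in the actual `porterrshow` output and check the header to find the right column.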
Whenever I get a problem with a blade server, it is a nightmare to determine what the issue is. After many days of investigation, I have found that moving the blade to another enclosure fixes the problem (at least for the time being). So, what's the common denominator here? If moving the blade fixes it, that rules out the QLogic mezzanine cards and the Fobot/switch infrastructure. I tend to move the cables and SFPs to the new enclosure as well, so they do not seem to be the issue either.
So, that leaves a couple of things: the enclosure backplane, those strange little FC hubs used in the enclosure, and, last but not least, the card that sits in the network switch and gives the enclosure its SAN capability. As it is difficult to replace any one of those items without an outage, I normally give up and don't put any blade that requires SAN into the enclosure.
I also found that systems under stress are more likely to have CRC errors than idle ones. I have seen so many CRC errors that the server just gives up, yet I don't see the switch reporting issues on the port the server uses.
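When the switch port looks clean, it is worth checking the host side of the link too. On Linux, the FC transport class exposes per-HBA counters under `/sys/class/fc_host/hostN/statistics`, including `invalid_crc_count`; sampling it twice shows whether errors are still accumulating while the system is under load. A hedged sketch (the sysfs values are hex strings on the kernels I have seen, but that and the path layout are assumptions worth verifying on your distribution):

```python
# Sketch: sample the HBA-side CRC counters from sysfs and diff two
# samples to see whether errors are accumulating under load.
import glob

def read_fc_stats(stat="invalid_crc_count"):
    """Return {hostN: counter} from /sys/class/fc_host (Linux only).
    Assumes the sysfs files hold hex strings like '0x4'."""
    counts = {}
    for path in glob.glob("/sys/class/fc_host/host*/statistics/" + stat):
        host = path.split("/")[4]  # the 'hostN' path component
        with open(path) as f:
            counts[host] = int(f.read().strip(), 16)
    return counts

def crc_delta(before, after):
    """Counter increase per HBA between two samples."""
    return {h: after[h] - before[h] for h in after if h in before}
```

Usage would be something like taking `before = read_fc_stats()`, running the stress workload, then `crc_delta(before, read_fc_stats())`; a non-zero delta on the HBA while the switch port stays clean points at the path between the mezzanine card and the switch.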
We have a bunch of new C-Class enclosures that provide SAN connectivity in a very different way, but they are still in a test environment and I can't comment on them yet.
So, am I the only person in the world who sees SAN-related issues as endemic in HP B-Class enclosures? Over the last year or so, I have seen many tens of these issues, compared to rarely seeing one per year in a standalone environment with even more servers than the blades in production.
So, two jobs (the first using Linux, the second Windows) and many enclosures later, I still see issues. Or am I just imagining things?