Events/alarms and hopefully we get the right ones.

To be honest, storage has always been my second choice of vocation, as my heart has been with Enterprise Management Systems for a long time.  I seem to go through cycles of it, along with high-end Solaris/Oracle work.  I actually thought that HSSM/HTnM was a good mix with what I do now.  That was until it was decided that our management station would be Microsoft MoM, and that some dinky (read: cheap and not well supported) management solution that was neither Openview nor Smarts Incharge would be used.

Today I was happily zoning away (while surfing for a new job), because I get a lot more to play with this weekend, and I noticed that a zoning change had not been reflected at our remote site.  After a bit of investigation I found that our Cisco ONS had something wrong with it and our second fabric was segmented.  I always thought that we could get away with one fabric as everything is dual pathed.  Well, only one system did have a problem, and it was non-production, so I thought heck, it can wait.

Some time later I was told that I had my ISLs back, which is interesting as our networking people swore blind my switches were at fault.  So I thought I would investigate whether anyone actually knew what impact this had had on our organisation.  Almost no one had a clue what had happened.  Not one event or alarm filtered through to our enterprise management people.  Sure, there were a couple of port failures that appeared in Brocade Fabric Watch, but we get them all the time because we are a Windows shop and every problem means a reboot.  So people don't take any notice of the events.  No event said "THIS PORT IS AN ISL AND IT IS IMPORTANT".
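
As a rough sketch of what a role-aware event could look like, here is a small Python example.  It assumes someone maintains an inventory of which port does what; the switch name, port numbers and the PORT_ROLES table are all made up for illustration and nothing here comes from Fabric Watch, HSSM or MoM.

```python
from dataclasses import dataclass

# Hypothetical inventory: (switch name, port number) -> role of that port.
# In real life this would have to come from your own documentation or tooling.
PORT_ROLES = {
    ("sw-remote-01", 0): "isl",
    ("sw-remote-01", 4): "storage",
    ("sw-remote-01", 12): "host",
}

# Assumed policy: ISL and storage ports matter, host ports reboot all the time.
SEVERITY_BY_ROLE = {"isl": "critical", "storage": "critical", "host": "info"}


@dataclass
class PortEvent:
    switch: str
    port: int
    state: str  # "offline" or "online"


def classify(event: PortEvent) -> str:
    """Return a severity based on the role of the port, not just its state."""
    if event.state != "offline":
        return "info"
    role = PORT_ROLES.get((event.switch, event.port), "unknown")
    # Ports missing from the inventory get a warning so the gap is noticed.
    return SEVERITY_BY_ROLE.get(role, "warning")


def describe(event: PortEvent) -> str:
    """Build a message that says what the port is for."""
    role = PORT_ROLES.get((event.switch, event.port), "unknown")
    severity = classify(event)
    return (f"[{severity.upper()}] {event.switch} port {event.port} "
            f"({role.upper()}) is {event.state}")


if __name__ == "__main__":
    print(describe(PortEvent("sw-remote-01", 0, "offline")))
    print(describe(PortEvent("sw-remote-01", 12, "offline")))
```

The point is only that the same "port offline" event ends up with a different severity and a different message depending on what hangs off the port, so a rebooting Windows host and a dead ISL no longer look identical.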

Without getting into details, all was sort of OK (more like complete ignorance) even though I had a segmented fabric due to zoning mismatches.  So I decided to fix the segmented fabric.  That's where everything went pear-shaped.  There was a zoning transaction in the fabric that stopped the fabric merge.  This was discovered only after the config on the remote switch had been completely removed, or so I thought, and so did the switch.  This meant that the switch thought it had no config, so it was a completely stupid switch for about two hours while the config issue was sorted out.  Things started happening in the SAN, mainly due to database servers not really handling one path going missing (note to self: discuss this with the HDLM people).  People started getting upset and a number of sev 1 calls were raised.  Even after everything was fixed, we still had no events/alarms that suggested what had gone wrong.  The Brocade switch did say a fabric mismatch had occurred, but just how many people in any organisation know what that means?

So what is the moral of the story?  Ignorance is not bliss.  The most important things will most probably only happen once every few years, and there will be new people when they happen again.  HSSM was of no value whatsoever.  Microsoft MoM .. don't get me started on that.

This was my first segmentation exercise ever.  The major fault was that the switch did not tell me about this zoning transaction that was stopping the fabric merge.  I will take that information to my next job.

We as an organisation are used to SAN outages because we still have IBM DS4000 series arrays…. but not for much longer.

Stephen

5 Responses to “Events/alarms and hopefully we get the right ones.”

  1. SteveK Says:

    Yeah, HSSM isn't great for those sorts of alarms. We are in the same boat: port offline? 99% chance it was a Windows reboot. Ignore it.

    I think you could set up a custom event filter in HSSM Setup so certain ports send different notifications. But don't ask how.

    But we use EMC Control Center for that now. You can take the Port Offline events, split them up, and decide on a port-by-port basis which ones should be a "Red Alert", with e-mail alerts. It takes a while to set up, but it seems to be fine. By default it's on for all SAN ports, but if you have switches which are monitored by EMC's secure remote support gateway, Windows server reboots can cause calls to be opened with EMC, with the result that the planned Windows reboot on Sunday night means someone gets woken up by an EMC support guy on the other side of the world saying "Hey, we have an alarm from a SAN switch". After the third time I decided it was time to do something, and after talking with EMC for a while, I managed to fiddle with the switch events so that only ISLs and storage system ports generate alarms. I haven't had an ISL go down on me yet, so I can't verify that it really works (I have to check this at the next maintenance window).

    Another option which we want to use is syslog from the switches. Brocade has a facility to log all switch events to a syslog server, then use the syslog server to sort out which messages are important. I would guess that segmentation errors would also pop up here. We still have to talk nicely to our Unix Admins to get a syslog server up and running for us though. Fabric Watch (extra license from Brocade) could also be of use here if you have Brocade. Can't comment on Cisco as I have no experience with that.

    Regards,

    Steve

  2. Ed Says:

    Can you elaborate more on your issues with the IBM side of things, specifically the DS4000 stuff?

  3. Janåke Rönnblom Says:

    So what's so bad about MS MOM?

  4. Stephen2615 Says:

    Steve,
     
     
    Nice EMC plug there. I really prefer MDS switches and feel a lot more comfortable with them than with the Brocade ones, but it seems to be a matter of preference.  Our HDS support folk (kudos to them) are mystified as to why I would choose an MDS over a Brocade switch.  I said I preferred a Qlogic switch to a Brocade and there were fits of laughter for hours until they realised I was not joking….
     
     
    I have emailed Ed directly to stop any more IBM bashing… well at least for the time being.
     
     
    MoM .. I am very happy that we have a Swedish poster.  Welcome Janåke!
     
     
    I should rephrase my comment, as MoM may not really be the problem.  It is more a matter of finding good people who understand MoM and who think (or even know of) anything outside the Microsoft Universe.  We have had calls open with MoM support for ages when it comes to simple integration with Brocade’s Fabric Watch.  Even getting simple MIBs loaded for HTnM integration is for some reason difficult.  I mentioned that I specialised in Openview and Smarts Incharge (a very, very good product even though EMC owns it now), and it was relatively easy to get good events (and very accurate root cause analysis) out of those products.  MoM is obviously Microsoft focused and the same goes for HPSIM, etc.  I read that the Smarts Incharge Codebook Correlation Engine was going to be OEM’ed into MoM and I was excited.  I have not heard any more since that dubious announcement, but if it were true, life would become much more interesting when it comes to looking after events not related to Microsoft.
     
     
    I know that systems like Openview et al are complex, but considering the cost of the systems, you would hope that they could deliver.  My manager thinks MoM is the cure for all IT-related issues.  I disagree.  Give me Openview and I would deliver in a short time what the MoM people have promised for years.  I should say that IBM OEM’ed Openview in the early 90s and created Netview or Netview/6000 or whatever it is called these days.  Just don't confuse it with the Mainframe Netview.  It is similar (/usr/OV) but that's where it stops.  I think that IBM/Tivoli purchased Micromuse hoping that Omnibus would take over from where Netview and the Tivoli TEC failed in smaller organisations.
     
    Stephen

  5. Chris M Evans Says:

    Stephen
    Perhaps we need a specialised, dedicated SAN monitoring tool which is capable of going to the granularity of understanding the issues you were seeing.  I think even with the right MIB, the existing tools are too generic.
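
For what it's worth, the syslog idea from the first response (log everything to a syslog server, then let the server sort out which messages are important) can be sketched in a few lines of Python.  This is only an illustration under assumptions: the keyword patterns and the sample lines are invented, and real switch messages will be worded differently, so the patterns would need to be adapted to whatever your switches actually log.

```python
import re

# Hypothetical patterns for fabric-level trouble. These are NOT taken from any
# vendor's message catalogue; adjust them to the real text your switches send.
IMPORTANT_PATTERNS = [
    re.compile(r"segment", re.IGNORECASE),
    re.compile(r"zone (conflict|mismatch)", re.IGNORECASE),
    re.compile(r"fabric (merge|mismatch)", re.IGNORECASE),
]


def is_important(line: str) -> bool:
    """True if the syslog line matches any fabric-level pattern."""
    return any(pattern.search(line) for pattern in IMPORTANT_PATTERNS)


def filter_important(lines):
    """Keep only the lines worth waking someone up for."""
    return [line for line in lines if is_important(line)]


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "sw-remote-01 port 12 went offline",
        "sw-remote-01 fabric segmented on port 0",
        "sw-remote-01 Zone conflict detected during fabric merge",
    ]
    for line in filter_important(sample):
        print(line)
```

Routine port-offline noise falls through, while anything mentioning segmentation or a zoning mismatch is kept for forwarding to the enterprise management people.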
