
USP-V and Hitachi High Availability Manager

Well, there has been a lot of confusion over the past couple of days about the latest announcement from HDS about clustering of the USP-V.  I hope to clear up some of the technical details with this post, but I’m sure some questions will remain.

First, let’s talk about why someone would want to cluster the USP-V.  For me there are only two reasons to do this.

First, if I have a box that has been on the floor for 3+ years and I’m having to pay maintenance on it, it might make more sense for me to buy a new box with more capacity and faster "gazintas and gazouttas".  In order to do that I’m going to need to migrate all of my data/information off of the old system.  Well what better way to do that than to connect my new box to the old one and migrate everything non-disruptively to my applications?  When I’m done, I scrub the old box and out the door it goes. 

Second, say I work for a company that can’t have any downtime at all.  Stop counting the 9s of availability, I want 100% availability!  Well, I can now cluster my USP-V (which was already in the 6 to 7 nines range anyway) and have 100% availability for my applications.  The mainframe can do it already, so why can’t my open systems have it too?  Well, now they can.
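
To put those nines in perspective, here is a quick back-of-the-envelope sketch of how much downtime per year each level allows.  This is just arithmetic, not an HDS figure; the function name and the 365.25-day year are my own assumptions.

```python
# Yearly downtime budget for a given number of "nines" of availability.
# Assumption: a year is 365.25 days; this is illustrative arithmetic only.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Return allowed downtime in seconds per year (e.g. 5 nines = 99.999%)."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5, 6, 7):
        print(f"{n} nines: {downtime_seconds_per_year(n):.2f} seconds/year")
```

At 6 to 7 nines the budget is already down to seconds per year, which is why the remaining gap to 100% is mostly about planned events such as migrations.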

So let’s look at how these things are done.  Looking at the diagram below, here is how it breaks down.

  • Physical capacity can reside on internal or external disk.  (P-VOL/S-VOL)
  • The P-VOL and S-VOL have the same VOL ID in the SCSI Inquiry.  The RCU takes on the MCU serial number upon failover.  (Nice, eh?)
  • External volumes must be mapped to both storage controllers.
  • A quorum drive is created on both subsystems for checking whether there are any data differences between the MDKC and RDKC in order to ensure data consistency.
    • You can have up to 32 Quorums per USP-V. 
    • They can be from 38MB to 4TB in size. 
    • They can reside on any currently supported device.
  • Alternate path software recognizes the P-VOL and S-VOL as the same LU on multiple paths.  P-VOL paths are the owner, S-VOL paths are the non-owner.
  • Under normal conditions, write data is transferred from P-VOL cache to S-VOL cache without being destaged from the S-VOL.
  • When I/Os fail on all P-VOL owner paths, the alternate path software issues I/O to the non-owner S-VOL paths.  The storage controller stops the copy of cache data and the S-VOL becomes write enabled.
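
The owner/non-owner behavior in the last three bullets can be sketched as a toy model.  This is only my illustration of the description above, not how HDLM or the USP-V microcode actually implements it; the class and field names (Volume, HamPair, healthy_paths) are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    write_enabled: bool
    healthy_paths: int  # hypothetical count of working host paths


@dataclass
class HamPair:
    pvol: Volume          # owner paths
    svol: Volume          # non-owner paths
    copying: bool = True  # cache-to-cache copy runs in normal condition

    def route_write(self) -> str:
        """Return the name of the volume that would service a host write."""
        if self.pvol.healthy_paths > 0:
            # Normal condition: I/O goes down an owner path to the P-VOL.
            return self.pvol.name
        if self.svol.healthy_paths == 0:
            raise IOError("no working paths to either volume")
        # All owner paths failed: stop the cache copy, write-enable the S-VOL.
        self.copying = False
        self.svol.write_enabled = True
        return self.svol.name


if __name__ == "__main__":
    pair = HamPair(Volume("P-VOL", True, 2), Volume("S-VOL", False, 2))
    print(pair.route_write())    # owner path still available
    pair.pvol.healthy_paths = 0  # simulate loss of every owner path
    print(pair.route_write(), pair.svol.write_enabled, pair.copying)
```

The point of the sketch is that the host-side path software makes the switch, and the array side reacts by ending the copy and opening the S-VOL for writes.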

So hopefully that answers a lot of the questions people have had.  I think it’s really simple.  HDS has had this functionality built into the USP-V since day one.  The microcode has been there (LDKC anyone?).  The hardware port(s) have been there as well.  Ever looked at a USP-V and asked yourself "I wonder what this port is for that nothing is plugged into?"  ;)   I really don’t think it is any coincidence that HDS is releasing this now, since a lot of the boxes that were bought when these went GA are going to be coming off lease soon.  I really don’t think that there is anything else behind this announcement.  Sorry to disappoint anyone.

A few other notes:

  • A USP-V and an HP XP24000 will be able to hook together at GA time.
  • This will work on all supported RAID types.
  • CVS volumes are supported.
  • All currently supported Open Systems OS’s are supported.  Mainframe isn’t today.
  • You can have up to 64 physical paths between boxes.  Fast, huh?
  • Max distance is the standard TrueCopy Sync distance.  After all, we’re only doing synchronous replication here with some software to manage it.
  • You can create up to 64K pairs between devices.
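
If you want to sanity-check a planned configuration against the numbers quoted in this post, something like the following would do.  It is a hypothetical helper, not an HDS tool; the limits are the ones listed above, and treating MB/TB as binary units and "64K" as 65,536 are my assumptions.

```python
# Limits as quoted in this post. Unit interpretation (binary MB/TB, 64K = 65536)
# is an assumption, and check_ham_plan is a hypothetical helper.
MAX_QUORUMS = 32
MAX_PATHS = 64
MAX_PAIRS = 64 * 1024
MIN_QUORUM_BYTES = 38 * 1024**2
MAX_QUORUM_BYTES = 4 * 1024**4


def check_ham_plan(quorum_sizes_bytes, paths, pairs):
    """Return a list of problems with the plan; an empty list means it fits."""
    problems = []
    if len(quorum_sizes_bytes) > MAX_QUORUMS:
        problems.append(f"{len(quorum_sizes_bytes)} quorums exceeds {MAX_QUORUMS}")
    for i, size in enumerate(quorum_sizes_bytes):
        if not MIN_QUORUM_BYTES <= size <= MAX_QUORUM_BYTES:
            problems.append(f"quorum {i}: {size} bytes is outside 38MB to 4TB")
    if not 1 <= paths <= MAX_PATHS:
        problems.append(f"{paths} paths is outside 1 to {MAX_PATHS}")
    if not 0 <= pairs <= MAX_PAIRS:
        problems.append(f"{pairs} pairs exceeds {MAX_PAIRS}")
    return problems


if __name__ == "__main__":
    for problem in check_ham_plan([MIN_QUORUM_BYTES], paths=8, pairs=1000):
        print(problem)
```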

4 Responses to “USP-V and Hitachi High Availability Manager”

  1. Tony Asaro Says:

    Snig - great analysis. I agree that High Availability Manager is extremely useful for data migrations and for companies that want 100% uptime. One of the core tenets of any technology is to automate otherwise mundane, complex and error-prone manual processes, and that is exactly what this has done. I’ve spoken to the Hitachi field on this and it is a no-brainer for them – it essentially reinvents data migrations in the Enterprise and offers a 100% application uptime option.

  2. Chris M Evans Says:

    Snig,
     
    Good post, but it raises even more questions.  Firstly, 100% availability in an array is no use if the remaining components don’t also offer that same redundancy; servers, switches, IP network and so on.  That’s not easy to achieve and not cheap and therefore there will be few customers who need that. 
    Migrating from one array to another after a 3+ year cycle is rarely done without some level of re-stacking or re-organisation.  H-HAM doesn’t help in that scenario.
    Does this configuration operate in a remotely replicated scenario where TrueCopy is already being used to another datacentre?  I doubt it will. Surely those customers who crave 100% availability also want the array to replicate to another location synchronously too?
    This configuration seems to be active->passive and only active->active if you configure some LUNs for a host to work in each replication direction.  If so, then there’s extra management overhead required to make this work.
    You describe the process of "failover".  What instigates this?  The host, or the array?  How does the multipathing software determine the primary volume is non-responsive and make the decision to fail over?
    What is the process of re-establishing the failback after an outage/issue on the primary array?  Is this seamless?
    If a failure in a RAID group in the primary array occurs, can only some LDEVs be failed over to the remote array?
    I could go on, but it’s getting tedious.  My point is that claiming 100% availability is a big statement; it means both failover/failback are 100% seamless. 
    Chris

  3. snig Says:

    Chris, thanks for all the great questions.

    100% availability is a much easier task today with the server virtualization platforms and all they can do.  I can’t remember the last time I built a solution without multiple NICs, HBAs, switches (FC and IP), and so on.  The only thing that wasn’t really viable from a technology standpoint was the disk subsystem, at least not without having to write a ton of scripts and depend on people to update them when changes were made.  Well, with HAM, it seems that HDS has removed that from the list of issues.

    As far as working to an existing TrueCopy site, I don’t see why it wouldn’t work.  I’ve been told that HAM will work at TC Sync distances and will be able to use HUR for a tertiary site as well.

    Active/Passive is correct, but as with any clustering technology it takes additional work if you want Active/Active.  Nothing really has changed there.  They haven’t reinvented clustering, just made it available on their subsystem.

    "Failover" is addressed in this bullet:

    • When I/O’s fail to all P-VOL owner paths, alternate path software issues I/O to the non-owner S-VOL paths.  Storage controller stops the copy of cache data and the S-VOL becomes write enabled.

    So the host will handle that failover.  HDLM will support it first, and other failover software will follow.  The decision will be made based on the same criteria as it is today.  My immediate assumption is that a path goes away and times out.

    The RAID group scenario is a great question.  I don’t know the answer.  Thinking through it out loud, though, it seems to me that you would be able to fail over only some LDEVs, since you can do that with TC.  But would we have to fail over the entire consistency group?

  4. Earnest Henderson Says:

    Very nice write-up of this new feature – much more information than Hitachi has posted yet.  One quick question or three:  Assuming a condition occurs which triggers a failover to the secondary / passive side, how long does this failover operation take?  In other words, how long before I/O is being serviced normally to the attached hosts from the secondary volumes?  Is this length of time the same for a pair of controllers with 500 volumes vs. 50,000 volumes?
