
Data deduplication

It has been a while, but here's something I have been trying to figure out for some time now.

We bought a Diligent VTF with ProtecTIER (data deduplication on virtual tapes) in the first quarter of this year. The main reason was that nice feature where we could store up to 25 times as much data as the actual disk space we had available for virtual tape.
Since then, almost all virtual tape vendors have added some form of data deduplication. I guess this is a flaming hot feature. The big-iron vendors are lagging behind a bit, but no doubt they will soon step up and try to dominate this field too.

The one thing I wonder about is, since storage is this hot, why aren't the disk vendors incorporating data deduplication in their subsystems?

  • Is it the fact that they will lose revenue on the disk sales? (my best guess).
  • Is it because it is too hard to write the code? I don't think this is a problem though.

Computer room cooling, energy costs and floor space should all be factors to take into consideration. These factors would justify data deduplication on disk storage from an end-user point of view. Vendors always find ways to increase their profits, even though hardware prices are dropping constantly. I see no obstacle here.

Up until 2003 we were using IBM's RVA (StorageTek OEM) on mainframe systems. The log-structured volumes were perfect. Compression was done in the RVA, without the host knowing it. We were able to store about three times as much data as the physical capacity actually present in the box. Need more volumes? No problem. We'd carve out another model 9 volume (9GB), and no actual physical disk space was consumed until some data was stored on it.
When StorageTek came out with the successor of the RVA, called the V2X, we were eager to use it because it also supported open systems LUNs.

Unfortunately the design of the V2X had some flaws. Especially in the microcode. The V2X wasn't stable enough for us to run open systems volumes on it. The compression and snapshot mechanisms worked fine, but paths from the host to the V2X kept on dropping connections. We decided to send it back to STK, and we continued our storage services on the IBM Sharks.

So doing compression and deduplication in the storage box is just a heartbeat away. Now we just need to wait for the first vendor to pick up the gauntlet and start shipping the storage box including dedupe code. The rest will soon follow.
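
To make the idea concrete, here is a minimal sketch in Python of what block-level deduplication inside a storage box could look like. It is a toy under stated assumptions, not a description of how ProtecTIER or any shipping product works: the fixed 4 KB block size, the use of SHA-256, and every name in it (DedupStore, write, read and so on) are made up for illustration. It also compares the bytes whenever a hash matches, so two different blocks can never be mistaken for the same data.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, purely illustrative


class DedupStore:
    """Toy block store: identical blocks are kept on 'disk' only once."""

    def __init__(self):
        self.blocks = {}   # digest -> block bytes (the physical storage)
        self.volumes = {}  # volume name -> list of digests (the block map)

    def write(self, name, data):
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            stored = self.blocks.get(digest)
            if stored is None:
                self.blocks[digest] = block  # first time we see this block
            elif stored != block:
                # Same hash, different bytes: refuse rather than corrupt data.
                raise RuntimeError("hash collision detected")
            digests.append(digest)
        self.volumes[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.volumes[name])

    def nominal_bytes(self):
        # What the hosts think they have stored.
        return sum(len(self.blocks[d])
                   for digests in self.volumes.values() for d in digests)

    def physical_bytes(self):
        # What actually sits on disk after deduplication.
        return sum(len(block) for block in self.blocks.values())


# Hypothetical example: the same 12 KB "backup" written on two days.
store = DedupStore()
backup = b"A" * 8192 + b"B" * 4096
store.write("monday", backup)
store.write("tuesday", backup)
ratio = store.nominal_bytes() / store.physical_bytes()
print(f"{ratio:.0f}:1")  # prints "3:1"
```

The byte comparison costs an extra read on every duplicate, which is the price for not trusting the hash alone. A real subsystem would also have to deal with reference counting, deletes and variable-length chunks, all of which are left out here.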

 Need someone to test for you? Just give me a call ;-)

8 Responses to “Data deduplication”

  1. Storagezilla Says:

    What is the spec of the ProtecTier server(s) you went for?

  2. c2olen Says:

    We're running on an HP DL585G1, 8GB memory, 4 CPUs, 4 front-end Emulex ports, and 4 back-end Qlogic ports. The Emulex ports get reconfigured to be used as a target device for tape emulation. The back-end storage is currently running on a DS4300 Turbo SATA box, fully equipped. Running smoothly (now).

    It wasn't a smooth path from the beginning though. In the beginning there were several issues with failover on the back-end disks, due to the fact that the multipath software wasn't really incorporated into the "blackbox" concept we thought it would be. We experimented with device-mapper-multipath packages, RDAC and default kernel failover. The ProtecTier runs Red Hat EL 4 update 2 under the hood. New configuration concepts in the DS4300 in combination with RDAC eventually did the trick.

    Another issue was raised when we did a path-fail test without multipath failover enabled. The ext3 filesystems were corrupted. The journalling in ext3 and fsck commands fixed the filesystems, but within the ProtecTier software, the metadata corruption couldn't be fixed. A patch to the software fixed this incorrect reporting of corruption.

    In the meantime all problems seem fixed, as all tests went fine after applying patch 1.2.1.9.

    As a side note, I have to mention that the Diligent support was phenomenal and open. I received notification of all ticket updates made, including comments and remarks from the development folks. This isn't common to my knowledge.

    I think you HDS people are interested in all information about the Diligent VTF stuff, since HDS has a partnership with Diligent, right?

  3. Storagezilla Says:

    I’m not an HDS person, but it’s interesting to read your impressions. ;)

  4. c2olen Says:

    Storagezilla, I wasn’t actually referring to you when I mentioned HDS people, but I know that a couple of HDS’s crew follow this weblog.

    I am glad you’re interested. We’re taking the Diligent machine into production right now. I’ll post some stats on it after a couple of weeks.

  5. Nigel (mackem) Says:

    Data de-duping in a storage box – worth some thought! Especially since the guys at Diligent claim that they can map 1PB of storage using just 4GB RAM.

    I'd really be keen to know if you get anywhere near the 25:1 claims being made about this solution. Obviously it will be a while before you will know, but please keep us posted.

    I'm also really interested to know how they guarantee 100% data integrity. A couple of years ago a company I worked for had a little think about using an HP product called RISS, which did a form of single instance storage that applied hashes against data being saved. I remember at the time being worried about the slight chance of two different sets of data generating the same hash and being mistaken for the same data. Do Diligent provide other safety features?

    Also, when you start getting close to the 25:1 ratio, I'd be interested to know how much of an impact this has on restore time. Are you backing off to tape at all?

    There seem to be lots of claims out there about de-duping products but I really don't know what to believe – “there are lies, damned lies and marketing materials”

  6. c2olen Says:

    The 25:1 factoring ratio, as Diligent calls it, was one of the reasons we took a peek at the product.
    But the factoring greatly depends on a variety of variables.

    • What is the retention period?
    • How many versions of backup data do you keep?
    • How is your backup cycle configured? Daily fulls, incrementals, differentials and so on.
    • Is the data already compressed by the client?
    • And some more….. 

    We have it connected to our TSM servers. On the clients, we are not in a position to do uncompressed backups, because the network interfaces are not capable of handling high-bandwidth traffic. Please don't start on this, because I've been trying to get those server admins to upgrade or reconfigure to at least 1 gigE.
    Go LAN-free, you think? Yeah sure, we would, if those damn servers were running anything other than AIX 4.3. Don't start on this either. I'll give you the legacy software excuse for this ;-)

    When doing preliminary tests, we backed up some Oracle instances multiple times, and the factoring did kick in and went up to 20:1.
    But this was not a real-life situation at all. On a variety of files, filetypes and filesizes, factoring drops until it goes up again after several regular backup cycles.
    We started with the assumption (based on Diligent's homework) of a factoring of 10:1. With the current data already being compressed by the client, we manage to get a factoring of 4:1, which is still good from my point of view.
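
For anyone who wants to put numbers on these ratios, the arithmetic is simple: the factoring ratio is the nominal (host-visible) data divided by the physical disk actually used. Below is a small Python sketch. The function names and the 10,000 GB repository size are hypothetical and are not our real configuration; only the 4:1, 10:1 and 25:1 ratios come from the discussion here.

```python
def factoring_ratio(nominal_gb, physical_gb):
    """Nominal (host-visible) data divided by physical disk used."""
    if physical_gb <= 0:
        raise ValueError("physical_gb must be positive")
    return nominal_gb / physical_gb


def nominal_capacity_gb(physical_gb, ratio):
    """How much backup data fits on physical_gb of disk at a given ratio."""
    return physical_gb * ratio


REPOSITORY_GB = 10_000  # hypothetical repository size, for illustration only

for ratio in (4, 10, 25):
    fits = nominal_capacity_gb(REPOSITORY_GB, ratio)
    print(f"{ratio}:1 -> {fits} GB of virtual tape data")
```

With the same disk, the difference between 4:1 and 25:1 is a bit more than a factor of six in how much virtual tape data fits, which is why the client-side compression matters so much.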

    The image below shows the current ratio. Only a little data is stored so far, in client-compressed form, so this isn't really a good reference. But I thought you'd like to see some information.
    The curves in the graph indicate the fluctuation when brand-new data is stored, and later when multiple versions are stored.

    [Image: Diligent ProtecTier GUI (repository)]

    If the image resizing made it too blurry for your eyes, check this full-size version right here.

  7. Chris M Evans Says:

    I did some of the first customer releases of RVA (or Iceberg) back in 1996 when StorageTek first released it. It was a great product although the downfall was undoubtedly performance. When IBM OEM’d it I had a fantastic meeting with some of the people who then went on to write the redbook on RVA. I remember having to explain to one guy 4 times what the capacity of the array “could” be….

    So, 4:1 – not as good as the promised 25:1 that would sell this to management….

  8. c2olen Says:

    Chris, you are right, 4:1 isn’t near what was promised.

    If you have read all the comments carefully, you should also have noticed that we are not running backups and tape storage according to the “best practices”.
    We still use client compression on the majority of our systems.
    This is due to chargebacks, which are based on the network traffic to our TSM servers. When we disable client compression, the amount of network traffic suddenly increases big time, and so does the chargeback.
    We are making the changes gradually now, and the factoring rate is increasing.

    Based on our policies and retentions, we were promised 10:1 factoring. I believe we will exceed this a bit in a couple of months.
