It's been a while, and 2014 has been quite eventful, but I apologize for not writing anything useful for over a year. Incidentally, I ran this experiment back in August but never published it, so here goes.
I've been pondering for a while how to back up my data somewhere safe, and I thought of renting my own server online and then using rsync (from both Windows and Linux) to keep the contents up to date. To keep costs down, I looked into deduplication solutions and searched for what is currently available under Linux, since that's what I would run on the rented server.
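For context, the idea was along these lines; this is only a minimal sketch, with a placeholder server name and paths rather than my actual setup:

    # Mirror a local directory to the rented server over SSH.
    # -a preserves permissions/timestamps, -z compresses in transit,
    # --delete removes files on the server that no longer exist locally.
    rsync -avz --delete /data/to-backup/ user@my-rented-server:/srv/backup/

On Windows the same command shape works through a Cygwin-based rsync port such as cwRsync.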
I found OpenDedup.org and it looked quite promising, even though it's only about two years old. Their code trunk on Google Code has commits from July, their forums have active users, and I saw a patch mentioned that was due to be pushed on August 24.
I created a VM with Debian 7.6 (x64) and installed the prerequisite, Java 7 (yuck!). The OpenDedup Debian package contains the binaries needed to create the filesystem (mkfs) along with the CLI tools. There are other packages for web-based management, but I didn't try those.
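From memory, creating and mounting an SDFS volume looked roughly like the following; treat it as a sketch, since the volume name, capacity and mount point are just examples and the exact syntax differs between SDFS releases:

    # Create a deduplicated SDFS volume (name and capacity are examples).
    mkfs.sdfs --volume-name=pool0 --volume-capacity=256GB

    # Mount it like a regular filesystem; newer releases use this simple form,
    # older ones used something like: mount.sdfs -v pool0 -m /media/pool0
    mkdir -p /media/pool0
    mount.sdfs pool0 /media/pool0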
I threw MKV files at the volume (via a Samba/Windows share) and the dedup ratio was quite low (about 1%). Then, from my work laptop, I copied a directory containing information about all the companies (projects) I've worked on: a mix of doc, docx, pdf, xls, xlsx, jpg, png and Visio files. Instead of consuming 3.5 GB, it took only 2.9 GB (roughly 17% savings). I copied another directory of 852 MB, and it was reduced to 730 MB (roughly 14%).
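For reference, exposing the SDFS mount point to Windows was just an ordinary Samba share, something along these lines in smb.conf (the share name, path and user are placeholders, not my exact configuration):

    [dedup]
        path = /media/pool0
        read only = no
        valid users = youruser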
There is a penalty: OpenDedup creates its own logs, hash tables and data chunk files, and those keep growing. Your data set has to be large enough that this overhead becomes negligible; if it's small, like my samples above, you won't see much net gain.
Another issue is that, in my tests, OpenDedup was wonky and the JVM occasionally crashed. It happened only rarely throughout a day of testing, but it did happen, and being Java-based it makes me uneasy and inclined to approach it with caution.
The technology looks good, but it's not useful for multimedia files (videos and images), as those are already compressed. They do tout a 95% deduplication ratio for VMware virtual machines, but that's mostly because the VMs run similar operating systems and therefore share many identical blocks.
OpenDedup performs deduplication in the background, on an interval basis, so it's not inline (live); you still need to buy the full disk capacity and then "hope" for savings. This is common to deduplication technologies in general, not just OpenDedup, and you see the same thing in enterprise storage systems such as EMC's. Their white papers mention it, along with the performance penalty, which is why deduplication isn't recommended for production volumes that are sensitive to latency.
If you're still wondering whether I found a good way to back up, the answer is SpiderOak (referral link). Alongside my OpenDedup tests, I tried SpiderOak, and it yielded roughly the same deduplication ratio (on their servers) as OpenDedup did.
SpiderOak has a Zero Knowledge policy and design, which means their systems can never see what data is stored there, whether at rest or during upload. Only devices with the client installed and the correct password can access the data.
In addition to a smart client that uploads only the differences rather than entire files, they perform deduplication on their end, so you don't have to pay for as much storage.
The last feature that made me love it and stick with it is that you pay for capacity regardless of the number of devices: an unlimited number of devices can share the purchased capacity, and you can browse the files of every backed-up device from the same user interface.