Cloud Archives for Availability and Resilience
People use the word "cloud" to mean anything, so now it means nothing. First, large portable external disk drives were called "Your own personal cloud", and then USB thumb drives got that label. Some Microsoft ads seem to mean "software" when they use the term. Let's stick to the original meaning of remote large data centers where customers store and process their data.
This page is part of the Availability cybersecurity collection.
IaaS, or Infrastructure as a Service, is where you rent virtualized servers in a remote data center. They provide the infrastructure, but you have to take responsibility for system administration.
IaaS can be very rugged, given the huge investments in physical facilities, redundant hardware, on-site power generation capacity, and redundant network connectivity. So far, major providers including Amazon, Google, and Microsoft look like they are in it for the indefinitely long run. See my cloud security page for details.
What we might call "Storage as a Service" can be similarly rugged. Some providers are simply reselling Amazon Web Services' S3, Glacier, and other storage services wrapped in their more convenient interfaces and packaging. Why not? AWS is the biggest provider in terms both of what they offer and their investment in resilient infrastructure.
Remember two useful pieces of folk wisdom:
You get what you pay for.
If it looks too good to be true, it probably is.
We can't trust free cloud storage
Some ISPs offer free storage. The service becomes popular, thus expensive, and they shut it down.
Free storage services may be terminated with very little notice. Users must extract their data and find somewhere else to store, share, and process it.
Flickr used to offer 1 TB of free storage. Then, on 1 Nov 2018 they announced that free users would be limited to 1,000 JPEG images. In 97 days, on 5 Feb 2019, Flickr would automatically delete all but the most recent 1,000 image files.
Google announced the end of Google+ in December 2018. On February 1, 2019, they sent an email: "On April 2nd, your Google+ account and any Google+ pages you created will be shut down and we will begin deleting content from consumer Google+ accounts. Photos and videos from Google+ in your Album Archive and your Google+ pages will also be deleted."
Amazon announced the end of its unlimited cloud storage plan in 2017.
Bitcasa announced an end to unlimited low-cost storage in October 2014, giving customers just about 3 weeks to pull out their data before the company deletes it.
T-Mobile dropped their free MobileLife Album picture-storing service with just 26 days' notice in June 2013. There was industry news about this, but I'm a T-Mobile customer and I only received notice from the company 26 days before the data was deleted and the service shut down. I'm glad that I wasn't using it.
Nirvanix simply shut down all operations with only two weeks' warning in September 2013. See stories in Wired and Computer Weekly.
SugarSync dropped a free storage service with two months' warning in December 2013. See stories in TechCrunch and Time.
Sometimes there's a little more warning. Microsoft provided 6 months' warning in 2017 that they were shutting down the Docs.com file-sharing site.
Companies like Wiredrive and Dropbox have easy-to-use storage services that you pay for. The prices may go up, but once you're an existing customer it seems that you're safe.
That leads to another question: For paid services, what keeps a provider from doubling their price, or maybe multiplying it by 10 or more, with no warning? Nothing!
We really can't trust free cloud storage
Myspace was the most popular social media site between 2005 and 2008 and a leading music-sharing site. In May 2019, they made an error while migrating data to new servers. Myspace lost the content uploaded to the site between 2003 and 2015, potentially 50 million songs. Several media outlets reported on the loss:
Archiving data in Google Cloud Platform
In my opinion, GCP is the best choice for archiving. That is, for backups and archives retrieved infrequently or never.
There are two tiers: Nearline and Coldline. The pricing, at least when I wrote this, was:
| ||Nearline||Coldline|
|Storage cost||$0.010 / GB-month||$0.007 / GB-month|
|Retrieval cost||$0.01 / GB||$0.05 / GB|
|Minimum duration||30 days||90 days|
You're billed for the entire minimum duration if you delete or replace an object before that time.
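To get a feel for what the Nearline and Coldline prices quoted above mean for a real archive, here is a small shell sketch. The function name archive_cost and the use of awk are my own choices for illustration, and the prices are just the ones I listed, which may well have changed.

```shell
#!/bin/sh
# Hypothetical helper: archive_cost SIZE_GB STORAGE_PRICE RETRIEVAL_PRICE
# STORAGE_PRICE is in US dollars per GB-month, RETRIEVAL_PRICE in US dollars per GB.
archive_cost() {
    awk -v gb="$1" -v store="$2" -v fetch="$3" 'BEGIN {
        printf "storage per month: $%.2f\n", gb * store
        printf "one full retrieval: $%.2f\n", gb * fetch
    }'
}

# A 500 GB archive, first at the Nearline prices, then at the Coldline prices
archive_cost 500 0.010 0.01
archive_cost 500 0.007 0.05
```

The sketch ignores the minimum storage duration, so it understates the bill if you delete or replace an object early.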
Google Coldline costs a little more than AWS Glacier (Google is cheaper than AWS for all the other storage tiers), but the interface is much better.
Google has a great interface to check your storage inventory. It's easy to add, delete, and manage storage with new folders (and even sub-folders). And, Google sees uploaded archive files as files, not the vague object blobs of AWS.
Above is what you see on the dashboard. Notice the "Hold" column. When you place a "hold" on a file, it prevents you from deleting it until you go through the process of removing the hold. That is, it prevents accidental or casual deletion.
Also notice the "Encryption" column. Yes, your data is encrypted, but not in any way that is useful or even detectable by you. Google says they are encrypting the data, and Google's explanation describes how, but we have to take their word for it. As far as we can tell, our data is always cleartext. The encryption and decryption happen between our user interface and their storage infrastructure. This encryption would help to block an inside threat at Google, or maybe the sort of attack you would only find in a Mission: Impossible movie plot. But hey, it's standard, and you don't have to do or even understand anything to use the service.
Below is the command line view.
[cromwell@desktop ~]$ gsutil ls gs://cromwell-intl-backup/
gs://cromwell-intl-backup/etc.tar.xz
gs://cromwell-intl-backup/journal-notes.tar.xz
gs://cromwell-intl-backup/web-backup-2019-01-29.tar
[cromwell@desktop ~]$ gsutil ls -l gs://cromwell-intl-backup/
   3724264  2019-01-29T18:23:58Z  gs://cromwell-intl-backup/etc.tar.xz
  27965164  2019-01-29T18:25:23Z  gs://cromwell-intl-backup/journal-notes.tar.xz
3140843520  2019-01-29T21:59:44Z  gs://cromwell-intl-backup/web-backup-2019-01-29.tar
TOTAL: 3 objects, 3172532948 bytes (2.95 GiB)
Use -L instead of -l for lots more detail.
Create a Google Cloud Platform account, and then do these two "quickstart" tutorials:
Be patient. An Internet speed test gives you a report in Mbps, megaBITS per second. Your file sizes will be reported in megaBYTES. If you're not careful, you will be disappointed to find that a transfer takes 8 times as long as you expected.
To be pedantic, it's even a little longer than that. File sizes are reported in sizes based on powers of two, not ten. One gigabyte is 1,073,741,824 bytes, not 1,000,000,000. That is, 2^30 and not 10^9. Network speeds, however, are reported in powers of ten.
A 100 GB archive will take almost 5 days to upload over a 2 Mbps link:
100 GB × 1,073,741,824 bytes/GB × 8 bits/byte = 858,993,459,200 bits
858,993,459,200 bits / 2,000,000 bits/second = 429,497 seconds
429,497 seconds / 3,600 seconds/hour = 119.3 hours = 4 days, 23 hours, 18 minutes
Doing the algebra for you: if you have an X-gigabyte archive to upload over a Y-Mbps connection, it will take about 2.386×X/Y hours.
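If you would rather have the shell do that arithmetic, here is a minimal sketch. The function name upload_hours is made up for this example, and it assumes the link runs at its full rated speed for the whole transfer, which it rarely does.

```shell
#!/bin/sh
# Hypothetical helper: upload_hours SIZE_GB LINK_MBPS
# File sizes use powers of two (1 GB = 2^30 bytes),
# link speeds use powers of ten (1 Mbps = 10^6 bits/second).
upload_hours() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN {
        bits = gb * 1073741824 * 8
        seconds = bits / (mbps * 1000000)
        printf "%.1f\n", seconds / 3600
    }'
}

upload_hours 100 2     # the 100 GB over 2 Mbps example above: 119.3
upload_hours 100 20    # the same archive over a link ten times faster
```

Treat the result as a lower bound, since protocol overhead and other traffic on the link will add to it.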
Smaller archives will be easier to handle.
I have organized my pictures into directories named,
I can easily make one archive per year. That's much easier to manage than one gigantic archive containing all my pictures since I first bought a digital camera in 2003!
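Making one archive per directory is easy to script. The following is only a sketch: the function name archive_each is hypothetical, it takes whatever directories you hand it rather than assuming any naming scheme, and it uses gzip because that is installed almost everywhere. Swap -z for -J, and .tar.gz for .tar.xz, if you prefer xz compression.

```shell
#!/bin/sh
# Hypothetical helper: make one compressed archive per directory given.
# Usage: archive_each DEST_DIR DIR [DIR ...]
archive_each() {
    dest=$(cd "$1" && pwd) || return 1   # absolute path of the destination
    shift
    for dir in "$@"; do
        name=$(basename "$dir")
        # Work from the parent directory so the archive holds relative paths
        (cd "$(dirname "$dir")" && tar -czf "$dest/$name.tar.gz" "$name") || return 1
    done
}

# Example call, with made-up paths:
#   archive_each /backups /home/me/pictures/*/
```

Each resulting file can then be uploaded, verified, and eventually deleted on its own.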
Archiving data in AWS Glacier
Amazon's Glacier storage service costs even less, just US$ 0.004 per gigabyte per month. But, its interface is awkward and slow. I'll happily pay a little more to Google and have a much better interface.
How awkward is AWS Glacier? Let's look at deleting last year's archive. The archive is a storage blob, an object, within a vault. Hopefully you gave the vault a name that tells you specifically what's in it.
You try to delete the vault, and it tells you that you can't, as it contains something. That's reasonable. That's cautious design. But maybe I am certain that I really want to delete the archive in the vault and the vault itself.
Oh, but I can't just do that. I have to tell it to delete the archive object by name. Fine, what's its name? I have to ask, and it takes 4 hours to retrieve the name of the archive object. And I have to use a tool outside the dashboard interface.
Four hours later I have its enormous name. Something like this:
Now I can delete the empty vault, right? Well, no. I have to wait at least another hour or two. Probably until some time tomorrow. Only then can I delete it.
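For reference, the sequence described above might look like this with the AWS command-line interface. This is a sketch under assumptions: the aws CLI is one possible outside tool, not necessarily the one I used, and the function names are made up. The vault name, job ID, and archive ID are whatever your own account reports.

```shell
#!/bin/sh
# Sketch of the Glacier clean-up sequence using the AWS CLI (an assumption).
# "--account-id -" means the account that owns the current credentials.

# Step 1: ask for the vault inventory. The job takes hours to complete.
# Usage: glacier_request_inventory VAULT
glacier_request_inventory() {
    aws glacier initiate-job --account-id - --vault-name "$1" \
        --job-parameters '{"Type": "inventory-retrieval"}'
}

# Step 2: once the job has completed, save the inventory (JSON) to a file.
# Usage: glacier_fetch_inventory VAULT JOB_ID OUTPUT_FILE
glacier_fetch_inventory() {
    aws glacier get-job-output --account-id - --vault-name "$1" \
        --job-id "$2" "$3"
}

# Step 3: delete one archive by the ID found in the inventory.
# The = form keeps IDs that start with "-" from being read as options.
# Usage: glacier_delete_archive VAULT ARCHIVE_ID
glacier_delete_archive() {
    aws glacier delete-archive --account-id - --vault-name "$1" \
        --archive-id="$2"
}

# Step 4: delete the vault once Glacier agrees that it is empty.
# Usage: glacier_delete_vault VAULT
glacier_delete_vault() {
    aws glacier delete-vault --account-id - --vault-name "$1"
}
```

Nothing here removes the waiting. The inventory job still takes hours, and the vault delete is still refused until Glacier's own view of the vault catches up.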
If you still want to do this, tools to interact with Glacier include:
Boto, an AWS SDK for Python
HashBackup, command-line for Linux, macOS, and BSD
CrossFTP, graphical client for Linux, macOS, and Windows
SAGU or Simple Amazon Glacier Uploader, graphical client for Linux, macOS, BSD, and Windows
mt-aws-glacier, Perl multi-threaded multi-part sync tool
Boto is useful for an initial test. Upload a small (1-3 GB) archive to make sure it works and to get an idea for the amount of time required per upload.
Do your uploading with SAGU. It opens a window that it describes as a "progress bar" but it gives you no idea of speed or amount of progress, just that it's trying to do something. Instead, watch your outbound network utilization.
I saved the SAGU Java archive file and then created this shell script in
I put the actual Access Key ID and Secret Access Key in the file, so ownership and permissions are crucial!
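#!/bin/sh
# Start SAGU in the background, appending its output to a log file
java -jar ~/bin/SimpleGlacierUploaderV0746.jar >> /tmp/sagu-output 2>&1 &
# Print the credentials for reference
cat << EOF
Access Key ID: A⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀
Secret Access Key: B⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀
EOF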
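Since that script holds real credentials, only its owner should be able to read or run it. Here is one way to lock it down, as a sketch. The function name lock_down is made up, and you pass it whatever path you saved the script under.

```shell
#!/bin/sh
# Hypothetical helper: restrict a file to its owner, then show the result.
# Usage: lock_down FILE
lock_down() {
    chmod 700 "$1" || return 1   # rwx for the owner, nothing for group or others
    ls -l "$1"                   # the mode column should now read -rwx------
}
```

Also check that the file is owned by your own account, not by root or another user.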
Expect the occasional error message like the following. It should be OK, the announced retry should work:
Jul 12, 2020 7:15:35 AM org.apache.http.impl.client.DefaultHttpClient tryExecute
INFO: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
Jul 12, 2020 7:15:35 AM org.apache.http.impl.client.DefaultHttpClient tryExecute
INFO: Retrying request
The first image is the main SAGU window. The second is the "progress bar" window, which is neither exciting nor informative.
Here is the result of signing into AWS, selecting one of my vaults, and viewing the details.
How can cloud providers claim such high availability for their cloud storage?
According to their descriptions of their infrastructure, Google and Amazon storage services store multiple copies of your data, in multiple physical locations. Typically at least three copies in at least two physical locations. The hash value of each is periodically calculated. If any one is ever found to differ from the other two, it is recreated from the presumed good pair.
Meanwhile, those stored data objects are periodically re-written onto different physical storage media. And the underlying storage devices are rotated out of service after a specified period of time.
This process of periodic rewriting onto reasonably fresh hardware and comparing cryptographic hash values for the three current copies is designed, as per Amazon, to provide an average annual durability of 99.999999999% for an archive of data.
This estimate is based on the details of their design (frequency of disk replacement, frequency of re-writing the archive copies) and the probabilities of the scenario and environment (probability of RAID array failure leading to data loss, likelihood of cryptographic hash collisions).
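The compare-and-repair idea is easy to try at toy scale. The sketch below is not how Google or Amazon implement it. It uses three local files to stand in for the three stored copies, it assumes the sha256sum command from GNU coreutils is available, and the function name repair_copies is made up.

```shell
#!/bin/sh
# Toy illustration: three local files stand in for three stored copies.
# Usage: repair_copies COPY1 COPY2 COPY3
repair_copies() {
    h1=$(sha256sum < "$1" | cut -d' ' -f1)
    h2=$(sha256sum < "$2" | cut -d' ' -f1)
    h3=$(sha256sum < "$3" | cut -d' ' -f1)
    if [ "$h1" = "$h2" ] && [ "$h2" = "$h3" ]; then
        echo "all copies agree"
    elif [ "$h1" = "$h2" ]; then
        # Copy 3 is the odd one out: recreate it from a presumed good copy
        cp "$1" "$3" && echo "repaired $3"
    elif [ "$h1" = "$h3" ]; then
        cp "$1" "$2" && echo "repaired $2"
    elif [ "$h2" = "$h3" ]; then
        cp "$2" "$1" && echo "repaired $1"
    else
        # Three different hashes: there is no majority to trust
        echo "no two copies agree" >&2
        return 1
    fi
}
```

With only three copies, the scheme survives one bad copy per check. The durability figure also depends on how often the check runs, so that a second copy is unlikely to go bad before the first is repaired.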