Cloud Archiving
Cloud Archives for Availability and Resilience
People use the word "cloud" to mean anything, so now it means nothing. First, large portable external disk drives were called "Your own personal cloud", and then USB thumb drives got that label. Some Microsoft ads seem to mean "software" when they use the term. Let's stick to the original meaning of remote large data centers where customers store and process their data.
IaaS, or Infrastructure as a Service, is where you rent virtualized servers in a remote data center. The provider supplies the infrastructure; you have to take responsibility for system administration.
IaaS can be very rugged, given the huge investments in physical facilities, redundant hardware, on-site power generation capacity, and redundant network connectivity. So far, major providers including Amazon, Google, and Microsoft look like they are in it for the indefinitely long run. See my cloud security page for details.
What we might call "Storage as a Service" can be similarly rugged. Some providers are simply reselling Amazon Web Services' S3, Glacier, and other storage services wrapped in their more convenient interfaces and packaging. Why not? AWS is the biggest provider in terms of both what they offer and their investment in resilient infrastructure.
Remember two useful pieces of folk wisdom:
You get what you pay for.
If it looks too good to be true, it probably is.
We can't trust free cloud storage
Some ISPs offer free storage. The service becomes popular, thus expensive, and they shut it down.
Free storage services may be terminated with very little notice. Users must extract their data and find somewhere else to store, share, and process it.
Flickr used to offer 1 TB of free storage. Then, on 1 Nov 2018 they announced that free users would be limited to 1,000 JPEG images. In 97 days, on 5 Feb 2019, Flickr would automatically delete all but the most recent 1,000 image files.
Google announced the end of Google+ in December 2018. On February 1, 2019, they sent an email: "On April 2nd, your Google+ account and any Google+ pages you created will be shut down and we will begin deleting content from consumer Google+ accounts. Photos and videos from Google+ in your Album Archive and your Google+ pages will also be deleted."
Amazon announced the end of its unlimited cloud storage plan in 2017.
Bitcasa announced an end to unlimited low-cost storage in October 2014, giving customers just about 3 weeks to pull out their data before the company deleted it.
T-Mobile dropped their free MobileLife Album picture-storing service with just 26 days' notice in June 2013. There was industry news about this, but I'm a T-Mobile customer and I only received notice from the company 26 days before the data was deleted and the service shut down. I'm glad that I wasn't using it.
Nirvanix simply shut down all operations with only two weeks' warning in September 2013. See stories in Wired and Computer Weekly.
SugarSync dropped a free storage service with two months' warning in December 2013. See stories in TechCrunch and Time.
Sometimes there's a little more warning. Microsoft provided 6 months' warning in 2017 that they were shutting down the Docs.com file-sharing site.
Companies like Wiredrive and Dropbox have easy-to-use storage services that you pay for. The prices may go up, but once you're an existing customer it seems that you're safe.
That leads to another question: For paid services, what keeps a provider from doubling their price, or maybe multiplying it by 10 or more, with no warning? Nothing!
We really can't trust free cloud storage
Myspace was the most popular social media site between 2005 and 2008 and a leading music-sharing site. In May 2019, they made an error while migrating data to new servers. Myspace lost the content uploaded to the site between 2003 and 2015, potentially 50 million songs. Several media outlets reported on the loss.
Archiving data in Google Cloud Platform
In my opinion, GCP is the best choice for archiving. That is, for backups and archives retrieved infrequently or never.
There are three tiers of long-term storage: Nearline, Coldline, and Archive. The pricing, at least when I wrote this, was:
Tier | Standard | Nearline | Coldline | Archive |
Storage cost | $0.020 / GB-month | $0.010 / GB-month | $0.004 / GB-month | $0.0012 / GB-month |
Retrieval cost | $0.00 / GB | $0.01 / GB | $0.02 / GB | $0.05 / GB |
Minimum duration | none | 30 days | 90 days | 365 days |
Storage costs are for the cheapest zones in North America. Retrieval cost is in addition to egress network charges.
You're billed for the entire minimum duration if you delete or replace an object before that time.
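The pricing rules above can be turned into a rough estimate. The following is a minimal sketch, not an official calculator: the prices are copied from the table as it stood when this page was written, the function names are made up for illustration, and it assumes a 30-day billing month and ignores egress network charges.

```python
# Rough cost estimate from the price table above.
# Assumptions (not from Google): a "GB-month" is treated as 30 days,
# and egress network charges are ignored.

# tier -> (storage $/GB-month, retrieval $/GB, minimum duration in days)
PRICES = {
    "Standard": (0.020, 0.00, 0),
    "Nearline": (0.010, 0.01, 30),
    "Coldline": (0.004, 0.02, 90),
    "Archive": (0.0012, 0.05, 365),
}


def storage_cost(tier, gigabytes, days_stored):
    """Storage charge in dollars, honoring the tier's minimum duration."""
    per_gb_month, _, minimum_days = PRICES[tier]
    # Deleting early does not help: you pay for the full minimum duration.
    billed_days = max(days_stored, minimum_days)
    return gigabytes * per_gb_month * billed_days / 30


def retrieval_cost(tier, gigabytes):
    """Retrieval charge in dollars, before egress network charges."""
    _, per_gb, _ = PRICES[tier]
    return gigabytes * per_gb


if __name__ == "__main__":
    for tier in PRICES:
        kept_a_year = storage_cost(tier, 100, 365)
        deleted_early = storage_cost(tier, 100, 10)
        print(f"{tier:9s} 100 GB for a year: ${kept_a_year:.2f}, "
              f"deleted after 10 days: ${deleted_early:.2f}")
```

Comparing the two columns for each tier shows how the minimum duration dominates the bill for data that is deleted or replaced soon after upload.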
Google Coldline costs a little more than AWS Glacier (Google is cheaper than AWS for all the other storage tiers), but the interface is much better.
Google has a great interface to check your storage inventory. It's easy to add, delete, and manage storage with new folders (and even sub-folders). And, Google sees uploaded archive files as files, not the vague object blobs of AWS.
Above is what you see on the dashboard. Notice the "Hold" column. When you place a "hold" on a file, it prevents you from deleting it until you go through the process of removing the hold. That is, it prevents accidental or casual deletion.
Also notice the "Encryption" column. By default, your data is encrypted automatically. However, it's done in a way that you can't detect. You or a regulator must simply accept Google's explanation. If you need to create and manage your own encryption key, that's now an option. See Google's data encryption documentation for details on your encryption options.
Below is the command line view of storage.
[cromwell@desktop ~]$ gsutil ls gs://cromwell-intl-backup/
gs://cromwell-intl-backup/bind-dhcp.tar
gs://cromwell-intl-backup/desktop-etc.tar.xz
gs://cromwell-intl-backup/journal-notes.tar.xz
gs://cromwell-intl-backup/laptop-etc.tar.xz
gs://cromwell-intl-backup/laptop-thunderbird.tar.xz
gs://cromwell-intl-backup/other.tar
gs://cromwell-intl-backup/pictures-2003.tar
gs://cromwell-intl-backup/pictures-2004.tar
gs://cromwell-intl-backup/pictures-2005.tar
gs://cromwell-intl-backup/pictures-2006.tar
gs://cromwell-intl-backup/pictures-2007.tar
gs://cromwell-intl-backup/pictures-2008.tar
[...many more lines...]
[cromwell@desktop ~]$ gsutil ls -l gs://cromwell-intl-backup/
      71680  2022-02-26T17:51:27Z  gs://cromwell-intl-backup/bind-dhcp.tar
    1349768  2022-02-26T17:51:40Z  gs://cromwell-intl-backup/desktop-etc.tar.xz
  141362588  2022-02-26T17:56:13Z  gs://cromwell-intl-backup/journal-notes.tar.xz
    1466928  2022-02-26T17:52:08Z  gs://cromwell-intl-backup/laptop-etc.tar.xz
  672714936  2022-02-26T18:15:02Z  gs://cromwell-intl-backup/laptop-thunderbird.tar.xz
27288340480  2022-02-27T11:30:45Z  gs://cromwell-intl-backup/other.tar
  119285760  2019-01-30T22:14:57Z  gs://cromwell-intl-backup/pictures-2003.tar
 2072709120  2019-01-30T23:04:11Z  gs://cromwell-intl-backup/pictures-2004.tar
 2280632320  2019-01-30T23:58:16Z  gs://cromwell-intl-backup/pictures-2005.tar
 9215580160  2019-01-31T03:37:12Z  gs://cromwell-intl-backup/pictures-2006.tar
 5336176640  2019-01-31T05:44:09Z  gs://cromwell-intl-backup/pictures-2007.tar
 5995212800  2019-01-31T08:06:30Z  gs://cromwell-intl-backup/pictures-2008.tar
[...many more lines...]
[cromwell@desktop ~]$ gsutil ls -lh gs://cromwell-intl-backup/
    70 KiB  2022-02-26T17:51:27Z  gs://cromwell-intl-backup/bind-dhcp.tar
  1.29 MiB  2022-02-26T17:51:40Z  gs://cromwell-intl-backup/desktop-etc.tar.xz
134.81 MiB  2022-02-26T17:56:13Z  gs://cromwell-intl-backup/journal-notes.tar.xz
   1.4 MiB  2022-02-26T17:52:08Z  gs://cromwell-intl-backup/laptop-etc.tar.xz
641.55 MiB  2022-02-26T18:15:02Z  gs://cromwell-intl-backup/laptop-thunderbird.tar.xz
 25.41 GiB  2022-02-27T11:30:45Z  gs://cromwell-intl-backup/other.tar
113.76 MiB  2019-01-30T22:14:57Z  gs://cromwell-intl-backup/pictures-2003.tar
  1.93 GiB  2019-01-30T23:04:11Z  gs://cromwell-intl-backup/pictures-2004.tar
  2.12 GiB  2019-01-30T23:58:16Z  gs://cromwell-intl-backup/pictures-2005.tar
  8.58 GiB  2019-01-31T03:37:12Z  gs://cromwell-intl-backup/pictures-2006.tar
  4.97 GiB  2019-01-31T05:44:09Z  gs://cromwell-intl-backup/pictures-2007.tar
  5.58 GiB  2019-01-31T08:06:30Z  gs://cromwell-intl-backup/pictures-2008.tar
[...many more lines...]
Use -L instead of -l for lots more information about each object, including its storage class, hash values, and metadata.
Create a Google Cloud Platform account, and then do these two "quickstart" tutorials:
Be patient. An Internet speed test gives you a report in Mbps, megaBITS per second. Your file sizes will be reported in megaBYTES. If you're not careful, you will be disappointed to find that a transfer takes 8 times as long as you expected.
Network speeds are reported in powers of ten. File sizes, however, might be reported in powers of two or in powers of ten. One gigabyte is 10^9 bytes, or 1,000,000,000. One gibibyte, however, is 2^30 bytes, or 1,073,741,824. You can do the math yourself, or you can ask the tool to do the math (including rounding off!) for you.
$ ls -l archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8397185024 Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=MB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8398MB Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=MiB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8009MiB Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=GB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 9GB Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=GiB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8GiB Feb 21 15:26 archive.tar.xz
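If you would rather do the math yourself, here is a small sketch of the same conversions. The function names are illustrative only. Judging from the output above, ls rounds up to the next whole unit, so the sketch does the same with integer ceiling division.

```python
# Powers-of-ten versus powers-of-two units for file sizes.
UNITS = {
    "MB": 10**6,
    "MiB": 2**20,
    "GB": 10**9,
    "GiB": 2**30,
}


def exact_size(size_bytes, unit):
    """Size as a fractional number of the given unit."""
    return size_bytes / UNITS[unit]


def rounded_up_size(size_bytes, unit):
    """Size rounded up to a whole number of units, as ls appears to report it."""
    whole_units = -(-size_bytes // UNITS[unit])  # integer ceiling division
    return f"{whole_units}{unit}"


if __name__ == "__main__":
    size = 8397185024  # archive.tar.xz from the listing above
    for unit in UNITS:
        print(f"{rounded_up_size(size, unit):>8s}  (exactly {exact_size(size, unit):.3f} {unit})")
```

The exact figures make the gap obvious: the same file is about 8.4 GB but only about 7.8 GiB.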
A 100 GiB archive will take almost 5 days to upload over a 2 Mbps link:
100 GiB × 1,073,741,824 bytes/GiB × 8 bits/byte = 858,993,459,200 bits
858,993,459,200 bits / 2,000,000 bits/second = 429,497 seconds
429,497 seconds / 3,600 seconds/hour = 119.3 hours = 4 days, 23 hours, 18 minutes
Doing the algebra for you: if you have an X-gibibyte archive to upload over a Y-Mbps connection, it will take about 2.386×X/Y hours. If X is in decimal gigabytes instead, the factor is about 2.222.
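The same arithmetic as a short sketch, in case you want to plug in your own numbers. It treats the archive size as gibibytes (2^30 bytes each), which is what the 1,073,741,824 factor above implies, and it assumes the link runs at its full rated speed with no protocol overhead, so real uploads will be slower. The function names are made up for illustration.

```python
def upload_hours(size_gib, link_mbps):
    """Hours needed to upload size_gib gibibytes over a link_mbps link."""
    bits = size_gib * 2**30 * 8                 # bytes -> bits
    seconds = bits / (link_mbps * 1_000_000)    # Mbps is powers of ten
    return seconds / 3600


def format_duration(hours):
    """Render a duration in hours as days, hours, and minutes."""
    total_minutes = round(hours * 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    whole_hours, minutes = divmod(remainder, 60)
    return f"{days} days, {whole_hours} hours, {minutes} minutes"


if __name__ == "__main__":
    hours = upload_hours(100, 2)
    print(f"{hours:.1f} hours = {format_duration(hours)}")
    # prints: 119.3 hours = 4 days, 23 hours, 18 minutes
```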
Smaller archives will be easier to handle.
I have organized my pictures into directories named, for example:
Pictures/pictures/alaska-2017-11-nov/
Pictures/pictures/japan-2017-04-apr/
Pictures/pictures/los-angeles-2017-08-aug/
Pictures/pictures/new-york-2017-03-mar/
[...]
I can easily make one archive per year. That's much easier to manage than one gigantic archive containing all my pictures since I first bought a digital camera in 2003!
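One way to build those per-year archives is sketched below. It is only an illustration, not the script I actually use: it assumes every directory name ends in -YYYY-MM-mon as in the examples above, the function names are invented, and the pictures-YYYY.tar output names simply mirror the bucket listing shown earlier.

```python
import re
import tarfile
from collections import defaultdict
from pathlib import Path

# Matches the trailing "-2017-11-nov" part of names like "alaska-2017-11-nov".
YEAR_PATTERN = re.compile(r"-(\d{4})-\d{2}-[a-z]{3}$")


def group_by_year(directory_names):
    """Map each year to the directory names that belong to it."""
    groups = defaultdict(list)
    for name in directory_names:
        match = YEAR_PATTERN.search(name)
        if match:  # names that don't follow the pattern are skipped
            groups[match.group(1)].append(name)
    return dict(groups)


def make_yearly_archives(picture_root, output_dir):
    """Write one uncompressed tar file per year; return the paths created."""
    root = Path(picture_root)
    names = sorted(p.name for p in root.iterdir() if p.is_dir())
    created = []
    for year, directories in sorted(group_by_year(names).items()):
        archive_path = Path(output_dir) / f"pictures-{year}.tar"
        with tarfile.open(archive_path, "w") as archive:
            for name in directories:
                archive.add(root / name, arcname=name)
        created.append(archive_path)
    return created
```

Calling make_yearly_archives("Pictures/pictures", "/some/scratch/dir") would then leave one tar file per year in the scratch directory, ready to upload. JPEG files are already compressed, so the sketch doesn't bother compressing the tar files.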
Archiving data in AWS Glacier
Amazon's Glacier storage service may cost slightly less for some categories and sizes. But, its interface is awkward and slow. I'll happily pay a little more to Google and have a much better interface.
How awkward is AWS Glacier? Let's look at deleting last year's archive. The archive is a storage blob, an object, within a vault. Hopefully you gave the vault a name that tells you specifically what's in it.
You try to delete the vault, and it tells you that you can't, as it contains something. That's reasonable. That's cautious design. But maybe I am certain that I really want to delete the archive in the vault and the vault itself.
Oh, but I can't just do that. I have to tell it to delete the archive object by name. Fine, what's its name? I have to ask, and it takes 4 hours to retrieve the name of the archive object. And I have to use a tool outside the dashboard interface.
Four hours later I have its enormous name. Something like this:
HuGGPbf47imi6O39u06zPUuk3urnKFPMv7TqjYpeCPgPY6dzTMi4d2D5-w4ZOwI7No0vA3HftNAa1ow05SxhiA4ffLXogAzzSmGeUTJ0otbNCS6XtRMMAlArSrNVfQjEzUFpP01X8G
So I delete the archive. Now I can delete the empty vault, right? Well, no. I have to wait at least another hour or two, probably until some time tomorrow. Only then can I delete it.
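For the curious, the sequence just described looks roughly like this with boto3, the current version of the Boto SDK. Treat it as a sketch under assumptions rather than a tested recipe: the function names and the "my-vault" vault name are hypothetical, the polling interval is arbitrary, and the functions take the Glacier client as a parameter so that nothing here touches AWS until you pass one in.

```python
import json
import time


def list_archive_ids(glacier, vault_name, poll_seconds=900):
    """Ask Glacier for a vault inventory, wait for it, return the archive IDs."""
    job = glacier.initiate_job(
        accountId="-",  # "-" means the account that owns the credentials
        vaultName=vault_name,
        jobParameters={"Type": "inventory-retrieval"},
    )
    job_id = job["jobId"]

    # This is the multi-hour wait: poll until the inventory job completes.
    while True:
        status = glacier.describe_job(
            accountId="-", vaultName=vault_name, jobId=job_id
        )
        if status["Completed"]:
            break
        time.sleep(poll_seconds)

    output = glacier.get_job_output(
        accountId="-", vaultName=vault_name, jobId=job_id
    )
    inventory = json.loads(output["body"].read())
    return [entry["ArchiveId"] for entry in inventory["ArchiveList"]]


def empty_vault(glacier, vault_name, poll_seconds=900):
    """Delete every archive in the vault; return how many were deleted."""
    archive_ids = list_archive_ids(glacier, vault_name, poll_seconds)
    for archive_id in archive_ids:
        glacier.delete_archive(
            accountId="-", vaultName=vault_name, archiveId=archive_id
        )
    return len(archive_ids)


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   empty_vault(boto3.client("glacier"), "my-vault")
# Deleting the vault itself, with glacier.delete_vault(accountId="-",
# vaultName="my-vault"), only succeeds later, once Glacier agrees it is empty.
```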
If you still want to do this, tools to interact with Glacier include:
Boto, an AWS SDK for Python
HashBackup, command-line for Linux, macOS, and BSD
CrossFTP, graphical client for Linux, macOS, and Windows
SAGU or Simple Amazon Glacier Uploader, graphical client for Linux, macOS, BSD, and Windows
mt-aws-glacier, Perl multi-threaded multi-part sync tool
Boto is useful for an initial test. Upload a small (1-3 GB) archive to make sure it works and to get an idea for the amount of time required per upload.
Do your uploading with SAGU. It opens a window that it describes as a "progress bar" but it gives you no idea of speed or amount of progress, just that it's trying to do something. Instead, watch your outbound network utilization.
I saved the SAGU Java archive file and then created this shell script in ~/bin/glacier-sagu. I put the actual Access Key ID and Secret Access Key in the file, so ownership and permissions are crucial!
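#!/bin/sh
# Start SAGU in the background, appending its output and errors to a log file.
java -jar ~/bin/SimpleGlacierUploaderV0746.jar >> /tmp/sagu-output 2>&1 &
# Print the credentials so they can be pasted into the SAGU window.
cat << EOF
Access Key ID:     A⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀
Secret Access Key: B⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀
EOF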
Expect the occasional error message like the following. It should be OK; the announced retry should work:
Jan 08, 2025 10:22:29 AM org.apache.http.impl.client.DefaultHttpClient tryExecute
INFO: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
Jan 08, 2025 10:22:29 AM org.apache.http.impl.client.DefaultHttpClient tryExecute
INFO: Retrying request
The first image is the main SAGU window. The second is the "progress bar" window, which is neither exciting nor informative.
Here is the result of signing into AWS, selecting one of my vaults, and viewing the details.
Archive Resilience
How can cloud providers claim such high availability for their cloud storage?
According to their descriptions of their infrastructure, Google and Amazon storage services store multiple copies of your data in multiple physical locations, typically at least three copies in at least two locations. The hash value of each copy is periodically calculated. If any one is ever found to differ from the other two, it is recreated from the presumed good pair.
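The compare-and-repair idea can be shown with a toy example. This is only an illustration of the principle, not how Google or Amazon actually implement it: the copies here are byte strings in memory, SHA-256 is my choice of hash, and the function names are invented.

```python
import hashlib
from collections import Counter


def sha256_hex(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas):
    """Replace any copy whose hash disagrees with the majority.

    Returns a new list of copies. Raises ValueError when no hash value
    is shared by more than half of the copies, because then there is
    no presumed good set to rebuild from.
    """
    digests = [sha256_hex(copy) for copy in replicas]
    majority_digest, votes = Counter(digests).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise ValueError("no majority among the copies; cannot repair")
    good_copy = replicas[digests.index(majority_digest)]
    return [
        copy if digest == majority_digest else good_copy
        for copy, digest in zip(replicas, digests)
    ]


if __name__ == "__main__":
    copies = [b"holiday pictures", b"holiday pixtures", b"holiday pictures"]
    print(repair_replicas(copies))
    # prints: [b'holiday pictures', b'holiday pictures', b'holiday pictures']
```

With three copies, one damaged copy is outvoted by the other two. If all three ever disagreed, nothing could be trusted, which is why the sketch raises an error in that case.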
Meanwhile, those stored data objects are periodically re-written onto different physical storage media. And the underlying storage devices are rotated out of service after a specified period of time.
This process of periodic rewriting onto reasonably fresh hardware and comparing cryptographic hash values for the three current copies is designed, as per Amazon, to provide an average annual durability of 99.999999999% for an archive of data.
This estimate is based on the details of their design (frequency of disk replacement, frequency of re-writing the archive copies) and the probabilities of the scenario and environment (probability of RAID array failure leading to data loss, likelihood of cryptographic hash collisions).
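To get a feel for what eleven nines means, here is a back-of-the-envelope sketch. It assumes, as a simplification of my own, that the figure applies independently to each stored object per year. Decimal is used because ordinary floating point cannot represent a number this close to 1 exactly.

```python
from decimal import Decimal

# 99.999999999% average annual durability, as quoted above.
ANNUAL_DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count):
    """Expected number of objects lost per year out of object_count."""
    return Decimal(object_count) * (1 - ANNUAL_DURABILITY)


def years_per_expected_loss(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    count = 10_000_000
    print(f"{count:,} objects: about one loss every "
          f"{years_per_expected_loss(count):,.0f} years")
    # prints: 10,000,000 objects: about one loss every 10,000 years
```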