Cloud Archiving

Cloud Archives for Availability and Resilience

People use the word "cloud" to mean anything, so now it means nothing. First, large portable external disk drives were called "Your own personal cloud", and then USB thumb drives got that label. Some Microsoft ads seem to mean "software" when they use the term. Let's stick to the original meaning of remote large data centers where customers store and process their data.

This page is part of the Availability cybersecurity collection

IaaS, or Infrastructure as a Service, is where you rent virtualized servers in a remote data center. The provider supplies the infrastructure; you take responsibility for system administration.

IaaS can be very rugged, given the huge investments in physical facilities, redundant hardware, on-site power generation capacity, and redundant network connectivity. So far, major providers including Amazon, Google, and Microsoft look like they are in it for the indefinitely long run. See my cloud security page for details.

What we might call "Storage as a Service" can be similarly rugged. Some providers are simply reselling Amazon Web Services' S3, Glacier, and other storage services wrapped in their more convenient interfaces and packaging. Why not? AWS is the biggest provider in terms both of what they offer and their investment in resilient infrastructure.

Remember two useful pieces of folk wisdom:

You get what you pay for.

If it looks too good to be true, it probably is.

We can't trust free cloud storage

Some ISPs offer free storage. The service becomes popular, thus expensive, and they shut it down.

Free storage services may be terminated with very little notice. Users must extract their data and find somewhere else to store, share, and process it.

Flickr used to offer 1 TB of free storage. Then, on 1 Nov 2018 they announced that free users would be limited to 1,000 JPEG images. In 97 days, on 5 Feb 2019, Flickr would automatically delete all but the most recent 1,000 image files.

Google announced the end of Google+ in December 2018. On February 1, 2019, they sent an email: "On April 2nd, your Google+ account and any Google+ pages you created will be shut down and we will begin deleting content from consumer Google+ accounts. Photos and videos from Google+ in your Album Archive and your Google+ pages will also be deleted."

Amazon announced the end of its unlimited cloud storage plan in 2017.

Bitcasa announced an end to unlimited low-cost storage in October 2014, giving customers only about 3 weeks to pull out their data before the company deleted it.

T-Mobile dropped their free MobileLife Album picture-storing service with just 26 days' notice in June 2013. There was industry news about this, but I'm a T-Mobile customer and I only received notice from the company 26 days before the data was deleted and the service shut down. I'm glad that I wasn't using it.

Nirvanix simply shut down all operations with only two weeks' warning in September 2013. See stories in Wired and Computer Weekly.

SugarSync dropped a free storage service with two months' warning in December 2013. See stories in TechCrunch and Time.

Sometimes there's a little more warning. Microsoft provided 6 months' warning in 2017 that they were shutting down the Docs.com file-sharing site.

Companies like Wiredrive and Dropbox have easy-to-use storage services that you pay for. The prices may go up, but once you're an existing customer it seems that you're safe.

That leads to another question: For paid services, what keeps a provider from doubling their price, or maybe multiplying it by 10 or more, with no warning? Nothing!

We really can't trust free cloud storage

Myspace was the most popular social media site between 2005 and 2008 and a leading music-sharing site. In May 2019, they made an error while migrating data to new servers. Myspace lost the content uploaded to the site between 2003 and 2015, potentially 50 million songs. Several media outlets reported on the loss:

CNN

The New York Times

CBS News

Ars Technica

Archiving data in Google Cloud Platform

In my opinion, GCP is the best choice for archiving. That is, for backups and archives retrieved infrequently or never.

There are three tiers of long-term storage: Nearline, Coldline, and Archive. The pricing, at least when I wrote this, was:

                  Standard            Nearline            Coldline            Archive
Storage cost      $0.020 / GB-month   $0.010 / GB-month   $0.004 / GB-month   $0.0012 / GB-month
Retrieval cost    $0.00 / GB          $0.01 / GB          $0.02 / GB          $0.05 / GB
Minimum duration  none                30 days             90 days             365 days

Storage costs are for the cheapest zones in North America. Retrieval cost is in addition to egress network charges.

You're billed for the entire minimum duration if you delete or replace an object before that time.
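
To see how the minimum duration changes the bill, here is a rough sketch that uses the storage prices from the table above. The function name storage_cost and the 30-day billing month are my own assumptions, and the result ignores retrieval and egress charges.

```shell
# Rough cost of keeping one object in a given storage class.
# Usage: storage_cost CLASS SIZE_GB MONTHS_KEPT
# Prices are the per-GB-month figures quoted above and will drift.
storage_cost() {
    case "$1" in
        standard) price=0.020;  min_days=0   ;;
        nearline) price=0.010;  min_days=30  ;;
        coldline) price=0.004;  min_days=90  ;;
        archive)  price=0.0012; min_days=365 ;;
        *) echo "unknown storage class: $1" >&2; return 1 ;;
    esac
    awk -v price="$price" -v gb="$2" -v kept="$3" -v min_days="$min_days" 'BEGIN {
        min_months = min_days / 30    # assumption: 30-day billing months
        # Early deletion is still billed for the full minimum duration.
        billed = (kept > min_months) ? kept : min_months
        printf "%.2f\n", price * gb * billed
    }'
}
```

For example, storage_cost coldline 100 1 prints 1.20: a 100 GB object deleted after one month is still billed for three months of Coldline storage.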

Google Coldline costs a little more than AWS Glacier (Google is cheaper than AWS for all the other storage tiers), but the interface is much better.

Google has a great interface to check your storage inventory. It's easy to add, delete, and manage storage with new folders (and even sub-folders). And, Google sees uploaded archive files as files, not the vague object blobs of AWS.

Google Cloud Platform storage bucket detail page

Above is what you see on the dashboard. Notice the "Hold" column. When you place a "hold" on a file, it prevents you from deleting it until you go through the process of removing the hold. That is, it prevents accidental or casual deletion.

Also notice the "Encryption" column. By default, your data is encrypted automatically. However, it's done in a way that you can't detect. You or a regulator must simply accept Google's explanation. If you need to create and manage your own encryption key, that's now an option. See Google's data encryption documentation for details on your encryption options.

Below is the command line view of storage.

[cromwell@desktop ~]$ gsutil ls gs://cromwell-intl-backup/
gs://cromwell-intl-backup/bind-dhcp.tar
gs://cromwell-intl-backup/desktop-etc.tar.xz
gs://cromwell-intl-backup/journal-notes.tar.xz
gs://cromwell-intl-backup/laptop-etc.tar.xz
gs://cromwell-intl-backup/laptop-thunderbird.tar.xz
gs://cromwell-intl-backup/other.tar
gs://cromwell-intl-backup/pictures-2003.tar
gs://cromwell-intl-backup/pictures-2004.tar
gs://cromwell-intl-backup/pictures-2005.tar
gs://cromwell-intl-backup/pictures-2006.tar
gs://cromwell-intl-backup/pictures-2007.tar
gs://cromwell-intl-backup/pictures-2008.tar
[...many more lines...]
[cromwell@desktop ~]$ gsutil ls -l gs://cromwell-intl-backup/
     71680  2022-02-26T17:51:27Z  gs://cromwell-intl-backup/bind-dhcp.tar
   1349768  2022-02-26T17:51:40Z  gs://cromwell-intl-backup/desktop-etc.tar.xz
 141362588  2022-02-26T17:56:13Z  gs://cromwell-intl-backup/journal-notes.tar.xz
   1466928  2022-02-26T17:52:08Z  gs://cromwell-intl-backup/laptop-etc.tar.xz
 672714936  2022-02-26T18:15:02Z  gs://cromwell-intl-backup/laptop-thunderbird.tar.xz
27288340480  2022-02-27T11:30:45Z  gs://cromwell-intl-backup/other.tar
 119285760  2019-01-30T22:14:57Z  gs://cromwell-intl-backup/pictures-2003.tar
2072709120  2019-01-30T23:04:11Z  gs://cromwell-intl-backup/pictures-2004.tar
2280632320  2019-01-30T23:58:16Z  gs://cromwell-intl-backup/pictures-2005.tar
9215580160  2019-01-31T03:37:12Z  gs://cromwell-intl-backup/pictures-2006.tar
5336176640  2019-01-31T05:44:09Z  gs://cromwell-intl-backup/pictures-2007.tar
5995212800  2019-01-31T08:06:30Z  gs://cromwell-intl-backup/pictures-2008.tar
[...many more lines...]
[cromwell@desktop ~]$ gsutil ls -lh gs://cromwell-intl-backup/
    70 KiB  2022-02-26T17:51:27Z  gs://cromwell-intl-backup/bind-dhcp.tar
  1.29 MiB  2022-02-26T17:51:40Z  gs://cromwell-intl-backup/desktop-etc.tar.xz
134.81 MiB  2022-02-26T17:56:13Z  gs://cromwell-intl-backup/journal-notes.tar.xz
   1.4 MiB  2022-02-26T17:52:08Z  gs://cromwell-intl-backup/laptop-etc.tar.xz
641.55 MiB  2022-02-26T18:15:02Z  gs://cromwell-intl-backup/laptop-thunderbird.tar.xz
 25.41 GiB  2022-02-27T11:30:45Z  gs://cromwell-intl-backup/other.tar
113.76 MiB  2019-01-30T22:14:57Z  gs://cromwell-intl-backup/pictures-2003.tar
  1.93 GiB  2019-01-30T23:04:11Z  gs://cromwell-intl-backup/pictures-2004.tar
  2.12 GiB  2019-01-30T23:58:16Z  gs://cromwell-intl-backup/pictures-2005.tar
  8.58 GiB  2019-01-31T03:37:12Z  gs://cromwell-intl-backup/pictures-2006.tar
  4.97 GiB  2019-01-31T05:44:09Z  gs://cromwell-intl-backup/pictures-2007.tar
  5.58 GiB  2019-01-31T08:06:30Z  gs://cromwell-intl-backup/pictures-2008.tar
[...many more lines...]

Use -L instead of -l for lots more detail about each object, including its storage class, checksums, and ACL.

Create a Google Cloud Platform account, and then do these two "quickstart" tutorials:

Google Cloud Storage Quickstart:
Using the Web Console

Google Cloud Storage Quickstart:
Using the Command Line

Be patient. An Internet speed test gives you a report in Mbps, megaBITS per second. Your file sizes will be reported in megaBYTES. If you're not careful, you will be disappointed to find that a transfer takes 8 times as long as you expected.

Network speeds are reported in powers of ten. File sizes, however, might be reported in powers of two or in powers of ten. One gigabyte is 10⁹ bytes, or 1,000,000,000. One gibibyte, however, is 2³⁰ bytes, or 1,073,741,824. You can do the math yourself, or you can ask the tool to do the math (including rounding off!) for you.

$ ls -l archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8397185024 Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=MB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8398MB Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=MiB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8009MiB Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=GB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 9GB Feb 21 15:26 archive.tar.xz
$ ls -lh --block-size=GiB archive.tar.xz
-rw-rw-r--. 1 cromwell cromwell 8GiB Feb 21 15:26 archive.tar.xz
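
Notice that ls rounds up to the next whole unit. If you would rather see the fractions, a small awk wrapper will do it. This is only a sketch, and show_units is a name I made up for illustration.

```shell
# Print a byte count in decimal (GB) and binary (GiB) units,
# keeping two decimal places instead of rounding up to a whole unit.
# Usage: show_units BYTES
show_units() {
    awk -v bytes="$1" 'BEGIN {
        printf "%.2f GB = %.2f GiB\n", bytes / 10^9, bytes / 2^30
    }'
}
```

Running show_units 8397185024 on the file above prints "8.40 GB = 7.82 GiB", which is why ls reported 9GB and 8GiB.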

A 100 GiB archive will take almost 5 days to upload over a 2 Mbps link:

100 GiB × 1,073,741,824 bytes/GiB × 8 bits/byte = 858,993,459,200 bits

858,993,459,200 bits ÷ 2,000,000 bits/second = 429,497 seconds

429,497 seconds ÷ 3,600 seconds/hour = 119.3 hours = 4 days, 23 hours, 18 minutes

Doing the algebra for you: if you have an X-gibibyte archive to upload over a Y-Mbps connection, it will take about 2.386×X/Y hours.
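
The same arithmetic fits in a short shell function. This is a sketch under two assumptions: the size is in binary gigabytes (2^30 bytes), as in the calculation above, and the link runs at its full rated speed with no protocol overhead. The name upload_hours is mine.

```shell
# Estimate upload time in hours.
# Usage: upload_hours SIZE_GIB LINK_MBPS
upload_hours() {
    awk -v gib="$1" -v mbps="$2" 'BEGIN {
        bits = gib * 2^30 * 8            # file sizes: powers of two
        seconds = bits / (mbps * 10^6)   # link speeds: powers of ten
        printf "%.1f\n", seconds / 3600
    }'
}
```

upload_hours 100 2 prints 119.3, matching the worked example. Real uploads will be somewhat slower.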

Smaller archives will be easier to handle. I have organized my pictures into directories named, for example:

Pictures/pictures/alaska-2017-11-nov/
Pictures/pictures/japan-2017-04-apr/
Pictures/pictures/los-angeles-2017-08-aug/
Pictures/pictures/new-york-2017-03-mar/
[...]

I can easily make one archive per year. That's much easier to manage than one gigantic archive containing all my pictures since I first bought a digital camera in 2003!
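
Here is one way the per-year archives could be built. It is a sketch, not the exact commands I ran: the function name archive_by_year is hypothetical, and it assumes every directory follows the place-YYYY-MM-mon pattern with no whitespace in the names.

```shell
# Create one tar archive per year from directories named like
# alaska-2017-11-nov. Usage: archive_by_year SRC_DIR DEST_DIR
# Assumes directory names contain no whitespace.
archive_by_year() {
    src=$1
    dest=$(cd "$2" && pwd) || return 1    # absolute path, since we cd below

    # Pull the four-digit year out of each matching directory name.
    years=$(for d in "$src"/*-[0-9][0-9][0-9][0-9]-[0-9][0-9]-*/; do
                [ -d "$d" ] && basename "$d"
            done | sed -n 's/.*-\([0-9]\{4\}\)-[0-9]\{2\}-[^-]*$/\1/p' | sort -u)

    for year in $years; do
        # Run tar from inside SRC_DIR so archive paths are relative.
        ( cd "$src" && tar -cf "$dest/pictures-$year.tar" ./*-"$year"-[0-9][0-9]-* )
    done
}
```

Something like archive_by_year ~/Pictures/pictures /tmp/archives would then leave pictures-2017.tar and its siblings in /tmp/archives, ready to upload one at a time.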

Archiving data in AWS Glacier

Amazon's Glacier storage service may cost slightly less for some categories and sizes. But, its interface is awkward and slow. I'll happily pay a little more to Google and have a much better interface.

How awkward is AWS Glacier? Let's look at deleting last year's archive. The archive is a storage blob, an object, within a vault. Hopefully you gave the vault a name that tells you specifically what's in it.

You try to delete the vault, and it tells you that you can't, as it contains something. That's reasonable. That's cautious design. But maybe I am certain that I really want to delete the archive in the vault and the vault itself.

Oh, but I can't just do that. I have to tell it to delete the archive object by name. Fine, what's its name? I have to ask, and it takes 4 hours to retrieve the name of the archive object. And, I have to use a tool outside the dashboard interface.

Four hours later I have its enormous name. Something like this:

HuGGPbf47imi6O39u06zPUuk3urnKFPMv7TqjYpeCPgPY6dzTMi4d2D5-w4ZOwI7No0vA3HftNAa1ow05SxhiA4ffLXogAzzSmGeUTJ0otbNCS6XtRMMAlArSrNVfQjEzUFpP01X8G

Now I can delete the empty vault, right? Well, no. I have to wait at least another hour or two. Probably until some time tomorrow. Only then can I delete it.

If you still want to do this, tools to interact with Glacier include:

Boto, an AWS SDK for Python

HashBackup, command-line for Linux, macOS, and BSD

CrossFTP, graphical client for Linux, macOS, and Windows

SAGU or Simple Amazon Glacier Uploader, graphical client for Linux, macOS, BSD, and Windows

mt-aws-glacier, Perl multi-threaded multi-part sync tool

Boto is useful for an initial test. Upload a small (1-3 GB) archive to make sure it works and to get an idea for the amount of time required per upload.

Do your uploading with SAGU. It opens a window that it describes as a "progress bar" but it gives you no idea of speed or amount of progress, just that it's trying to do something. Instead, watch your outbound network utilization.

I saved the SAGU Java archive file and then created this shell script in ~/bin/glacier-sagu. I put the actual Access Key ID and Secret Access Key in the file, so ownership and permissions are crucial!

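#!/bin/sh

# Start the SAGU GUI in the background, appending its output to a log file.
java -jar ~/bin/SimpleGlacierUploaderV0746.jar >> /tmp/sagu-output 2>&1 &

# Print the credentials to the terminal.
cat << EOF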
Access Key ID: A⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀
Secret Access Key: B⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀⌀
EOF
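
Because that script holds real credentials, it's worth checking that nobody else can read it. The helper below is a minimal sketch. The name is_private is my own invention, and it simply inspects the mode string that ls prints.

```shell
# Succeed only if FILE grants no permissions to group or other.
# Usage: is_private FILE
is_private() {
    # In "ls -ld" output, characters 5-10 of the mode string are the
    # group and other permission bits.
    [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Hypothetical usage with the script path mentioned above:
#   chmod 700 ~/bin/glacier-sagu
#   is_private ~/bin/glacier-sagu || echo "others can read the keys!" >&2
```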

Expect the occasional error message like the following. It should be OK; the announced retry should work:

Aug 10, 2025 9:57:23 AM org.apache.http.impl.client.DefaultHttpClient tryExecute
INFO: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
Aug 10, 2025 9:57:23 AM org.apache.http.impl.client.DefaultHttpClient tryExecute
INFO: Retrying request

SAGU or Simple Amazon Glacier Uploader main window.

The first image is the main SAGU window. The second is the "progress bar" window, which is neither exciting nor informative.

SAGU or Simple Amazon Glacier Uploader 'progress bar' window.

Here is the result of signing into AWS, selecting one of my vaults, and viewing the details.

Archive Resilience

How can cloud providers claim such high availability for their cloud storage?

According to their descriptions of their infrastructure, Google and Amazon storage services store multiple copies of your data, in multiple physical locations. Typically at least three copies in at least two physical locations. The hash value of each is periodically calculated. If any one is ever found to differ from the other two, it is recreated from the presumed good pair.
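
You can imitate the compare-and-repair idea on your own disks. The sketch below is only an illustration of majority voting over three local copies, not how Google or Amazon actually implement it. It assumes the GNU coreutils sha256sum command is available, and the name repair_copies is mine.

```shell
# Compare three copies of a file by SHA-256 hash. If exactly one
# differs, overwrite it with a copy from the agreeing pair.
# Usage: repair_copies COPY_A COPY_B COPY_C
repair_copies() {
    ha=$(sha256sum < "$1")
    hb=$(sha256sum < "$2")
    hc=$(sha256sum < "$3")
    if [ "$ha" = "$hb" ] && [ "$hb" = "$hc" ]; then
        echo "all three copies agree"
    elif [ "$ha" = "$hb" ]; then
        cp "$1" "$3" && echo "rebuilt $3"
    elif [ "$ha" = "$hc" ]; then
        cp "$1" "$2" && echo "rebuilt $2"
    elif [ "$hb" = "$hc" ]; then
        cp "$2" "$1" && echo "rebuilt $1"
    else
        # No majority: nothing can be trusted as the good copy.
        echo "no two copies agree" >&2
        return 1
    fi
}
```

The important case is the last one: with only hashes to go on, a repair is possible only while at least two copies still match.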

Meanwhile, those stored data objects are periodically re-written onto different physical storage media. And the underlying storage devices are rotated out of service after a specified period of time.

This process of periodic rewriting onto reasonably fresh hardware and comparing cryptographic hash values for the three current copies is designed, as per Amazon, to provide an average annual durability of 99.999999999% for an archive of data.

This estimate is based on the details of their design (frequency of disk replacement, frequency of re-writing the archive copies) and the probabilities of the scenario and environment (probability of RAID array failure leading to data loss, likelihood of cryptographic hash collisions).
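
To get a feel for what eleven nines means, here is a back-of-the-envelope sketch. It assumes the durability figure applies independently to each stored object, so the annual loss probability per object is 1e-11. That reading and the name expected_losses are my assumptions.

```shell
# Expected number of objects lost per year, given an object count and
# an annual per-object loss probability (1e-11 for "eleven nines").
# Usage: expected_losses OBJECT_COUNT [LOSS_PROBABILITY]
expected_losses() {
    awk -v n="$1" -v p="${2:-1e-11}" 'BEGIN {
        lost = n * p
        printf "%g objects/year, one loss every %.0f years\n", lost, 1 / lost
    }'
}
```

With ten million objects, expected_losses 10000000 prints "0.0001 objects/year, one loss every 10000 years". That is a design target, not a guarantee.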
