Amazon Web Services (AWS) S3 is a popular, highly scalable object storage service. It's used by a lot of big companies, including the one I work for.
Getting data — especially large files — into S3 uses a mechanism called Multipart Uploads. For example, to upload a multi-gigabyte file to S3, you might make a sequence of calls like so:
- CreateMultipartUpload
- UploadPart (1 .. n times)
- CompleteMultipartUpload
On the “complete” call, S3 assembles your parts into a single object, which then appears in the bucket. Or you can call “AbortMultipartUpload” to abandon the upload and throw the parts away.
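To make that concrete, here's a minimal sketch of the sequence using the Node.js aws-sdk. The bucket and key names are placeholders, credentials and region are assumed to come from the environment, and real code would upload many parts in a loop rather than just one:

```javascript
var AWS = require('aws-sdk');
var s3 = new AWS.S3(); // credentials and region from the environment

// 1. CreateMultipartUpload: S3 hands back an UploadId for the new upload.
s3.createMultipartUpload({ Bucket: 'my-bucket-name', Key: 'bigfile.mpg' }, function (err, created) {
  if (err) throw err;

  // 2. UploadPart: send each chunk, quoting the UploadId and a part number.
  // (Real code loops, sending parts of at least 5 MB each; one part here.)
  s3.uploadPart({
    Bucket: 'my-bucket-name',
    Key: 'bigfile.mpg',
    UploadId: created.UploadId,
    PartNumber: 1,
    Body: 'the contents of part one'
  }, function (err, part) {
    if (err) throw err;

    // 3. CompleteMultipartUpload: hand back the ETags and S3 assembles the
    // parts into one object. (Or call abortMultipartUpload to discard them.)
    s3.completeMultipartUpload({
      Bucket: 'my-bucket-name',
      Key: 'bigfile.mpg',
      UploadId: created.UploadId,
      MultipartUpload: { Parts: [{ ETag: part.ETag, PartNumber: 1 }] }
    }, function (err) {
      if (err) throw err;
      console.log('upload complete');
    });
  });
});
```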
So what's the catch?
The catch is that it's very easy to forget to ever call either CompleteMultipartUpload or AbortMultipartUpload. And if you neither complete nor abort the upload, then any parts you have uploaded just sit around in S3, waiting. Forever. It's relatively hard to see those parts, mind — they don't show up in the regular bucket listing. But they are there, and they are costing you money.
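They do show up if you ask for them specifically, though. For instance, with the aws-sdk (a quick sketch; a thorough scan would also follow the pagination markers):

```javascript
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// ListMultipartUploads reports in-progress uploads that a regular
// ListObjects call never shows.
s3.listMultipartUploads({ Bucket: 'my-bucket-name' }, function (err, data) {
  if (err) throw err;
  data.Uploads.forEach(function (upload) {
    console.log(upload.Key, upload.UploadId, upload.Initiated);
  });
});
```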
So what's the solution?
Enter s3-upload-cleaner. Simply put, it scans your buckets looking for stale (that is, started a long time ago) incomplete multipart uploads and aborts them, the premise being that if you haven't completed an upload after, say, a week, then you never will. Thus, periodically running s3-upload-cleaner keeps your account's multipart uploads under control, and helps keep your bill down.
(I'm a little surprised that this isn't a native feature of S3, and to be honest, I expect that one day, it will be.)
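Under the hood, the essence of the job is no more than this (a simplified sketch with the plain aws-sdk, not the package's actual code; the real thing also handles pagination, multiple buckets and regions, dry runs, and structured logging):

```javascript
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

var ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;
var cutoff = new Date(Date.now() - ONE_WEEK_MS);

s3.listMultipartUploads({ Bucket: 'my-bucket-name' }, function (err, data) {
  if (err) throw err;
  data.Uploads.forEach(function (upload) {
    // Anything initiated before the cutoff counts as stale: abort it.
    if (upload.Initiated < cutoff) {
      s3.abortMultipartUpload({
        Bucket: 'my-bucket-name',
        Key: upload.Key,
        UploadId: upload.UploadId
      }, function (err) {
        if (err) throw err;
        console.log('aborted stale upload of', upload.Key);
      });
    }
  });
});
```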
Here it is running for a single bucket, and finding nothing to clean:
```
$ sudo apt-get install nodejs npm
$ npm install s3-upload-cleaner aws-sdk
$ export AWS_ACCESS_KEY_ID=…
$ export AWS_SECRET_ACCESS_KEY=…
$ nodejs ./node_modules/s3-upload-cleaner/example/minimal.js
Running cleaner
Clean bucket my-bucket-name
Bucket my-bucket-name is in location eu-west-1
Bucket my-bucket-name is in region eu-west-1
Running cleaner for bucket my-bucket-name
$
```
The code comes with a minimal bootstrap script, though you're welcome to use your own instead.
To call out a few of its features:
- it's multi-region aware (it will attempt to process all of your buckets, no matter what region they are in);
- it can be configured to process only some buckets, or only some regions, or only some keys;
- the threshold for what counts as “stale” is configurable — the minimal bootstrap script uses 1 week as the cutoff age;
- when a stale upload is found, it emits logging data in JSON form;
- it can be run in “dry run” mode, where all the scanning and logging is performed, but the abort itself is not.
Finally, here's an example of one of its log entries:
[ { "event_name": "s3uploadcleaner.clean", "event_timestamp": "1448495889.529", "bucket_name": "my-bucket-name", "upload_key": "bigfile.mpg", "upload_initiated": "1447888220000", "upload_storage_class": "STANDARD", "upload_initiator_id": "arn:aws:iam::123456789012:user/SomeUser", "upload_initiator_display": "SomeUser", "part_count": "135", "total_size": "2831189760", "dry_run": "true" } ]
s3-upload-cleaner typically takes only a few seconds to run, and doesn't need to run very often, which makes it perfect for running as a scheduled AWS Lambda function.
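For instance, a handler might look something like this. (Hypothetical: runCleaner stands in for whatever bootstrap you write, perhaps adapted from example/minimal.js; it isn't a name the package itself exports.)

```javascript
// Hypothetical wrapper: runCleaner is your own bootstrap function
// (e.g. adapted from example/minimal.js), taking a completion callback.
var runCleaner = require('./my-cleaner-bootstrap');

exports.handler = function (event, context) {
  runCleaner(function (err) {
    if (err) {
      context.fail(err);
    } else {
      context.succeed('cleaning run finished');
    }
  });
};
```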
You can find the code on GitHub and the package on npm.