Best strategy for downloading multiple files from S3
I'm working on a Rails app where users can create a "project". A Project has many datafiles.
Users upload multiple files directly to Amazon S3 (I'm using CarrierWave).
I'd like users to have the ability to download a Project's datafiles as a single zip file. I'm trying to figure out the best strategy to implement this feature. Here are the ideas I've come up with so far:
Strategy 1:
Rails creates a zip file and streams it to the user (could use something like rubyzip; a rough sketch follows below).
- Pros: Should be relatively simple to implement.
- Cons: I think the files would need to hit my server first (not actually 100% sure on this), which could hurt performance with big files and lead to a poor user experience.
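For the record, here's roughly how I picture Strategy 1 with rubyzip; this is only a sketch (it buffers everything in memory, so it's really only for small files, and fileupload is just what I'm calling my CarrierWave mount):

require "zip"

# Build an in-memory zip of a project's datafiles; each file is pulled
# down from S3 through the app server via its CarrierWave uploader.
def build_zip(datafiles)
  buffer = Zip::OutputStream.write_buffer do |zip|
    datafiles.each do |datafile|
      zip.put_next_entry(datafile.fileupload.file.filename)
      zip.write(datafile.fileupload.read)
    end
  end
  buffer.string
end

# Then in the controller action:
#   send_data build_zip(@project.datafiles),
#             filename: "#{@project.title.parameterize}.zip",
#             type: "application/zip"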
Strategy 2:
A background job re-downloads the files to my server, creates a zip, and re-uploads it to S3 (sketched below). Users will then be able to download the zip directly from S3 if it exists.
- Pros: Eliminates the need to create the zip file on the fly. Users can pull directly from S3.
- Cons: Any change to the files means the zip needs to be deleted and recreated. Still need a way to serve the files if the zip copy doesn't exist on S3 yet.
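Roughly what I have in mind for the Strategy 2 job (the datafile attributes, bucket name, and zips/ key prefix are all assumptions):

require "tempfile"
require "zip"
require "aws-sdk-s3"

class ProjectZipJob < ApplicationJob
  queue_as :default

  BUCKET = "my-app-uploads" # assumed bucket name

  def perform(project)
    s3 = Aws::S3::Resource.new

    Tempfile.create(["project-#{project.id}-", ".zip"]) do |archive|
      # Re-download each datafile from S3 and write it into a local zip.
      Zip::OutputStream.open(archive.path) do |zip|
        project.datafiles.each do |datafile|
          zip.put_next_entry(datafile.filename)
          zip.write(s3.bucket(BUCKET).object(datafile.s3_key).get.body.read)
        end
      end
      # Re-upload the finished archive so downloads come straight from S3.
      s3.bucket(BUCKET).object("zips/project-#{project.id}.zip").upload_file(archive.path)
    end
  end
end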
Strategy 3:
Post-processing on the AWS side to create a zip whenever the bucket changes.
- Pros: Seems like a clean way to do it.
- Cons: No idea how to actually implement this, and the AWS documentation is pretty poor. Would also need a way for AWS to inform my app after processing has completed (something similar to a Stripe webhook? A rough guess at that wiring is sketched below).
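From what I can piece together, the wiring would be an S3 event notification triggering a Lambda that calls back into my app, Stripe-webhook style. A very rough sketch of the Lambda side (Ruby runtime; the endpoint and payload shape are pure guesses):

require "json"
require "net/http"

# Hypothetical endpoint in my app that receives the notification.
APP_WEBHOOK = URI("https://example.com/aws/notifications")

# Lambda entry point, fired by an S3 event notification on the bucket.
def handler(event:, context:)
  event["Records"].each do |record|
    payload = {
      bucket: record.dig("s3", "bucket", "name"),
      key:    record.dig("s3", "object", "key")
    }
    Net::HTTP.post(APP_WEBHOOK, payload.to_json, "Content-Type" => "application/json")
  end
  { statusCode: 200 }
end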
Strategy 4:
Limit users to a single file upload per project, forcing the end users to zip their files.
- Pros: Simplest to implement
- Cons: Worst user experience.
I'd love to hear some input on how you'd go about this, or any resources from someone who has done something similar.
This is a good challenge, Justin! I've thought about this in the past but never needed to implement it.
A lot of this probably will come down to your specific use case, but here are some thoughts:
- You could probably mix and match: use Strategy #1 for, say, downloads under 100MB, and a different process for anything bigger (see the sketch after this list).
- I don't feel like it would make sense to automatically create zip files whenever a bucket changes.
- I know that Dropbox will kick off a download as a background job and email you a link when it's finished. This could be a good approach: you could dynamically spin up an EC2 server to run the download and zip, then kill the server afterwards. That's mostly useful for larger downloads that might need a lot of disk space or RAM for compressing. You might be able to use AWS Lambda instead of a full EC2 server, but it depends on the resources Lambda allows you.
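To make that mix-and-match idea concrete, the dispatch could be as simple as this (the 100MB threshold, method names, and stored file_size column are all placeholders):

SIZE_THRESHOLD = 100.megabytes

# Stream small downloads inline (Strategy 1), push big ones to a
# background job that emails a link when the zip is ready.
def deliver_download(project)
  total = project.datafiles.sum(&:file_size)

  if total < SIZE_THRESHOLD
    stream_zip(project)                  # e.g. rubyzip/zipline in the request
  else
    ProjectZipJob.perform_later(project) # Strategy 2-style background job
  end
end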
Dropbox takes the background-process approach, while Slack takes the approach of not zipping files at all and just serving them one at a time.
It's hard (read: impossible) to come up with a single solution without knowing more details: average file sizes, file types, the mix of upload vs. download requests, and so on. If people are uploading small documents, that's one thing. If they're uploading videos, that's another.
If you don't know the average usage before you get into this, then I would strongly encourage you to take the simplest approach first (rubyzip) and implement that. You could run it all on the webserver at first and then move it to a dedicated server for processing if it starts taxing the webserver. Before executing a job you can measure the free disk space and estimate what you'll need for the zip; that could tell you whether to run it locally or on a dedicated EC2 instance for a few minutes. Once you get a user who breaks the capabilities of one solution, you can implement a more complex one and scale it out as you gather more usage measurements.
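The disk-space check could look something like this (the sys-filesystem gem and the 1.2 overhead factor are assumptions):

require "tmpdir"
require "sys/filesystem" # sys-filesystem gem

# True if the temp volume can hold the estimated archive plus scratch space.
def enough_local_disk?(datafiles)
  estimated = datafiles.sum(&:file_size) * 1.2 # rough zip + scratch overhead
  Sys::Filesystem.stat(Dir.tmpdir).bytes_available > estimated
end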
This will also come down to keeping an eye on what percentage of requests are uploads vs. downloads. If you have lots of zip downloads, re-creating the zip on every change might make sense for download speed. If it's mostly uploads and rarely downloads, you can build the zip as needed instead and maybe cache it until a change is made to the bucket.
It'll be super interesting to see, but the nice part is that as long as you store metadata for all the files in your database, you can get a quick read on the average file size, largest file, and so on to help you analyze which approach will be best.
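For example, with the metadata in the database those numbers are each one query away (assuming a datafiles table with byte_size and content_type columns):

Datafile.average(:byte_size)        # average file size
Datafile.maximum(:byte_size)        # largest single file
Datafile.group(:content_type).count # mix of file types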
Do you have an idea of what usage you're expecting?
[Edit: I accidentally wrote a mini-novel 🤓]
Chris, thanks for the very detailed response.
Honestly, I'm not 100% sure what my use case will be just yet, though I expect it'll be mostly small files (PDFs, SketchUp files, CAD/CAM files, images, etc.). So I think I may be in the realm of best being the enemy of good enough. With that in mind, I'll likely stick with option 1 for now, then monitor performance and make adjustments as needed.
One thing that surprised me somewhat while researching this is that there isn't a whole lot out there on the topic, so perhaps it's not really a big deal.
Thanks for the advice!
Looking forward to hearing about your solution and how it goes once you get it implemented. There really isn't much information on this out there and definitely should be more!
Just to close the loop on this: I'm using the zipline gem to stream dynamically generated zip files to the user. Files are uploaded directly to S3 via Shrine. The downloads controller either streams a zip when a project has multiple attachments, or serves up the single file otherwise. Seems to be working well so far.
downloads_controller.rb:
...
include ActionController::Streaming
include Zipline

def new
  attachments = @project.attachments.all
  if attachments.count > 1
    # Multiple attachments: stream a dynamically generated zip via zipline.
    files = attachments.map { |file| [file.fileupload, file.fileupload.original_filename] }
    name = @project.title.parameterize
    zipline(files, "#{name}.zip")
  elsif attachments.count == 1
    # Single attachment: skip zipping and redirect straight to the file's S3 URL.
    file = attachments.first
    redirect_to file.fileupload.url(download: true)
  else
    flash[:alert] = "There are no files available to download"
    redirect_to user_project_path
  end
end
...
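The route side is just a nested resource, roughly like this (simplified; the nesting mirrors user_project_path above):

# config/routes.rb
resources :users do
  resources :projects do
    resources :downloads, only: [:new]
  end
end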
Sweet, I haven't used zipline myself but that sounds awesome.
Are you primarily working with small file sizes? I would imagine this could get problematic if someone tried to zip a gigabyte or something, but hopefully that's not an issue you have to worry about yet.
Yes, mostly small files at the moment. At some point it'll probably make sense to have a background job create a zipped version and store it on S3, but that extra bit of complexity is on the back burner for now.
I just stumbled on this thread, and I've gotta say, zipline is fantastic!
I was previously downloading all my files, zipping them and sending them back to S3. This is so much more efficient.
In my case, approx 100 photos with a zipped size of ~200MB.