There's no denying that Amazon AWS's S3 product is pretty fantastic and very flexible. It's especially interesting when you leverage S3 Select to query your CSV, JSON, or Parquet files.
If you're not familiar, Amazon S3 Select allows you to query your files for data using a simplified SQL syntax. It is SQL syntax but simplified because it only supports a fairly small subset of the SQL language. But that's ok, there's still a LOT that can be done:
- Count queries, to find the number of matching results or number of lines in a file.
- Simple WHERE clauses, to grab a subset of data, to populate a chart on a website for instance.
- Pagination... sort of. This is what this article is about.
Suppose you have a costly query that takes your primary data source minutes or more to run, you might want to cache those results so that it's less painful to use elsewhere. This might be a good case for something like Mongo, DynamoDB, etc. but S3 is also a perfectly viable option.
Say you output a report as a CSV, and store it in S3 to later be displayed as tabular data on a website. That's pretty straightforward. But suppose the file contains hundreds, thousands, or millions of rows. You surely wouldn't want to try to display a file that large in the browser - any sane person would want to paginate that data
S3 Select provides us with the LIMIT clause to specify how many rows we want to select, which gets us halfway there. Anyone who's manually implemented pagination queries knows that it's a function of LIMIT and OFFSET clauses to grab a window of desired size from somewhere within the overall result set.
Unfortunately, S3 Select does not offer the OFFSET clause. ☹️
As it turns out though, the solution isn't too terrible. As long as you're able to inject a row counter or serialized id into your report data when you store it in S3 for later consumption, you can easily replace the OFFSET clause with a WHERE clause:
SELECT * FROM s3Object LIMIT 5 OFFSET 10
(which does not work)
is functionally the same as
SELECT * FROM s3Object WHERE row > 10 LIMIT 5
(which does work)
I've recently implemented this solution for reporting activities in one of the apps I'm responsible for at my day job because certain reporting was taking so long to run that things were timing out for folks, so we've instead moved to generate those asynchronously, and firing a callback when the report completes.
This allows reports that previously wouldn't finish to not only finish but for those that did but were still slow to perform more efficiently when flipping through pages, as otherwise, previously the on-demand requests were run for every pagination page-turn. So if page 1 took say 30 seconds to respond, so did page 2, etc. However, with the asynchronous generation the whole report might take a minute or two to run, but then pagination is speedy as it's just fetching a subset of already prepared rows from a file on S3.
It probably would still be even more performant to populate the report data into a database, but since people download the CSV file, we'd end up generating it or streaming the data in that format at some point anyway so the little bit of performance we give up is an acceptable trade-off.
You could also probably accomplish this with the
ScanRange parameter, providing start and end bytes to define a range of the file to scan. But given that it's based on byte sizes rather than row counts, I'm not sure how you could reliably paginate equal-sized pages of data this way. That would probably be better for processing chunks of data, rather than for display.
Anyway, now you know how to work around the lack of the OFFSET clause with S3 Select. Happy coding!