The Offline Catalog I Argued Against, and Built Anyway

A week ago I argued against bundling Open Food Facts. PR #371 ships an opt-in offline catalog. What changed, and what didn't, between the two decisions.

The first time I watched tofu apply come back green against the live Cloudflare side of the catalog, late at my own desk with the kitchen quiet behind me and a cup of tea I had forgotten about going cold next to the keyboard, what struck me was that I was not supposed to be building this at all. A week earlier I had argued, in some detail, that bundling a subset of Open Food Facts directly was a worse design than letting a small cache fill itself with what users actually look up. PR #371 landed seven days after that post, and the version of it that shipped is, in spirit, the bundle approach I argued against, with enough things changed underneath that I have come to think the new design is the right one.

The feature is an opt-in offline catalog: a downloadable sqlite database of OFF products that lives on the device and resolves searches and barcode scans locally, with the live API as a fallback when the network is willing. The wizard, the chunked install, the build pipeline, the Cloudflare side, and the OpenTofu graph that ties it all together took most of a fortnight of evenings to get right, and there is a fair amount in the design I want to walk through. Before any of that, though, the change of mind, because writing about the new design without revisiting the old one would feel like skipping the part of the story that actually matters.

What I argued a week ago

Last week I wrote about the network resilience work in OpenNutriTracker, and one of the things I said in that post, plainly and at some length, was that I did not think the right shape for offline coverage was to bundle a subset of Open Food Facts directly into the app. The proposal in issue #319 suggested that around 230 MB of bundled data could solve the offline problem in one stroke. I argued the inverse: a small cache that fills itself with exactly the items the person using the app actually looks up, growing from zero to maybe three megabytes over six months, against the 230 MB the bundle approach needed before you factored in hosting and weekly updates.

I want to start this post by saying that the May 4 reasoning was right, and the argument I made there still holds for the version of the question I was asking at the time. Hosting an updated 230 MB blob somewhere reliable, on someone’s project budget, on a cadence that kept it fresh against the upstream OFF database, and shipping it to every install whether the user needed offline coverage or not, is genuinely a worse design than a small cache that grows in proportion to use. I was not wrong about any of those constraints. I was, however, asking the question inside a specific shape, and the shape changed when I let myself look at it from a different angle.

The May 4 argument was for a default that everyone would pay for. PR #371 ships a feature that nobody pays for unless they specifically want it, with the cost story landing at zero on the project’s side, and an upper coverage tier that goes much further than the original proposal ever planned to. Both posts can be honest at once. I want to walk through what changed.

The wizard and the variant matrix

The first shift was the recognition that “bundle a subset” is not a single decision. It is a family of decisions, and the family has axes, and the axes are choices.

The OFF dump has roughly three million product entries in its current form, and most of any subset you choose from it comes down to deciding what you are willing to filter out. Three axes turned out to do most of the useful work. Well-scanned only, written s in the variant code, drops products with fewer than two unique scans when set to 1, which filters out one-off submissions and abandoned edits. Nutrition grade, written n, keeps only rows whose Nutri-Score parses as a to e when set to 1, guaranteeing parseable nutrition data on every row. Recency, written r, drops rows whose last_modified_t is older than the cutoff in years (3, 5, or 10), or keeps everything for any.

Two by two by four is sixteen, and each combination gets its own catalog. The recommended default is s1_n1_r5, which gives roughly 351,000 well-scanned products with full nutrition data modified in the last five years, at about 73 MiB compressed. That is not 230 MB. It is roughly a third of the original proposal’s footprint while covering the rows most people would actually be searching for in everyday use, and it installs in ten to twenty seconds on Wi-Fi.
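The matrix is small enough to enumerate. The sketch below is illustrative only: the code format is inferred from the two variant names mentioned here (s1_n1_r5 and s0_n0_rany), and the constant and function names are hypothetical rather than taken from the build pipeline.

```python
from itertools import product

# Axis values as described above. The s{..}_n{..}_r{..} layout is inferred
# from the two variant names in the post, not read from the pipeline.
WELL_SCANNED = (0, 1)                # s: 1 drops products with fewer than two unique scans
NUTRITION_GRADE = (0, 1)             # n: 1 keeps only rows with a Nutri-Score of a to e
RECENCY = ("3", "5", "10", "any")    # r: cutoff in years, or no cutoff at all


def variant_codes():
    """Return one variant code per combination of the three axes."""
    return [
        f"s{s}_n{n}_r{r}"
        for s, n, r in product(WELL_SCANNED, NUTRITION_GRADE, RECENCY)
    ]


codes = variant_codes()
print(len(codes))  # number of catalogs the build would have to produce
```

Both the recommended default and the largest tier appear in that list, which is a cheap sanity check that the code format and the axes agree.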

The largest tier, s0_n0_rany, ships all 3.2 million products at about 520 MiB compressed and 3.7 GB unpacked. The wizard surfaces that figure on the estimate page before the install begins, so anyone choosing the largest tier sees “this will use 3.7 GB” and gets to back out before the bytes start flowing rather than discover it mid-install. Most people, by design, will pick the recommended default and never look at the estimate page again.

The opt-in property is what does the heavy lifting on the framing. The default install of the app is unchanged. The cache-fills-itself layer from last week is still there, doing the same quiet work it always was. If you never tap the offline-catalog tile in settings, none of PR #371 affects you. The bundle approach I argued against was a default that everyone paid for. The bundle in PR #371 is a thing people opt into when they need it, and when they do, they get to choose how much of it.

A free-tier CDN that actually stays free

The cost story turned out, to my own quiet surprise, to land at zero.

The catalog is served from a Cloudflare R2 bucket through a custom domain (catalog.opennutritracker.org), with the entire pipeline running on the Cloudflare free plan and on free GitHub Actions runners on a public repository. There is no monthly bill anywhere in the architecture. I was nervous about that part for several days while I was working through it, because zero-cost stories tend not to survive contact with the actual traffic they have to serve, and I wanted to be sure I was not building the project a bill it would discover later.

The piece that makes it actually work is the cache rule. Cloudflare’s default cache behaviour caches a fixed list of static-asset extensions like .png and .zip and explicitly does not cache HTML or JSON. The catalog serves chunk files named variant.db.gz.part-NN and per-variant variant.manifest.json files, neither of which is on the default-cached list (.part-NN is not a recognised extension, and .json is excluded). Without an explicit rule, every response would be marked dynamic, the edge would never serve a hit, and every download would reach R2 origin and rack up GetObject calls.

The fix is a single Cloudflare ruleset that matches on host (http.host eq "catalog.opennutritracker.org") rather than on extension. Every response from the custom domain is cacheable, regardless of file extension or origin headers. From the edge’s point of view, the catalog host is one big bag of cacheable bytes, and once a chunk has been requested by anyone, it stays warm at the edge for the cache TTL.
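To make the difference concrete, here is a rough model of the two matching strategies. It is a sketch under stated assumptions: the extension set is a small illustrative subset rather than Cloudflare's full default list, the URL paths are hypothetical, and the real rule is a Cloudflare ruleset expression, not Python.

```python
from urllib.parse import urlsplit

# Illustrative subset only; Cloudflare's real default list is much longer.
DEFAULT_CACHED_EXTENSIONS = {"png", "jpg", "css", "js", "zip", "gz"}
CATALOG_HOST = "catalog.opennutritracker.org"


def cached_by_default(url):
    """Model extension-based caching: only the final extension counts."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    if "." not in name:
        return False
    extension = name.rsplit(".", 1)[-1].lower()
    return extension in DEFAULT_CACHED_EXTENSIONS


def cached_by_host_rule(url):
    """Model the host-based rule: everything on the catalog host is eligible."""
    return urlsplit(url).hostname == CATALOG_HOST


# Hypothetical object paths, following the naming scheme described above.
chunk_url = "https://catalog.opennutritracker.org/s1_n1_r5.db.gz.part-00"
manifest_url = "https://catalog.opennutritracker.org/s1_n1_r5.manifest.json"

for url in (chunk_url, manifest_url):
    print(url, cached_by_default(url), cached_by_host_rule(url))
```

The chunk name ends in part-00 rather than gz, so an extension match never sees the gz at all, which is why matching on host is the simpler lever.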

The TTL split was where I spent the longest. The edge TTL is seven days, set in OpenTofu as var.edge_cache_seconds = 604800. That is long enough for a chunk that nobody asks about for a few days to still be sitting at the edge when somebody finally does, since Cloudflare does not preferentially evict TTL-fresh content to make room for hotter data. The browser-side Cache-Control: max-age is two minutes, set as var.browser_cache_seconds = 120. That is short on purpose, so that the moment a Saturday rebuild lands and the explicit purge fires, clients converge on the fresh state within minutes rather than carrying stale bytes for hours. Client revalidations land on Cloudflare’s edge, which still holds the chunks under the seven-day edge TTL, and either return 304 Not Modified or the new body without ever touching R2 origin. R2 egress to Cloudflare is free regardless of plan, and the edge cache absorbs the rest, so the request volume costs nothing whichever way it converges.
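The arithmetic behind the split is short enough to write down. This is a simplified model, not a description of Cloudflare's internals: it assumes a client uses its local copy until max-age expires and then revalidates, and the function name is hypothetical.

```python
EDGE_CACHE_SECONDS = 604800    # var.edge_cache_seconds, i.e. seven days
BROWSER_CACHE_SECONDS = 120    # var.browser_cache_seconds, i.e. two minutes


def stale_seconds_after_purge(fetched_at, purged_at, max_age=BROWSER_CACHE_SECONDS):
    """Seconds a client may keep using an outdated local copy after a purge.

    Both arguments are timestamps in seconds on the same clock.
    """
    if fetched_at > purged_at:
        return 0  # fetched after the purge, so the copy is already fresh
    return max(0, fetched_at + max_age - purged_at)


print(EDGE_CACHE_SECONDS == 7 * 24 * 60 * 60)
print(stale_seconds_after_purge(fetched_at=0, purged_at=30))
```

Under this model the worst case is a client that fetched at the very moment of the purge, and even that client revalidates after max_age seconds.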

The 512 MB cacheable-response cap on the free, Pro, and Business plans is the other piece I had to plan for. The largest variant compresses to roughly 520 MiB, which is over the cap as a single file. The build pipeline splits each compressed catalog into chunks of at most 256 MiB, which keeps every individual response comfortably under the limit. From the client’s point of view the chunks still form one contiguous stream: it fetches them in parallel with HTTP Range requests, which the edge answers with 206 Partial Content, and Cloudflare caches range responses by default, so the cache hit ratio holds even on resume.
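A minimal sketch of the split, assuming fixed-size chunks with a shorter final one. The function names are hypothetical, and the two-digit zero padding is a guess based on the part-NN naming; the real pipeline may choose boundaries differently.

```python
MIB = 1024 * 1024
MAX_CHUNK_BYTES = 256 * MIB  # per-chunk ceiling described above


def chunk_ranges(total_bytes, max_chunk=MAX_CHUNK_BYTES):
    """Return (index, start, end_exclusive) for each chunk of a file."""
    ranges = []
    start = 0
    index = 0
    while start < total_bytes:
        end = min(start + max_chunk, total_bytes)
        ranges.append((index, start, end))
        start = end
        index += 1
    return ranges


def chunk_name(variant, index):
    """Build a name in the variant.db.gz.part-NN style (padding assumed)."""
    return f"{variant}.db.gz.part-{index:02d}"


largest = chunk_ranges(520 * MIB)
default = chunk_ranges(73 * MIB)
print(len(largest), len(default))
print([chunk_name("s0_n0_rany", i) for i, _, _ in largest])
```

With these figures the largest variant needs three chunks and the recommended default fits in one, and no chunk comes anywhere near the cacheable-response cap.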

Free public-repo runners doing the build work

The build pipeline runs once a week on a GitHub Actions cron, on a ubuntu-latest runner. Per GitHub’s billing reference, standard GitHub-hosted runners on a public repository are free for unlimited minutes, which is the property the design depends on. There are no -large or -xlarge runner suffixes anywhere in the workflow, no GPU runners, and no self-hosted runners. The catalog pipeline runs entirely on the free standard runner pool from end to end.

The single-job design is the other small piece of the cost story. The build, the OpenTofu plan, the apply, and the cache purge all run inside one job, which means there is no inter-job artefact handoff and therefore no rolling artefact storage to pay for. The intermediate sqlite databases and chunk files exist for the duration of the job and get cleaned up when the runner is recycled. Nothing persists between runs except the bytes already on Cloudflare’s side and the OpenTofu state file.

I want to name something here, because I think it is the kind of thing that gets quietly elided in posts about free-tier infrastructure. The fact that the project budget is zero is not a virtue on its own. What it actually buys is independence. Simon, who built and maintains the project, does not have to sign anyone up to a recurring bill in order for the offline catalog to keep working. The day after this PR merges, the pipeline is not waiting for someone to top up a credit card. That property is the one I was most quietly nervous about during the design, and the one I am quietly happy about now that it is in place.

Infrastructure as a single OpenTofu graph

The Cloudflare side is managed end to end with OpenTofu: the bucket, the custom domain, the cache ruleset, the downstream API tokens, the seven sealed GitHub Actions secrets that publish those tokens into the build workflow, and every individual chunk and manifest in the catalog. Each chunk is modelled as an aws_s3_object resource keyed off filesha256().

The choice of OpenTofu over Terraform was a deliberate one. At my day job we use Terraform for everything, and that is the right tool for the place because the constraints there are stable and the August 2023 license change does not bite anyone in the way it bites an open source project. For this project, though, I wanted the infrastructure code that ships alongside an open source app to be unambiguously open source itself, and the move from MPL 2.0 to the Business Source License takes that property away in a way I was not willing to inherit. OpenTofu, the community fork that came out of exactly that concern, was the obvious answer.

Picking it for this project also gave me a chance to sit with what the fork has actually built in the time since the split, which I had been quietly meaning to do. The S3-native lockfile is one of the additions that makes the R2-only state-locking story work without a DynamoDB stand-in, and it is the kind of small post-fork improvement that makes me glad the wider community took the time to keep building.

The keying is the part that quietly does the most work. A chunk whose content has not changed since the last apply produces no diff, no PutObject call, and no upload. The Saturday cron uploads only the bytes that have actually changed week-over-week, which for the typical week is a small fraction of the total catalog because most of the OFF database does not change in any given seven-day window. tofu plan previews exactly which chunks are about to land before any bytes leave the runner, which is its own quiet reassurance the first few times you watch it run.
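The effect of content keying can be modelled outside OpenTofu in a few lines. This is an analogy rather than the mechanism itself: OpenTofu compares resource attributes in state, whereas the sketch compares a hypothetical dictionary of last week's hashes against this week's bytes.

```python
import hashlib


def sha256_hex(data):
    """Hex digest of a chunk's bytes, the same idea as keying on filesha256()."""
    return hashlib.sha256(data).hexdigest()


def chunks_to_upload(previous_hashes, current_chunks):
    """Return the names of chunks whose content differs from the last apply.

    previous_hashes maps chunk name to hex digest; current_chunks maps
    chunk name to bytes. New chunks count as changed.
    """
    changed = []
    for name, data in sorted(current_chunks.items()):
        if previous_hashes.get(name) != sha256_hex(data):
            changed.append(name)
    return changed


last_week = {"a.part-00": sha256_hex(b"alpha"), "a.part-01": sha256_hex(b"beta")}
this_week = {"a.part-00": b"alpha", "a.part-01": b"beta v2", "a.part-02": b"gamma"}
print(chunks_to_upload(last_week, this_week))
```

An unchanged chunk hashes to the same value as last week and drops out of the list, which is the same reason it produces no diff in a plan.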

State lives in a separate private R2 bucket and is encrypted at rest at the OpenTofu layer with PBKDF2-SHA512 plus AES-256-GCM, with enforced = true so an unencrypted state operation cannot accidentally land. The state file holds the raw values of the downstream API tokens, so the encryption layer means a leaked R2 credential on its own does not yield a usable upload token without the passphrase.
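For readers unfamiliar with the key-derivation half of that pairing, here is a sketch of PBKDF2-SHA512 producing a 32-byte key of the size AES-256 expects. The iteration count and salt handling are assumptions for illustration and are not OpenTofu's actual parameters; the AES-256-GCM step is left out because Python's standard library has no AES implementation.

```python
import hashlib
import os


def derive_state_key(passphrase, salt, iterations=600_000):
    """Derive a 32-byte key from a passphrase with PBKDF2-HMAC-SHA512.

    The iteration count is an illustrative default, not OpenTofu's setting.
    """
    return hashlib.pbkdf2_hmac(
        "sha512", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )


salt = os.urandom(16)  # a fresh random salt, stored alongside the ciphertext
key = derive_state_key("example passphrase", salt, iterations=1_000)
print(len(key))
```

The point of the derivation is that the key never exists at rest: without the passphrase, the bytes in the state bucket are not enough to recover it.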

State locking uses OpenTofu’s S3-native lockfile via use_lockfile = true, documented in the OpenTofu S3 backend reference. R2 has supported the conditional-write semantics that lockfile depends on since October 2024, which means there is no DynamoDB stand-in needed to make state locking work. The whole infrastructure layer sits inside Cloudflare and never touches AWS. A trailing unlock job in the workflow handles the case of a runner being killed mid-apply, deleting the lockfile only when its created timestamp is within the run’s window so it cannot clobber a developer apply that happens to be in flight.
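The guard in the unlock job reduces to one comparison. A minimal sketch, assuming the lockfile exposes a creation timestamp and the workflow knows its own start and end times; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def lock_belongs_to_run(lock_created, run_started, run_finished):
    """True only if the lockfile was created inside this run's window.

    A lock created before the run started, or after it finished, belongs
    to someone else (for example a developer apply) and must be left alone.
    """
    return run_started <= lock_created <= run_finished


run_started = datetime(2025, 1, 4, 3, 0, tzinfo=timezone.utc)
run_finished = run_started + timedelta(minutes=40)

own_lock = run_started + timedelta(minutes=12)
developer_lock = run_started - timedelta(minutes=5)

print(lock_belongs_to_run(own_lock, run_started, run_finished))
print(lock_belongs_to_run(developer_lock, run_started, run_finished))
```

The dates here are arbitrary examples. What matters is that the job deletes a lock only when it can show the lock is its own.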

The pleasing thing about the chunk-as-resource approach is that adding, renaming, or removing a chunk happens through a code change rather than through a manual upload or a console click. The shape of the catalog is in the repo, the diff is reviewable, and the apply is a function of what the build pipeline produced that week. Most of the small operational anxieties I have had about catalog data over the years come down to “who can change this and how”, and modelling every byte as an OpenTofu resource quietly answers both.

The chunked install on a real phone

The Flutter client reverses the build pipeline: parallel chunked downloads with HTTP Range support, sha256 verification at both per-chunk and combined levels, gunzip with backpressure so the device’s memory does not balloon while the catalog is unpacking, a schema check against the catalog’s schema_version columns, and an atomic rename at the end so the previous catalog is never deleted before the new one is verified.
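The tail end of that sequence (combined hash, streamed gunzip, atomic rename) can be sketched in Python. The real client is written in Dart, so this is an illustration of the technique only. The function name and the .tmp suffix are hypothetical, and real backpressure would also bound the decompressed output per step, which this sketch does not do.

```python
import hashlib
import os
import zlib


def install_catalog(chunk_paths, expected_sha256, final_path, block_size=1024 * 1024):
    """Stream chunks through sha256 and gunzip, then swap the result in.

    The existing file at final_path is only replaced after the combined
    hash of the compressed bytes has been verified.
    """
    tmp_path = final_path + ".tmp"
    digest = hashlib.sha256()
    decompressor = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
    with open(tmp_path, "wb") as out:
        for path in chunk_paths:
            with open(path, "rb") as chunk:
                # Fixed-size reads keep the compressed input out of memory.
                while block := chunk.read(block_size):
                    digest.update(block)
                    out.write(decompressor.decompress(block))
        out.write(decompressor.flush())
    if digest.hexdigest() != expected_sha256:
        os.remove(tmp_path)
        raise ValueError("combined sha256 mismatch; previous catalog left in place")
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
```

Because the rename is the last step, any failure before it leaves the previously installed catalog exactly as it was.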

The decision I am most fond of in the client is that pause and resume actually work. Backgrounding the app, dropping connectivity, swapping between Wi-Fi and mobile data, or tapping the explicit pause control in the wizard all leave the partial install in a state where the next resume picks up at the last completed chunk and byte offset. The per-chunk sha256 verification is the safety net underneath the resume: if the resume happens to pick up an incorrect byte stream for any reason, verification fails at that chunk rather than silently corrupting the catalog. A failed install is recoverable. A silently corrupted catalog, the kind that returns wrong nutrition numbers for months before anyone notices, is the failure mode I most wanted to rule out, and ruling it out felt worth the design effort.
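The resume logic amounts to finding the first incomplete chunk, asking for the rest of it, and verifying the result before moving on. A sketch under assumptions: the real client is Dart, the bookkeeping structure here (two parallel lists of sizes) is hypothetical, and only the Range header syntax is standard HTTP.

```python
import hashlib


def next_request(expected_sizes, received_sizes):
    """Return (chunk_index, range_header) for the first incomplete chunk.

    Returns None when every chunk has been fully received.
    """
    for index, (expected, received) in enumerate(zip(expected_sizes, received_sizes)):
        if received < expected:
            return index, f"bytes={received}-"
    return None


def verify_chunk(data, expected_sha256):
    """True if the chunk's bytes match the digest listed for it."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Three chunks; the second was interrupted after 1000 bytes.
print(next_request([4096, 4096, 2048], [4096, 1000, 0]))
```

If a resumed download stitches together the wrong bytes, verify_chunk fails for that chunk alone, and only that chunk has to be fetched again.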

I tested this part on a train during the last week of the work, deliberately backgrounding the app while a chunk of the 520 MiB catalog was mid-download, putting the phone in my pocket, walking through a tunnel, and unlocking it again on the other side to find the wizard quietly resuming where it had been. There is something about watching that work that I find hard to write about without being slightly soft about it. The progress bar coming back to life on its own, after the network has come back, feels like a small private kindness from the engineering to the person using it. That is the kind of detail I most want this app to keep accumulating.

What a flaky moment shouldn’t break

A handful of smaller decisions catch the failure modes that would otherwise have made the catalog feel fragile in everyday use.

The settings tile that opens the install wizard runs an availability probe before the wizard is allowed to open. The probe is the CheckCatalogAvailabilityUseCase, and it asks Cloudflare whether catalog.opennutritracker.org is reachable and serving manifests right now. If it is not, the tile reads “caching currently unavailable, try again later” and the wizard refuses to open. The point is not to block the user; the point is to fail kindly at the tap rather than wedge the wizard halfway through a download, and to surface the state of the world honestly when the world is briefly not cooperating.

Two consecutive crash-on-boot events auto-disable the catalog. The catalog is not deleted, just quietly turned off, and an explicit re-enable banner appears in settings the next time the app comes up. The reason is that the catalog sits on the boot path, and a corrupted or schema-incompatible catalog could in principle become a crash loop that the user cannot escape from without uninstalling the whole app. Auto-disabling on the second crash means the worst case is a catalog that gracefully steps out of the way and waits for the user to decide whether to bring it back, rather than an app that refuses to open at all on the morning she most needs it.
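One common way to build this kind of guard is a persisted counter of boots that started but never reported success. The sketch below assumes that pattern; the class and function names are hypothetical, and the app's actual implementation in Dart may differ.

```python
from dataclasses import dataclass

CRASH_LIMIT = 2  # two consecutive crash-on-boot events, as described above


@dataclass
class CatalogBootState:
    enabled: bool = True
    unfinished_boots: int = 0  # boots that began but never reported success


def on_boot_started(state):
    """Call first thing at startup; returns True if the catalog may be opened."""
    if state.enabled and state.unfinished_boots >= CRASH_LIMIT:
        state.enabled = False        # step aside; the catalog file stays on disk
        state.unfinished_boots = 0
    if state.enabled:
        state.unfinished_boots += 1  # would be persisted before opening the catalog
    return state.enabled


def on_boot_succeeded(state):
    """Call once the app is up; a clean boot resets the streak."""
    state.unfinished_boots = 0


def re_enable(state):
    """What the settings banner would trigger."""
    state.enabled = True
    state.unfinished_boots = 0


state = CatalogBootState()
on_boot_started(state)          # first boot, crashes before success is reported
on_boot_started(state)          # second boot, crashes again
print(on_boot_started(state))   # third boot finds two unfinished boots
```

A single crash followed by a clean boot resets the counter, so only a genuine streak turns the catalog off.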

Schema versioning is split into a major and a minor in a catalog_meta table. A client at major version N installs whatever the CDN serves at the same major, even if the minor has moved forward, which means a stale client is not locked out the moment the server-side schema picks up a new minor. The major version is the line that does not get crossed without a coordinated client release. The minor is everything else, and the everything else is allowed to roll on its own cadence without anyone losing access to their installed catalog.
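The compatibility rule fits in a single comparison. A minimal sketch with a hypothetical function name; how the client actually reads the two numbers out of catalog_meta is not shown here.

```python
def can_install(client_major, catalog_major, catalog_minor):
    """A client installs any catalog that shares its major version.

    catalog_minor is accepted but deliberately ignored: minors may move
    ahead of the client without locking it out.
    """
    return client_major == catalog_major


print(can_install(client_major=2, catalog_major=2, catalog_minor=7))
print(can_install(client_major=2, catalog_major=3, catalog_minor=0))
```

The version numbers in the example are made up. Ignoring the minor is the whole design: it is carried for information, never for gating.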

The FTS5 query sanitiser is a small piece of code I am quietly fond of. SQLite’s FTS5 has a query syntax that includes operator keywords, and an unsanitised free-form query like “Salt and Vinegar Crisps” would parse the word “and” as a binary operator. The sanitiser drops anything that is not a Unicode letter, digit, or whitespace, plus the four operator keywords FTS5 recognises (AND, OR, NOT, and NEAR), before the query reaches the index. Three regression tests pin this behaviour, so a future refactor that quietly breaks the sanitiser will fail the test suite rather than ship a search box that returns nonsense for any natural-language query containing a connective. It is a small thing. It is also the kind of small thing that decides whether a search experience feels reliable or feels haunted.
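A rough Python rendering of that idea follows. It is not the app's Dart code. Three choices are assumptions made for the sketch: the keyword set is taken to be AND, OR, NOT, and NEAR; keywords are matched case-insensitively; and stripped punctuation is replaced with a space rather than removed, so that hyphenated words do not fuse together.

```python
import re

# Assumed keyword set; matched case-insensitively in this sketch.
FTS5_KEYWORDS = {"AND", "OR", "NOT", "NEAR"}

# \w covers Unicode letters and digits plus underscore, so underscore is
# stripped explicitly by the second alternative.
_NOT_LETTER_DIGIT_OR_SPACE = re.compile(r"[^\w\s]|_")


def sanitise_fts5_query(raw):
    """Reduce free-form input to plain search terms separated by spaces."""
    cleaned = _NOT_LETTER_DIGIT_OR_SPACE.sub(" ", raw)
    terms = [t for t in cleaned.split() if t.upper() not in FTS5_KEYWORDS]
    return " ".join(terms)


print(sanitise_fts5_query("Salt and Vinegar Crisps"))
print(sanitise_fts5_query("Müsli (bio) 500g*"))
```

An input made up entirely of keywords or punctuation comes back as an empty string, which the caller would need to treat as "no query" rather than pass to the index.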

On changing one’s mind in public

The May 4 post argued against bundling a subset of Open Food Facts. PR #371 ships a bundled subset of Open Food Facts. Both are honest accounts of where my thinking was at the time I wrote them, and I want to say plainly that I do not see any of this as a contradiction to be resolved. The previous design was right for a question shaped one way; the new design is right for a question shaped a slightly different way; and the work between the two posts was the work of letting myself notice that the shape of the question had changed.

The thing that actually changed, when I look at it now, was less the constraints of the system and more the constraint of the framing. I had been asking “should this be the default?” The answer to that is still no. What I had not been asking was “what would it look like to make it opt-in, ship it for free, and let people who specifically need it choose how much of it they want?” That question has a different answer, and the answer is the catalog feature that landed last week.

I am still thinking about how often that pattern shows up in engineering work, where a previous answer holds inside its original framing and stops holding the moment the framing relaxes. I do not have a clean theory of it. What I have is a quiet sense that the discipline worth practising is sitting with one’s own previous decisions long enough to notice when they have stopped applying, and being willing to write the new version down without dressing up the contradiction as something it is not. Good engineering is rarely the work of being right the first time. It is, more often, the work of being honest about where you were before, and letting the next version of the thinking have room to breathe.

What the catalog feature is actually for, when I think about who I had in mind during the work, is the people for whom an unreliable network is not an occasional inconvenience but a daily texture of how their connection to the world feels. People in rural areas with patchy mobile coverage. People on long commutes through tunnels and underpasses. People whose flats have one corner where the signal works properly and the kitchen is not in that corner, so logging the dinner they cooked is the moment they always remember the network problem they have been living around for weeks. They are the ones who deserve a calorie tracker that does not concede defeat the moment a server stops answering, and the catalog is opt-in because nobody else should have to download three gigabytes of food data they did not ask for. Most of the design decisions in this post are quiet implementations of the same promise, and the rest of the work, the wizard and the cache rule and the chunk graph and the resume semantics, is in service of getting out of the way of that promise.

I am quietly happy this version of the feature exists, and I am quietly grateful to Simon for trusting the project to a contributor who occasionally changes her mind in public about what its design should look like. The next version of the question, the one I am sure is already waiting somewhere in the issue tracker, is one I want to leave myself room to be wrong about too, in the same patient way the last week of work has tried to be.

This post is licensed under CC BY 4.0 by the author.