Self-Hosting a Vision-Language Model for Privacy-First Document Intelligence

Every year, millions of Americans applying for SNAP (Supplemental Nutrition Assistance Program) food assistance get stalled at the document submission stage, waiting days for a human to review what they sent. mRelief, a nonprofit that has helped millions of people access public benefits, was spending 23 staff hours per week on that review alone.
Automating this step meant classifying documents across 35 categories and extracting structured data from each one. The harder constraint: these documents (IDs, financial statements, immigration records, etc.) could not leave mRelief's infrastructure. Sending images to a third-party API means your data exits your VPC (Virtual Private Cloud) the moment the request fires, taking the audit trail with it and introducing third-party risk into every compliance conversation.
mRelief came to the table well-prepared: their data workflow was already production-grade, built around a strict 60-day rolling window that shaped our pipeline design from day one. They had real applicant data, which made our accuracy numbers meaningful (no drift between experimentation and production). When annotation patterns looked off, their domain intuition helped us distinguish genuine model errors from labeling noise, which directly informed our experimentation process.
We delivered two pipelines: Gemini 2.5 Flash as the default, and a fully self-hosted, open-source pipeline for workflows where data must stay within mRelief's own environment.
Getting the AI Right
Selecting the right model for a self-hosted pipeline is a different problem than selecting one for an API. Production behaviour matters as much as benchmark numbers: how the model holds up under constrained decoding, whether structured output is reliable at scale, how quantization affects real task performance rather than synthetic benchmarks.
We ran a systematic search across model variants and quantizations, tracked in MLflow (an open-source platform for tracking ML experiments), and landed on Qwen3-VL-8B-FP8. The choice came down to more than the numbers. A competing configuration performed within ~3% on accuracy metrics, but the model had just been released and its inference framework support was still catching up, so structured output behaviour was unstable enough to rule it out for production. A marginally higher benchmark score isn't worth a model that occasionally misbehaves when you need reliable, structured responses at scale.
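One way to make "structured output stability" measurable during a model search is to validate every raw response against the expected shape and track the failure rate per candidate. The sketch below is an illustration, not the project's actual code: the category labels, the schema, and the function names are all hypothetical (the real pipeline uses 35 categories that are not listed here).

```python
import json
from typing import List, Optional

# Hypothetical label set; the real pipeline has 35 categories not listed in this post.
ALLOWED_CATEGORIES = {"pay_stub", "drivers_license", "bank_statement", "other"}

# Hypothetical JSON Schema of the kind a constrained decoder could be given.
CLASSIFICATION_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": sorted(ALLOWED_CATEGORIES)},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["category", "confidence"],
    "additionalProperties": False,
}


def parse_classification(raw: str) -> Optional[dict]:
    """Return the parsed response, or None if it violates the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Exactly the two expected keys, nothing more and nothing less.
    if not isinstance(data, dict) or set(data) != {"category", "confidence"}:
        return None
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None
    conf = data["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0 <= conf <= 1:
        return None
    return data


def malformed_rate(raw_responses: List[str]) -> float:
    """Fraction of responses that fail validation, tracked per model candidate."""
    if not raw_responses:
        return 0.0
    bad = sum(parse_classification(r) is None for r in raw_responses)
    return bad / len(raw_responses)
```

Logging a number like malformed_rate alongside accuracy for each run puts stability and benchmark score side by side, which is the comparison the paragraph above describes.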
We tested multiple quantization levels (Q4, Q8, FP8, and others; think of quantization as a compression technique) across different model sizes (8B, 4B, 2B, and so on). The key learning was that more aggressive quantization on a larger model consistently outperformed lighter quantization on a smaller one. FP8 shrinks the model from ~17GB to ~10GB with almost no accuracy loss, leaving enough GPU memory headroom for the instance to handle multiple concurrent requests on a single GPU.
The result: 85–88% top-1 accuracy and 92–95% top-2, near-parity with Gemini on classification. The gap surfaces on degraded scans and complex layouts where Gemini's visual grounding is stronger. On high-sensitivity document types, such as IDs and social security cards, the open-source pipeline matches Gemini entirely.
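The memory figures follow from simple arithmetic: weight footprint is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement. The parameter count is an assumption chosen to match the ~17GB 16-bit figure, and the function name is hypothetical. The 24 GB figure is the NVIDIA L4's memory capacity.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8.5e9      # assumption: picked so that 16-bit weights come to ~17 GB
L4_MEMORY_GB = 24.0     # NVIDIA L4 memory capacity

bf16_gb = weight_footprint_gb(NUM_PARAMS, 2)  # 16-bit: 2 bytes per parameter
fp8_gb = weight_footprint_gb(NUM_PARAMS, 1)   # FP8: 1 byte per parameter

# Whatever the weights do not use is left for the KV cache and activations,
# which is what allows several requests to be served concurrently.
bf16_headroom_gb = L4_MEMORY_GB - bf16_gb
fp8_headroom_gb = L4_MEMORY_GB - fp8_gb

print(f"16-bit: {bf16_gb:.1f} GB weights, {bf16_headroom_gb:.1f} GB headroom")
print(f"FP8:    {fp8_gb:.1f} GB weights, {fp8_headroom_gb:.1f} GB headroom")
```

The pure 1-byte estimate lands below the ~10GB observed in practice. A plausible explanation, stated here as an assumption, is that FP8 checkpoints typically keep some layers in higher precision and store scaling factors alongside the quantized weights.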
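For readers unfamiliar with the metrics: top-1 accuracy counts a prediction as correct only when the model's first choice matches the label, while top-2 also accepts the second choice. A minimal sketch of the computation, with a hypothetical function name and toy data rather than real evaluation results:

```python
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   labels: Sequence[str],
                   k: int) -> float:
    """Share of examples whose true label appears in the first k ranked predictions."""
    if len(ranked_predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    hits = sum(label in preds[:k] for preds, label in zip(ranked_predictions, labels))
    return hits / len(labels)


# Toy example: each inner list is one document's predictions, best guess first.
toy_predictions: List[List[str]] = [["a", "b"], ["b", "a"], ["c", "a"], ["a", "c"]]
toy_labels = ["a", "a", "b", "c"]

top1 = top_k_accuracy(toy_predictions, toy_labels, k=1)
top2 = top_k_accuracy(toy_predictions, toy_labels, k=2)
```

Top-2 is useful in a review workflow because a human who is shown two candidate categories can confirm the right one quickly even when the first guess is wrong.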
Getting the Infrastructure Right
We deployed on Cloud Run GPU with an NVIDIA L4 and scale-to-zero billing, the right fit for sparse, unpredictable traffic where always-on GPU cost doesn't make sense. The tradeoff is cold starts, and our baseline was painful: roughly 11 minutes end-to-end. A series of optimizations brought that down to 130–150 seconds.
VPC placement was the biggest win and the cheapest. Moving the Cloud Run service and GCS (Google Cloud Storage) model bucket into the same VPC dramatically increased available download bandwidth. Without it, GCS traffic is throttled through standard network paths. Inside the VPC, it's fast. Cost: cents per month.
Disabling CUDA graph compilation in vLLM skips the graph capture step at startup. It means lower peak throughput on warm instances, but when cold start latency is the bottleneck, it's a straightforward call once you know the lever exists.
Run:ai Model Streamer parallelizes the GCS download and streams weights directly into GPU memory, eliminating the download-to-disk-then-load step that was costing tens of seconds. We tried building our own solution first. Model Streamer outperformed it, so we didn’t invest more time in addressing a problem that had already been solved.
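The two vLLM-side levers above map to server flags. The sketch below builds a launch command under stated assumptions: the bucket path is a placeholder, the helper name is hypothetical, and the concurrency value is an illustrative default rather than the setting used in this project. The flags themselves (--enforce-eager, --load-format runai_streamer, --model-loader-extra-config) are vLLM options, but check them against the vLLM version you deploy.

```python
import json
from typing import List


def build_vllm_serve_args(model_uri: str, streamer_concurrency: int = 16) -> List[str]:
    """Assemble a `vllm serve` command tuned for cold start rather than peak throughput."""
    return [
        "vllm", "serve", model_uri,
        # Stream weights from object storage into GPU memory instead of
        # downloading to disk first and loading afterwards.
        "--load-format", "runai_streamer",
        # Number of concurrent readers used by the streamer (illustrative value).
        "--model-loader-extra-config", json.dumps({"concurrency": streamer_concurrency}),
        # Skip CUDA graph capture at startup; trades warm throughput for a faster start.
        "--enforce-eager",
    ]


# Placeholder URI; substitute the bucket and directory that hold the model weights.
args = build_vllm_serve_args("gs://example-bucket/example-model")
print(" ".join(args))
```

Keeping the command in one small function makes it easy to toggle a single flag and re-measure cold start time, which is how the effect of each lever can be isolated.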
Two things that didn't work: baking model weights into the image added deployment overhead that outweighed any cold start savings, and GCS Anywhere Cache produced no measurable improvement.
What This Actually Takes
There's a version of this project where you generate a working vLLM container, deploy it, and call it done. The cold starts stay at 11 minutes. Model selection follows a leaderboard. The infrastructure optimizations that actually moved the needle stay undiscovered.
The judgments that mattered here weren't just technical; they were experiential: choosing production stability over benchmark gains, understanding how quantization behaves on real document tasks rather than synthetic evals, diagnosing where the actual bottlenecks were, and finding that a VPC change costing pennies dwarfs every other optimization. These are not outputs you get from vibe-coding. They come from running the experiments, knowing what the results mean, and having the pattern recognition to ask the right questions in the first place.
An experienced practitioner working with AI tools is a genuinely different combination than either alone. The AI accelerates hypothesis generation, automates the tedious instrumentation, and surfaces options faster than any individual can explore manually. The human brings the judgment to know which options are worth pursuing, the instinct to distrust a benchmark that looks too clean, and the accumulated context to recognize when a "working" system isn't actually production-ready. That combination is what this project required.
On the client side, mRelief stood out: they moved fast and pushed hard. They integrated our API into production quickly, tested against real workflows, and consistently surfaced new requirements and features (handwriting detection, document counts per image, privacy pipeline) that sharpened the final system. That kind of engagement and technical curiosity made a real impact on the quality of the system delivered.
For the nonprofit, that translates to a document intelligence system targeting 75% automation of a process that was consuming 23 staff hours a week, with a self-hosted path that keeps sensitive applicant data entirely within their own infrastructure, auditable end-to-end, with no third-party model provider in the chain. With $2.3 billion in benefits unlocked and 2.9 million households served, that is real impact.
Ready to build? Get in touch.
