For defense contractors handling Controlled Unclassified Information (CUI), federal data-handling requirements are non-negotiable. Standard cloud-based large language model (LLM) APIs, including commercial offerings from OpenAI, Anthropic, and Microsoft Azure, are structurally incompatible with air-gapped enclaves: they transmit prompt inputs and document contexts across public WAN boundaries, where sensitive data may be logged on third-party servers.
To capture the efficiency gains of AI agents (which can automate up to 66% of manual document review and formatting tasks) while maintaining NIST SP 800-171 compliance, enterprises must move to on-premises, hardware-locked inference clusters. Below, we examine the technical architecture required to run local models such as Llama-3 70B securely on physical Nvidia DGX hardware.
The Hardware Base: Nvidia DGX & Local VRAM
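Running high-performance models locally requires high memory bandwidth and VRAM density. A 70-billion parameter model needs roughly 35GB for its weights at 4-bit precision and roughly 70GB at 8-bit precision; once quantization metadata and runtime buffers are included, a practical budget is 40GB to 80GB of high-speed memory just to hold the model. On top of that comes the VRAM needed for the context window, which grows linearly with context length and with the number of concurrent sessions (the original Llama 3 release supports 8k tokens of context; Llama 3.1 extends this to 128k).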
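The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate rather than a sizing tool: the function names are illustrative, the architecture defaults (80 layers, 8 key/value heads, head dimension 128) are the published Llama 3 70B values, and a 16-bit KV cache is assumed. Real runtimes add overhead for activations, quantization scales, and memory fragmentation.

```python
GB = 1e9  # decimal gigabytes, matching how VRAM is usually marketed


def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the model weights alone."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(
    context_tokens: int,
    num_layers: int = 80,      # Llama 3 70B
    num_kv_heads: int = 8,     # grouped-query attention
    head_dim: int = 128,
    bytes_per_value: int = 2,  # assumes an FP16/BF16 cache
) -> int:
    """KV-cache memory for ONE sequence of the given length."""
    # The leading factor of 2 covers keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens


if __name__ == "__main__":
    for bits in (4, 8):
        for ctx in (8_192, 131_072):
            weights = weight_bytes(70e9, bits)
            cache = kv_cache_bytes(ctx)
            print(
                f"{bits}-bit weights, {ctx:>7} tokens: "
                f"weights {weights / GB:5.1f} GB + cache {cache / GB:5.1f} GB "
                f"= {(weights + cache) / GB:5.1f} GB"
            )
```

Under these assumptions the cache costs about 2.7GB per sequence at 8k tokens and about 42.9GB at 128k tokens, so a single long-context session can consume more memory than the 4-bit weights themselves. Multiply the cache figure by the number of concurrent sessions when sizing for several users.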
Nvidia DGX nodes, or workstations housing multiple RTX cards (such as dual RTX 6000 Ada, at 48GB each for 96GB of combined VRAM), provide this capacity. A 96GB configuration holds the 4-bit weights with ample headroom for context, or the 8-bit weights with a tighter context budget. Either setup lets local model runtimes sustain high throughput (tokens per second) across multiple concurrent local users.
The Local Software Stack
Our secure, air-gapped installations standardize on a three-tier software stack:
- Inference Engine: We utilize vLLM or Ollama compiled locally. These runtimes are configured to block all external telemetry and run entirely offline.
- Vector Database: A locally hosted Qdrant or PGVector container. This database indexes proprietary documents and schemas using local embedding models (like bge-large-en), enabling secure Retrieval-Augmented Generation (RAG).
- Agent Orchestrator: Custom Python agent runtimes using frameworks like CrewAI or AutoGen. These agents coordinate actions and query tools inside the secure local network.
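As one illustration of the "block all external telemetry" step, the sketch below builds a process environment that asks the Hugging Face libraries and vLLM to stay offline. The helper name `build_offline_env` is hypothetical. The variables themselves (`HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE`, `HF_DATASETS_OFFLINE`, `VLLM_NO_USAGE_STATS`, `DO_NOT_TRACK`) are documented opt-outs, but their behavior should be confirmed against the exact versions installed in the enclave.

```python
import os

# Environment switches that request offline, telemetry-free operation.
# Verify each one against the versions actually deployed.
OFFLINE_ENV = {
    "HF_HUB_OFFLINE": "1",        # huggingface_hub: never contact the Hub
    "TRANSFORMERS_OFFLINE": "1",  # transformers: load from local files only
    "HF_DATASETS_OFFLINE": "1",   # datasets: load from local cache only
    "VLLM_NO_USAGE_STATS": "1",   # vLLM: disable usage-statistics reporting
    "DO_NOT_TRACK": "1",          # generic opt-out that vLLM also honors
}


def build_offline_env(base_env=None):
    """Return a copy of the environment with the offline switches applied.

    `base_env` defaults to the current process environment. The input
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(OFFLINE_ENV)
    return env


if __name__ == "__main__":
    env = build_offline_env()
    for key in OFFLINE_ENV:
        print(f"{key}={env[key]}")
    # The result can be passed to a launcher, for example:
    #   subprocess.run(["vllm", "serve", "/models/llama-3-70b"], env=env)
    # where the model path is a placeholder for a locally stored checkpoint.
```

These switches are defense in depth only. They reduce accidental outbound calls, but the actual guarantee still comes from the network isolation described in the next section.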
"By hosting both the models and the vector database on your physical premises, your CUI remains locked in your silicon. Trust is no longer a policy promise—it is enforced by physics."
Enforcing NIST SP 800-171 Controls
Deploying local AI enclaves directly aligns with key NIST SP 800-171 cybersecurity families:
- Access Control (3.1): Runtimes are bound to local domain controllers with cryptographic access control lists (ACLs). Only verified local terminals can interface with the AI agent endpoint.
- Audit and Accountability (3.3): All system inputs, outputs, and model logs are stored locally on write-once-read-many (WORM) storage media, providing auditable logs without cloud exposure.
- System and Communications Protection (3.13): AI servers are placed on a completely isolated VLAN with disabled WAN routing, preventing data egress.
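Two of these controls lend themselves to small application-level checks that complement the infrastructure. The sketch below is illustrative and not a compliance implementation: `is_internal_address` rejects any endpoint IP that Python's `ipaddress` module does not classify as loopback or private, and `append_record` / `verify_chain` keep a hash-chained audit log so that tampering is detectable even before records reach WORM storage. All function and field names are assumptions made for this example.

```python
import hashlib
import ipaddress
import json

GENESIS_HASH = "0" * 64  # stand-in "previous hash" for the first record


def is_internal_address(host_ip: str) -> bool:
    """True if the IP is loopback or private per Python's ipaddress module.

    Expects a literal IP address; hostnames raise ValueError.
    """
    addr = ipaddress.ip_address(host_ip)
    return addr.is_loopback or addr.is_private


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list, user: str, prompt: str, response: str) -> dict:
    """Append an audit record that commits to the hash of its predecessor."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"user": user, "prompt": prompt, "response": response, "prev": prev_hash}
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash and check that each record links to the previous one."""
    prev_hash = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


if __name__ == "__main__":
    print(is_internal_address("10.0.0.5"))  # address inside the enclave
    print(is_internal_address("8.8.8.8"))   # public address
    audit_log = []
    append_record(audit_log, "analyst1", "Summarize document A", "Summary text")
    append_record(audit_log, "analyst2", "List section headings", "Heading list")
    print(verify_chain(audit_log))
```

An orchestrator could call `is_internal_address` before any tool makes a network request, and run `verify_chain` during periodic audit reviews. Neither check replaces the VLAN isolation or WORM media; they add a second, software-level line of evidence.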
Conclusion
Sovereign AI is the only path forward for the Defense Industrial Base. Deploying private models on physical Nvidia DGX hardware allows contractors to leverage state-of-the-art AI automation, secure set-aside contracts, and protect national security assets—all while maintaining absolute compliance.