The control plane for self-hosted inference
NeuronCluster turns the machines you already run into a managed inference fleet: one hub, a gateway layer, and compute nodes - with your data never leaving the perimeter.
Three layers, one system
Each layer has a clear responsibility, so you can scale and operate them independently.
Central Management Hub
The single control plane for your inference fleet. Register and version models, define subnets and routing policies, manage users and roles, and stream live telemetry from every node.
- Model & subnet registry
- Routing and placement policies
- Users, roles and API keys
- Fleet-wide observability
Gateway layer
Stateless gateways accept client traffic over gRPC, HTTP and WebSocket, then reserve and dispatch work to healthy compute nodes. Run one per region or subnet for locality and fault isolation.
- gRPC / HTTP / WebSocket
- Load balancing & reservations
- Per-subnet isolation
- Horizontal scale-out
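As a rough illustration of how a gateway might combine per-subnet isolation with load balancing, the sketch below picks the least-loaded healthy node inside one subnet. The `NodeStatus` shape, the `pickNode` helper, and the load-ratio policy are all assumptions made for this example; the page does not describe NeuronCluster's actual selection algorithm.

```typescript
// Hypothetical view of a compute node as a gateway might track it.
type NodeStatus = {
  id: string;
  subnet: string;
  healthy: boolean;
  active: number;   // tasks currently running
  capacity: number; // maximum concurrent tasks
};

// Return the healthy node in `subnet` with the lowest load ratio,
// or undefined when no node in that subnet has spare capacity.
function pickNode(nodes: NodeStatus[], subnet: string): NodeStatus | undefined {
  let best: NodeStatus | undefined = undefined;
  for (const n of nodes) {
    // Skip nodes in other subnets, unhealthy nodes, and full nodes.
    if (n.subnet !== subnet || !n.healthy || n.active >= n.capacity) continue;
    if (best === undefined || n.active / n.capacity < best.active / best.capacity) {
      best = n;
    }
  }
  return best;
}

const fleet: NodeStatus[] = [
  { id: "a", subnet: "eu", healthy: true, active: 3, capacity: 4 },
  { id: "b", subnet: "eu", healthy: true, active: 1, capacity: 4 },
  { id: "c", subnet: "us", healthy: true, active: 0, capacity: 4 },
  { id: "d", subnet: "eu", healthy: false, active: 0, capacity: 4 },
];

const chosen = pickNode(fleet, "eu");
console.log(chosen ? chosen.id : "none"); // "b"
```

Node "c" is idle but sits in another subnet, and node "d" is idle but unhealthy, so neither is considered for "eu" traffic.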
Compute nodes
Compute nodes execute models on your GPUs and CPUs inside layered OS sandboxes. They sync models from their subnet, run inference, and sign results before returning them.
- GPU & CPU execution
- Automatic model sync
- Sandboxed isolation
- Signed outputs
A control plane and a compute fleet, cleanly separated
Requests flow from your applications through stateless gateways to sandboxed compute nodes. The hub orchestrates the whole topology, while every byte of data stays inside your network.
Your applications: REST / gRPC / WebSocket clients
Central Management Hub: Models · nodes · routing · users · telemetry
Gateway A: Subnet · region 1
Gateway B: Subnet · region 2
Node 1: GPU
Node 2: GPU
Node 3: CPU
How a single inference request flows
From client call to signed result, every step happens inside infrastructure you control.
Submit
Your application sends a request to a gateway endpoint.
Discover
The gateway finds healthy nodes serving the requested model.
Reserve
Capacity is reserved on a node for the task.
Dispatch
The task is streamed to the node and executed in a sandbox.
Sign
The node signs the output so that its provenance can be verified.
Return
Results stream back to your client - never leaving your network.
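The six steps above can be sketched as a small in-memory model. Everything here is hypothetical: the `ComputeNode` type, `handleRequest`, and especially `toySign`, which is a non-cryptographic placeholder because the page does not say which signature scheme nodes use. A real gateway would also make these calls over the network and asynchronously.

```typescript
// Hypothetical in-memory stand-in for a compute node.
type ComputeNode = {
  id: string;
  healthy: boolean;
  models: string[];
  freeSlots: number;
  run: (input: string) => string; // stands in for sandboxed execution
};

type SignedResult = { nodeId: string; output: string; signature: string };

// Placeholder "signature": a simple rolling hash, NOT cryptographically secure.
// A real node would presumably sign with a private key.
function toySign(nodeId: string, output: string): string {
  const text = nodeId + ":" + output;
  let h = 0;
  for (let i = 0; i < text.length; i++) {
    h = (h * 31 + text.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

// Submit: the caller passes the model name and input to the gateway.
function handleRequest(nodes: ComputeNode[], model: string, input: string): SignedResult {
  // Discover: healthy nodes serving the requested model.
  const candidates = nodes.filter((n) => n.healthy && n.models.indexOf(model) !== -1);

  // Reserve: claim a slot on the first candidate with spare capacity.
  const available = candidates.filter((n) => n.freeSlots > 0);
  if (available.length === 0) {
    throw new Error(`no capacity for model ${model}`);
  }
  const node = available[0];
  node.freeSlots -= 1;

  try {
    // Dispatch: run the task on the reserved node.
    const output = node.run(input);
    // Sign and Return: attach the node's signature to the output.
    return { nodeId: node.id, output, signature: toySign(node.id, output) };
  } finally {
    // Release the reservation whether or not execution succeeded.
    node.freeSlots += 1;
  }
}

const demoNodes: ComputeNode[] = [
  { id: "node-1", healthy: false, models: ["demo-model"], freeSlots: 2, run: (s) => s.toUpperCase() },
  { id: "node-2", healthy: true, models: ["demo-model"], freeSlots: 1, run: (s) => s.toUpperCase() },
];

const flowResult = handleRequest(demoNodes, "demo-model", "hello");
console.log(flowResult.nodeId, flowResult.output); // node-2 HELLO
```

The reservation is released in a `finally` block so that a failed task does not permanently consume capacity on the node.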
Production-grade from day one
Any model, any modality
LLMs, vision, speech and audio, embeddings, classical ML or your own fine-tunes - in TorchScript, ONNX or Safetensors.
Resilient by design
Gateway and node failures are handled gracefully with reservations, retries and health checks - no single point of failure.
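One way to picture retries across nodes is a small failover loop: try each candidate in turn and return the first success. The `withFailover` helper and its parameters are assumptions for illustration only, not part of any NeuronCluster API, and the sketch is synchronous for brevity.

```typescript
// Try `attempt` against each node id in order, up to `maxAttempts` nodes,
// and return the first successful result together with the node that served it.
function withFailover<T>(
  nodeIds: string[],
  attempt: (nodeId: string) => T,
  maxAttempts: number
): { nodeId: string; value: T } {
  const errors: string[] = [];
  const limit = Math.min(maxAttempts, nodeIds.length);
  for (let i = 0; i < limit; i++) {
    const nodeId = nodeIds[i];
    try {
      return { nodeId, value: attempt(nodeId) };
    } catch (err) {
      // Record the failure and move on to the next node.
      errors.push(`${nodeId}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all attempts failed (${errors.join("; ")})`);
}

const tried: string[] = [];
const outcome = withFailover(
  ["node-1", "node-2", "node-3"],
  (id) => {
    tried.push(id);
    if (id === "node-1") throw new Error("connection refused"); // simulated failure
    return `ok from ${id}`;
  },
  3
);
console.log(outcome.nodeId, tried.length); // node-2 2
```

In practice the candidate list would come from health checks, so nodes already known to be down are never tried.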
Scales horizontally
Add gateways for throughput and nodes for capacity. Subnets keep workloads isolated and placed where they belong.
Ship AI features against infrastructure you own
One integration for every model and modality. Point your SDK at your own hub and start building - no data leaves the network.
- Unified REST, gRPC and WebSocket APIs
- Drop-in SDKs for the stacks your teams already use
- Streaming responses and batch processing
- Self-hosted endpoints - no third party in the path
import { NeuronCluster } from "@neuroncluster/sdk";

const nc = new NeuronCluster({
  endpoint: "https://hub.internal.acme.com",
  apiKey: process.env.NC_API_KEY,
});

// Same call, whether the model is an LLM,
// a vision model, or your own fine-tune.
const res = await nc.inference.create({
  model: "llama-3-70b-instruct",
  input: { prompt: "Summarize this contract..." },
});

console.log(res.output);

Bring inference in-house
See how NeuronCluster runs your models on your infrastructure - with the control, economics and compliance posture your organization needs.