NeuronCluster
Platform

The control plane for self-hosted inference

NeuronCluster turns the machines you already run into a managed inference fleet: one hub, a gateway layer, and compute nodes - with your data never leaving the perimeter.

Architecture

Three layers, one system

Each layer has a clear responsibility, so you can scale and operate them independently.

Central Management Hub

The single control plane for your inference fleet. Register and version models, define subnets and routing policies, manage users and roles, and stream live telemetry from every node.

  • Model & subnet registry
  • Routing and placement policies
  • Users, roles and API keys
  • Fleet-wide observability
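
To make the registry and routing-policy idea concrete, here is a minimal in-memory sketch. Every name in it (`Registry`, `ModelEntry`, `RoutingPolicy`) is an assumption made for illustration; the hub's actual schema and API are not described on this page. Only the three model formats come from the page itself.

```typescript
// Hypothetical shapes for illustration; not NeuronCluster's real schema.
interface ModelEntry {
  name: string;
  version: string;
  format: "torchscript" | "onnx" | "safetensors";
}

// Which subnets are allowed to serve a given model.
interface RoutingPolicy {
  model: string;
  subnets: string[];
}

class Registry {
  private models = new Map<string, ModelEntry[]>();
  private policies = new Map<string, RoutingPolicy>();

  // Append a new version; earlier versions are kept for rollback.
  register(entry: ModelEntry): void {
    const versions = this.models.get(entry.name) ?? [];
    versions.push(entry);
    this.models.set(entry.name, versions);
  }

  setPolicy(policy: RoutingPolicy): void {
    this.policies.set(policy.model, policy);
  }

  // "Latest" here means most recently registered, not highest semver.
  latest(name: string): ModelEntry | undefined {
    const versions = this.models.get(name);
    return versions ? versions[versions.length - 1] : undefined;
  }

  // Models without a policy are placed nowhere in this sketch.
  subnetsFor(name: string): string[] {
    return this.policies.get(name)?.subnets ?? [];
  }
}
```

Keeping versions and placement policy in separate maps mirrors the split between "register and version models" and "define subnets and routing policies" described above.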

Gateway layer

Stateless gateways accept client traffic over gRPC, HTTP and WebSocket, then reserve and dispatch work to healthy compute nodes. Run one per region or subnet for locality and fault isolation.

  • gRPC / HTTP / WebSocket
  • Load balancing & reservations
  • Per-subnet isolation
  • Horizontal scale-out
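
One way a gateway could reserve capacity on a healthy node is sketched below. The `NodeState` record, the `reserveNode` and `releaseNode` helpers, and the "most free slots wins" rule are all assumptions for illustration; the page does not say which load-balancing strategy the gateways use. The reservation counter sits on the node record, standing in for whatever state the node itself reports, so the gateway stays stateless.

```typescript
// Hypothetical node record, as a gateway might see it after discovery.
interface NodeState {
  id: string;
  healthy: boolean;
  models: string[]; // models this node has synced
  capacity: number; // maximum concurrent tasks
  reserved: number; // tasks currently holding a slot
}

// Pick the healthy node serving `model` with the most free slots
// and reserve one slot on it. Returns undefined if none qualifies.
function reserveNode(nodes: NodeState[], model: string): NodeState | undefined {
  let best: NodeState | undefined;
  for (const node of nodes) {
    const free = node.capacity - node.reserved;
    if (!node.healthy || !node.models.includes(model) || free <= 0) continue;
    if (best === undefined || free > best.capacity - best.reserved) best = node;
  }
  if (best) best.reserved += 1; // hold the slot until the task finishes
  return best;
}

// Give the slot back once the task has completed or failed.
function releaseNode(node: NodeState): void {
  node.reserved = Math.max(0, node.reserved - 1);
}
```

Reserving before dispatch means a task is only streamed to a node that has already committed a slot to it.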

Compute nodes

Compute nodes execute models on your GPUs and CPUs inside layered OS sandboxes. They sync models from their subnet, run inference, and sign results before returning them.

  • GPU & CPU execution
  • Automatic model sync
  • Sandboxed isolation
  • Signed outputs
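
The page does not say which signature scheme the nodes use. As a sketch of what signed outputs can look like, the example below assumes Ed25519 keys and Node.js's built-in `node:crypto` module; `SignedResult`, `signResult` and `verifyResult` are hypothetical names.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Assumption: each node holds an Ed25519 key pair. In a real
// deployment the public key would be distributed to verifiers.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

interface SignedResult {
  nodeId: string;
  output: string;
  signature: string; // base64-encoded
}

// Serialize as a JSON array so the nodeId/output boundary is unambiguous.
function payloadOf(nodeId: string, output: string): Buffer {
  return Buffer.from(JSON.stringify([nodeId, output]), "utf8");
}

function signResult(nodeId: string, output: string): SignedResult {
  // Ed25519 takes no separate digest algorithm, hence `null`.
  const signature = sign(null, payloadOf(nodeId, output), privateKey);
  return { nodeId, output, signature: signature.toString("base64") };
}

function verifyResult(result: SignedResult): boolean {
  return verify(
    null,
    payloadOf(result.nodeId, result.output),
    publicKey,
    Buffer.from(result.signature, "base64"),
  );
}
```

A client or gateway holding the node's public key can then reject any result whose output or node ID was altered after signing.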

Architecture

A control plane and a compute fleet, cleanly separated

Requests flow from your applications through stateless gateways to sandboxed compute nodes. The hub orchestrates the whole topology - while every byte of data stays inside your network.

Diagram: request path and control plane

  • Your applications (REST / gRPC / WebSocket clients) send inference requests to the gateways.
  • Central Management Hub (models · nodes · routing · users · telemetry) orchestrates the topology.
  • Gateway A (subnet · region 1) and Gateway B (subnet · region 2) dispatch work to compute nodes.
  • Compute nodes: Node 1 (GPU), Node 2 (GPU), Node 3 (CPU).

Everything inside this boundary runs on infrastructure you own

Request lifecycle

How a single inference request flows

From client call to signed result, every step happens inside infrastructure you control.

  1. Submit: Your application sends a request to a gateway endpoint.
  2. Discover: The gateway finds healthy nodes serving the requested model.
  3. Reserve: Capacity is reserved on a node for the task.
  4. Dispatch: The task is streamed to the node and executed in a sandbox.
  5. Sign: The node signs the output to guarantee provenance.
  6. Return: Results stream back to your client - never leaving your network.
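
The six steps can be read as a strictly ordered pipeline that stops at the first failure. The sketch below models only that ordering; the stage names come from the steps above, while `runLifecycle`, `StepResult` and the handler shape are assumptions for illustration, not part of any NeuronCluster API.

```typescript
// Stage names taken from the request lifecycle steps above.
const STAGES = ["submit", "discover", "reserve", "dispatch", "sign", "return"] as const;
type Stage = (typeof STAGES)[number];

type StepResult = { ok: true } | { ok: false; reason: string };
type Steps = Record<Stage, () => StepResult>;

interface LifecycleTrace {
  completed: Stage[];
  failedAt?: Stage;
  reason?: string;
}

// Run each stage in order; stop and report at the first failure.
function runLifecycle(steps: Steps): LifecycleTrace {
  const completed: Stage[] = [];
  for (const stage of STAGES) {
    const result = steps[stage]();
    if (result.ok === false) {
      return { completed, failedAt: stage, reason: result.reason };
    }
    completed.push(stage);
  }
  return { completed };
}
```

A trace like this makes it easy to see how far a request got, for example that discovery succeeded but no node had capacity to reserve.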

Capabilities

Production-grade from day one

Any model, any modality

LLMs, vision, speech and audio, embeddings, classical ML or your own fine-tunes - in TorchScript, ONNX or Safetensors.

Resilient by design

Gateway and node failures are handled gracefully with reservations, retries and health checks - no single point of failure.
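
On the client side, retrying against a second gateway is one simple way to ride out a gateway failure. The helper below is a sketch under stated assumptions: `withFailover` is a hypothetical name, the gateway list is supplied by the caller, and the call is synchronous purely to keep the example short (a real client would be asynchronous and would likely add backoff).

```typescript
// Try each gateway in turn, cycling through the list, until one call
// succeeds or `maxAttempts` is exhausted.
function withFailover<T>(
  gateways: string[],
  call: (gateway: string) => T,
  maxAttempts: number = gateways.length,
): T {
  if (gateways.length === 0) throw new Error("no gateways configured");
  let lastMessage = "no attempts made";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const gateway = gateways[attempt % gateways.length];
    try {
      return call(gateway);
    } catch (err) {
      // Remember the most recent failure for the final error message.
      lastMessage = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`all ${maxAttempts} attempts failed; last error: ${lastMessage}`);
}
```

Retrying a request is only safe if the request is idempotent or the reservation for a failed attempt is released, which is a design point worth confirming for your deployment.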

Scales horizontally

Add gateways for throughput and nodes for capacity. Subnets keep workloads isolated and placed where they belong.

Developer-first

Ship AI features against infrastructure you own

One integration for every model and modality. Point your SDK at your own hub and start building - no data leaves the network.

  • Unified REST, gRPC and WebSocket APIs
  • Drop-in SDKs for the stacks your teams already use
  • Streaming responses and batch processing
  • Self-hosted endpoints - no third-party in the path

inference.ts

```typescript
import { NeuronCluster } from "@neuroncluster/sdk";

const nc = new NeuronCluster({
  endpoint: "https://hub.internal.acme.com",
  apiKey: process.env.NC_API_KEY,
});

// Same call, whether the model is an LLM,
// a vision model, or your own fine-tune.
const res = await nc.inference.create({
  model: "llama-3-70b-instruct",
  input: { prompt: "Summarize this contract..." },
});

console.log(res.output);
```

Bring inference in-house

See how NeuronCluster runs your models on your infrastructure - with the control, economics and compliance posture your organization needs.