EdgeMatrix.
The CUDA of edge AI.
EdgeMatrix has two layers: CORE (compiler + runtime engine) and EdgeFlow (the inference engine where models actually execute). Together they make any model run on any silicon. It's what CUDA is for NVIDIA, but hardware-agnostic by design. +73% throughput vs vLLM on L40s. 193 model architectures pre-tuned. Runs across five third-party silicon platforms — plus Krsna, our co-designed in-house SoC.
CUDA, but unlocked.
CUDA earned its name by doing four things, and by locking developers into one silicon vendor to get them. EdgeMatrix does the same four things without the lock. The comparison is earned, point by point. Here's what each claim means.
Like CUDA: one programming abstraction across the silicon.
CUDA gave developers one mental model that compiled down to every NVIDIA GPU generation. EdgeMatrix gives developers one programming abstraction that compiles down to NVIDIA, AMD, Intel, ARM, and Qualcomm — plus Krsna, the co-designed in-house SoC that targets a deliberate model-family subset.
Proof: 193 model architectures · 5 third-party silicon platforms · 1 in-house SoC · one binary
Like CUDA: the full toolchain, not just a library.
CUDA is compiler (nvcc) + runtime (driver) + libraries (cuBLAS, cuDNN). EdgeMatrix is CORE (compiler + runtime engine) + EdgeFlow (inference engine) + op libraries. Same architectural shape, deliberately.
Proof: L03 platform layer · L01 silicon co-design
Like CUDA: hand-tuned kernels for the silicon.
CUDA's value isn't the language — it's the decade of hand-tuned kernels in cuBLAS, cuDNN, cuSPARSE. EdgeMatrix has the equivalent for each silicon target: hybrid KV-cache, cache-aware scheduling, dynamic dispatch. The +73% vs vLLM is the receipt.
Proof: +73% on L40s · +29% on A100 · vs vLLM 0.10.2
Like CUDA: ships with the model architectures already supported.
CUDA shipped with the operators you needed. EdgeMatrix ships with 193 model architectures pre-tuned — across Transformers, SSMs (Mamba), RWKV, LFMs, CNNs, MoE, VLMs, and diffusion. Bring your own model or pick from the zoo.
Proof: CORE dispatches 8 architecture families × 5 third-party silicon platforms
Unlike CUDA: hardware-agnostic by design.
CUDA's biggest feature is its biggest constraint: it only runs on NVIDIA. That's NVIDIA's moat, and the developer's lock. EdgeMatrix optimizes natively for NVIDIA *and* AMD *and* Intel *and* ARM *and* Qualcomm, five third-party silicon platforms, plus Krsna, our co-designed in-house SoC. When the workload moves, the runtime moves with it. That's the only respect in which we don't want to be CUDA.
Faster than every open framework.
Benchmarks run on NVIDIA A100 (80 GB) and L40s. EdgeMatrix v0.0.4 vs vLLM v0.10.2 vs TensorRT-LLM v1.0.0. Numbers vary by model family and workload; see the research page for full disclosure.
EdgeMatrix vs leading runtimes
Same engine on enterprise hardware
Write once. Run anywhere.
EdgeMatrix is the connective tissue. Any of the 193 supported model architectures, including Llama, Qwen, Mistral, Shakti, Phi, DeepSeek, Gemma, and more, runs through one binary on any target: NVIDIA, AMD, Intel, ARM, Qualcomm, or the Krsna SoC. No re-quantization. No vendor lock-in. No per-target kernel team.
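Concretely, the developer experience looks something like the sketch below. Every name in it (the package, `load`, `compile`, the `target` argument) is illustrative shorthand, not EdgeMatrix's published API:

```python
# Illustrative shorthand only: the package name, load(), compile(),
# and target= are assumptions, not EdgeMatrix's published API.
import edgematrix as em

model = em.load("Llama-3.1-8B")               # one artifact, no re-quantization

for target in ("nvidia", "amd", "intel", "arm", "qualcomm", "krsna"):
    engine = model.compile(target=target)     # per-target kernels, same model
    print(engine.generate("Hello from the edge", max_tokens=16))
```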
Built for the production reality.
Hybrid KV cache reuse
Combines prefix-level and entity-level cache reuse to cut recomputation. Lifts tokens/sec by 29-73% over vLLM 0.10.2 and TensorRT-LLM 1.0.0 across the Top-25 enterprise SLMs.
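A minimal sketch of the idea, with made-up names and toy data structures rather than EdgeMatrix's internals: prefix-level reuse matches the longest cached token prefix, while entity-level reuse keys cached spans by content hash, so a repeated entity (a system prompt, a RAG document) can hit the cache even when it appears mid-sequence.

```python
# Toy sketch of hybrid KV-cache reuse: prefix-level plus entity-level.
# Names and data structures are illustrative, not EdgeMatrix internals.
from hashlib import blake2b

class HybridKVCache:
    def __init__(self):
        self.prefix_cache = {}  # tuple of prefix token ids -> KV blocks
        self.entity_cache = {}  # content hash of a span    -> KV blocks

    def longest_cached_prefix(self, tokens):
        """Longest already-cached prefix; only the tail gets recomputed."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self.prefix_cache:
                return end
        return 0

    def entity_hit(self, span_tokens):
        """Entity-level reuse: same content matches at any position."""
        key = blake2b(" ".join(map(str, span_tokens)).encode(),
                      digest_size=8).hexdigest()
        return self.entity_cache.get(key)

cache = HybridKVCache()
cache.prefix_cache[(1, 2, 3)] = "kv-blocks"
print(cache.longest_cached_prefix([1, 2, 3, 4, 5]))  # 3 -> recompute 2 tokens
```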
Dynamic compiler optimization
Adapts to model and hardware at runtime. No static configuration. Just-in-time kernel selection based on batch shape, sequence length, and target device.
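In spirit it works like the sketch below, a toy dispatcher under our own assumptions rather than the real one: the runtime buckets the dynamic dimensions and picks a kernel variant per (device, batch bucket, sequence bucket) at call time instead of freezing one configuration at export.

```python
# Toy runtime kernel selection; kernel names are hypothetical.
def bucket(n, edges=(1, 8, 64, 512, 4096)):
    """Snap a dynamic dimension to the smallest covering bucket."""
    return next((e for e in edges if n <= e), edges[-1])

KERNELS = {
    # (device, batch_bucket, seq_bucket) -> kernel variant
    ("l40s", 8, 512):  "gemm_splitk_fp16",
    ("l40s", 64, 512): "gemm_persistent_fp16",
}

def select_kernel(device, batch, seq_len):
    key = (device, bucket(batch), bucket(seq_len))
    return KERNELS.get(key, "gemm_generic")  # safe fallback

print(select_kernel("l40s", batch=6, seq_len=300))  # gemm_splitk_fp16
print(select_kernel("a100", batch=6, seq_len=300))  # gemm_generic
```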
Hardware-agnostic acceleration
Optimized for NVIDIA, AMD, Intel, ARM. Ready for NPUs, GPUs, FPGAs without re-architecting. Modular runtime — extend to new hardware in days, not quarters.
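What "days, not quarters" implies architecturally is a narrow backend interface that new silicon implements while everything above it stays untouched. A hypothetical sketch of that shape:

```python
# Sketch of a modular backend interface; all names are hypothetical,
# not EdgeMatrix's actual plugin API.
from abc import ABC, abstractmethod

class Backend(ABC):
    @abstractmethod
    def matmul(self, a, b): ...
    @abstractmethod
    def allocate(self, nbytes): ...

BACKENDS: dict[str, type[Backend]] = {}

def register(name):
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap

@register("cpu-reference")
class CPUReference(Backend):
    def matmul(self, a, b):
        # Naive reference path; a real backend calls vendor kernels.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]
    def allocate(self, nbytes):
        return bytearray(nbytes)

backend = BACKENDS["cpu-reference"]()
print(backend.matmul([[1, 2]], [[3], [4]]))  # [[11]]
```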
Quantization without quality loss
INT8 and INT4 quantization across model architectures with no noticeable accuracy degradation. Fits 70B-class models into 24 GB device footprints.
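The mechanism in miniature, as a generic textbook symmetric INT8 round-trip rather than EdgeMatrix's quantizer: weights are scaled into the int range, stored at a quarter of the bytes, and rescaled at compute time; a well-chosen scale keeps the round-trip error small.

```python
# Generic symmetric INT8 quantization round-trip; illustrative only,
# not EdgeMatrix's quantizer (which also covers INT4).
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                         # per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float):
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"stored at 1/4 the bytes, max round-trip error {err:.4f}")
```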
Cache-aware scheduling
Maximizes GPU/NPU utilization by routing requests based on cache locality. Higher concurrency, lower latency, no engineering effort from the model team.
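A toy version of the routing decision, illustrative rather than the production scheduler: send each request to the replica whose cache shares the longest token prefix with it, and break ties toward the least-loaded worker.

```python
# Toy cache-aware router; illustrative only, not EdgeMatrix's scheduler.
def shared_prefix(a, b):
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, workers):
    """workers: dicts with a 'cached' token list and a 'load' count."""
    return max(
        workers,
        key=lambda w: (shared_prefix(request_tokens, w["cached"]), -w["load"]),
    )

workers = [
    {"name": "gpu0", "cached": [1, 2, 3, 9], "load": 4},
    {"name": "gpu1", "cached": [1, 2, 3, 4], "load": 7},
]
print(route([1, 2, 3, 4, 5], workers)["name"])  # gpu1: longer cache hit wins
```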
Native VLM, MoE, multi-modal
Where vLLM and TensorRT-LLM still lack coverage for VLM architectures, EdgeMatrix runs Shakti-VLM, Qwen-VL, and frontier multi-modal models out of the box.
How the +73% lift actually works.
193 models and counting.
EdgeMatrix's modular runtime ships pre-tuned for the 193 most-used model architectures in enterprise agentic AI — text, VLM, MoE, and multi-modal. New models added in less than a week.
Shakti · 6 models · Shakti-2.5B, Shakti-VLM-1B, Shakti-VLM-4B
Llama · 12 models · Llama 3, Llama 3.1, Llama 4
Qwen · 18 models · Qwen 2, Qwen 2.5, Qwen2-VL
Mistral · 9 models · Mistral 7B, Mixtral 8x7B, Codestral
Phi · 6 models · Phi-3, Phi-3-Vision, Phi-3.5
DeepSeek · 11 models · DeepSeek V2, V3, R1, OCR
Gemma · 7 models · Gemma 2, Gemma 3
And many more · 124 models · Granite, Cohere, Yi, GLM, Falcon...