Krsna is the SoC. ExSLerate V2 is the neural engine inside it. Together they take aim at one thesis: for LLM inference at the edge, the memory wall is the only wall that matters. Dynamic Neural Compression cuts memory traffic in half, on the fly, with zero accuracy loss. Speech-to-speech runs on-device in real time. Time-to-first-token is measured in milliseconds, not seconds.
Memory traffic reduction · DNC peak
50%
Lossless weight compression
28%
Configurations · M64 → M4096
4
Native precision
INT4 · FP8
// 02 · PRODUCT LINEUP
One engine. Four configurations.
ExSLerate V2 ships in four configurations of the same neural engine. Same compute tile, same scheduler, same DNC pipeline, same compiler target. What changes between variants is MAC count and on-die memory budget, sized to four different thermal and product envelopes.
Apex
M4096
Real-time conversational AI for robotics and heavy edge applications. STT, text-to-text (TTT), and TTS in one inference pipeline. Sized for service robots, automotive HMIs, and industrial control surfaces where latency is the contract.
MAC count
4096
Target
Robotics · Automotive · Industrial
Surge
M1024
Edge in flight.
Light edge AI for drones and platforms where every gram and milliwatt counts. Object detection, classification, and on-board SLMs in the same envelope. The variant that goes where a fan cannot.
MAC count
1024
Target
Drones · Aerial · Light edge
Pulse
M256
Pocket inference.
Tuned for the audio-and-display class of consumer devices. Smartwatches with on-device NLU, smart speakers, and any product where the model is a feature shipping in the BOM, not a fallback to the cloud.
MAC count
256
Target
Smartwatch · Smart speaker
Lite
M64
Always on.
The lowest-power inference target in the family. Built for wearables and hearables where the model never sleeps because the battery cannot afford the wake-up cost. Always-on is the feature.
MAC count
64
Target
Always-on wearables · Hearables
// MAC COUNT SCALING
From wearables to robotics — 64× MAC range, log-scaled.
MAC count is the headline scale knob. Each variant carries the same ExSLerate V2 microarchitecture — only the cluster topology and memory budget change. First silicon FY26–27.
// 03 · CORE INNOVATION
Two engines. One memory wall, defeated.
The memory wall is what gates LLM inference on every edge device shipping today. ExSLerate V2 attacks it from both sides: bandwidth, with DNC; and activation cost, with the Infinite Series Engine, which keeps non-linear functions inside the datapath.
/ ENGINE 01
Dynamic Neural Compression
DNC · Tensulator + Tensor Codec
DNC compresses weights and KV-cache at line rate. The Tensulator accumulator bank and the Tensor Codec engine are patent-pending hardware blocks that sit in the data path between DRAM and the compute tiles. Compression happens during the read; decompression happens during execute. The compiler injects DNC operators automatically. Memory traffic drops by 50% at peak context length. There is no accuracy tradeoff to debate, because the operation is lossless.
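The principle behind lossless weight compression can be seen in a few lines. This is a toy sketch, not the Tensor Codec: `zlib` stands in for the hardware codec, and the synthetic weight distribution is ours. The point it illustrates is real, though — quantized LLM weights are heavily peaked around zero, so entropy coding saves genuine bandwidth while the round trip stays bit-exact.

```python
# Toy illustration of lossless compression on quantized weights.
# zlib stands in for the Tensor Codec (whose format is not public); the
# peaked INT4 distribution is synthetic. The takeaway: real traffic
# savings with a bit-exact round trip -- no accuracy tradeoff to debate.
import random
import zlib

random.seed(0)

# INT4 weights from a peaked, zero-centred distribution, packed two per byte.
nibbles = [min(7, max(-8, round(random.gauss(0, 1.5)))) & 0xF
           for _ in range(1 << 16)]
packed = bytes((nibbles[i] << 4) | nibbles[i + 1]
               for i in range(0, len(nibbles), 2))

compressed = zlib.compress(packed, level=9)
assert zlib.decompress(compressed) == packed   # bit-exact: lossless
savings = 1 - len(compressed) / len(packed)
print(f"traffic saved: {savings:.0%}")
```

A uniform distribution would compress to nothing; the savings come entirely from the skew that quantized neural weights actually have.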
/ ENGINE 02
Infinite Series Engine
SiLU, GeLU, Softmax, Sigmoid, Tanh. These are the functions that bottleneck most NPU pipelines, which ship them off-die to a CPU or special-function unit. The Infinite Series Engine evaluates them in place using polynomial coefficients that the compiler fits at build time. Output precision tracks BF16 to within rounding, on FP8 silicon, with none of the SFU area cost and none of the CPU round-trip latency.
SiLU · GeLU · Softmax · Sigmoid · Tanh · No CPU offload
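The idea of evaluating an activation as a fixed polynomial is easy to demonstrate. The sketch below is ours, not the ISE's actual scheme: it builds a degree-10 Chebyshev interpolant of sigmoid on [-4, 4] with nothing but the standard library, then evaluates it with multiply-adds only — no `exp`, no divide — the shape of operation a MAC datapath already has.

```python
# Our stand-in for compiler-fit activation coefficients: a degree-10
# Chebyshev interpolant of sigmoid on [-4, 4]. The real ISE coefficients
# and evaluation scheme are not public; this only shows the principle.
import math

LO, HI, DEG = -4.0, 4.0, 10

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Chebyshev nodes tame oscillation at the interval edges.
nodes = [0.5 * (LO + HI) + 0.5 * (HI - LO) *
         math.cos((2 * k + 1) * math.pi / (2 * (DEG + 1)))
         for k in range(DEG + 1)]

# Newton divided differences give the interpolant's coefficients.
coef = [sigmoid(x) for x in nodes]
for j in range(1, DEG + 1):
    for i in range(DEG, j - 1, -1):
        coef[i] = (coef[i] - coef[i - 1]) / (nodes[i] - nodes[i - j])

def poly_sigmoid(x):
    """Horner-style evaluation: DEG multiply-adds, no exp, no divide."""
    acc = coef[DEG]
    for i in range(DEG - 1, -1, -1):
        acc = acc * (x - nodes[i]) + coef[i]
    return acc

max_err = max(abs(poly_sigmoid(x / 100) - sigmoid(x / 100))
              for x in range(-400, 401))
print(f"max |error| on [-4, 4]: {max_err:.2e}")
```

Production implementations typically use piecewise fits over narrower segments, which pushes the error down further for the same per-segment degree.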
// 04 · FUNCTIONALITY
Built for what ships. Four model families, one stack.
ExSLerate V2 covers the four model families that show up in real products today. Language, speech, vision, state space. Same compiler, same operator set, same runtime, all the way down to the silicon.
LLM / SLM
Text generation
Llama · Shakti · Qwen · Gemma
Speech AI
STT & TTS
Sruthi · Svara · Moonshine · Whisper
Computer vision
CNN inference
ResNet · YOLO · VGG
State space
Linear recurrence
Mamba · Jamba
[ ↳ ]
Performance data is shared separately. Throughput, latency, power, and per-configuration benchmarks are released under NDA on an engagement basis. For the full performance dossier and FPGA prototype numbers, write to sales@sandlogic.com.
// 05 · MEMORY WALL
How 8 GB of RAM holds 128K tokens.
LLM serving fails on edge hardware not because compute is short, but because the weights and the KV-cache do not fit. DNC turns that math around. Below is what an 8 GB endpoint actually carries with the compiler putting compressed tensors in RAM.
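The arithmetic can be sanity-checked with public Llama 3 8B shapes (32 layers, 8 KV heads, head dimension 128) at FP8 storage, one byte per value. A rough sketch under those assumptions — the 28% and 50% figures are the DNC numbers quoted on this page; everything else is our own back-of-envelope math, not a vendor spec:

```python
# Rough arithmetic, not a vendor spec: public Llama 3 8B shapes
# (32 layers, 8 KV heads, head dim 128), FP8 = 1 byte per value.
GIB = 1024 ** 3
layers, kv_heads, head_dim = 32, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim   # K and V planes
assert kv_bytes_per_token == 65_536                     # 64 KiB per token

weights = 8.0e9          # ~8B params at 1 byte each
ram = 8 * GIB

# Uncompressed, the weights leave under 0.6 GiB for KV-cache, runtime,
# and OS -- consistent with the 0 (OOM) baseline row.
assert ram - weights < 0.6 * GIB

# With DNC (28% off weights, 50% off KV-cache), 40k tokens fit in RAM.
ctx = 40_000
footprint = weights * (1 - 0.28) + ctx * kv_bytes_per_token * 0.5
print(f"compressed footprint at 40k tokens: {footprint / GIB:.2f} GiB")
assert footprint < ram
```

The same 40k tokens uncompressed would need roughly 10.6 GB — over budget before the OS loads.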
// DNC TRAFFIC COMPARISON
The DRAM-to-compute bus is where edge LLMs lose.
Half the bandwidth in, full tensors out. The DRAM bus carries compressed weights and KV-cache. The Tensor Codec decompresses them on die before they reach the compute tiles. The expensive bus is narrowed; the compute side never sees a compromise.
28%
Lossless weight compression
Measured on Llama 3 8B. Static compression done once at compile time, no runtime tradeoff.
50%
Memory traffic reduction at peak context
Dynamic compression of KV-cache and activation reads, end to end. The bigger the context, the bigger the win.
// CONTEXT EXPANSION
Context window on an 8 GB endpoint
FP8 PRECISION · RAM + SSD swap on an 8 GB endpoint
| Model | Standard RAM (baseline) | ExSLerate V2 + DNC | SSD extension | Total max context |
| Llama 3 · 8B | 0 (OOM) | 40k tokens | +88k tokens | 128k tokens |
| Shakti · 2.5B | 45.4k tokens | 92k tokens | +36k tokens | 128k tokens |
| Shakti · 500M | 32k tokens | 32k tokens | Fits in RAM | 32k tokens |
// 06 · ACCURACY VERIFICATION
FP8 accuracy. Within rounding of BF16.
Zero-shot, five-attempt evaluation. BF16 baseline on NVIDIA A100 versus FP8 (E4M3) on Krsna · ExSLerate V2. The deltas below are what they look like in practice, not in a marketing slide.
| Config | MMLU | SST-2 | GSM8K | COT | PIQA | HELLA | WINO | BoolQ | Lamb | ARC-C |
| Llama 3.1 8B · A100 · BF16 baseline | 65.68% | 94.00% | 55.00% | 82.00% | 55.00% | 79.00% | 78.00% | 69.00% | 52.00% | 69.88% |
| Llama 3.1 8B · Krsna · ExSLerate V2 · FP8 (E4M3) | 62.91% | 93.00% | 44.00% | 84.00% | 53.00% | 78.00% | 80.00% | 65.00% | 51.00% | 67.87% |
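A quick pass over the table, taking the printed numbers at face value. The deltas and their average are our own arithmetic, not part of the published results:

```python
# Our arithmetic on the table's printed values -- not part of the
# published results. Per-task delta (BF16 minus FP8) and the average.
bf16 = [65.68, 94.00, 55.00, 82.00, 55.00, 79.00, 78.00, 69.00, 52.00, 69.88]
fp8  = [62.91, 93.00, 44.00, 84.00, 53.00, 78.00, 80.00, 65.00, 51.00, 67.87]
tasks = ["MMLU", "SST-2", "GSM8K", "COT", "PIQA",
         "HELLA", "WINO", "BoolQ", "Lamb", "ARC-C"]

deltas = [b - f for b, f in zip(bf16, fp8)]
for task, d in zip(tasks, deltas):
    print(f"{task:6s} {d:+6.2f} pts")

avg = sum(deltas) / len(deltas)
print(f"average delta: {avg:+.2f} pts")   # GSM8K dominates; FP8 wins COT, WINO
```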
// 07 · SOFTWARE STACK
Built on IREE. Open from frontend to silicon.
Hardware is half the product. The ExSLerate SDK is built on IREE, the open MLIR-based compiler and runtime. Standard dialects in, .vmfb out. No proprietary frontend, no vendor lock-in, no rewrite of your model.
No vendor lock-in
Models enter the toolchain through standard MLIR dialects, Linalg and TOSA. Anything that targets IREE today will target ExSLerate V2 tomorrow. Your existing toolchain stays put.
Broad frontend compatibility
PyTorch, TensorFlow, and JAX are first-class. The frontend you ship in is the frontend you stay in. No re-export, no rewrite, no parallel model branch.
Flexible deployment
IREE decouples the model graph from the hardware executable. Update one without rebuilding the other. The HAL handles scheduling and runtime; the FlatBuffer carries the deployable.
ExSLerate extensions
Three custom passes ride on top of stock IREE: graph optimization tuned to the tile, DNC injection at the right edges, and quantization for INT4 and FP8 native paths.
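What "DNC injection at the right edges" could look like can be sketched as a graph pass. Everything below is illustrative — the toy IR, the `Node` class, and the `dnc_pass` name are ours; the real pass runs inside IREE on MLIR, not on a Python list. The op names only nod to the Linalg dialect mentioned above:

```python
# Hypothetical sketch of DNC operator injection as a graph pass.
# Toy IR, not IREE: the real pass operates on MLIR inside the compiler.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    inputs: list = field(default_factory=list)

def dnc_pass(graph):
    """Wrap every DRAM->compute edge in a compress/decompress pair."""
    out = []
    for node in graph:
        new_inputs = []
        for src in node.inputs:
            if src.op == "dram.load":                  # the expensive edge
                comp = Node("dnc.compress", [src])     # during the read
                dec = Node("dnc.decompress", [comp])   # on-die, pre-compute
                out += [comp, dec]
                new_inputs.append(dec)
            else:
                new_inputs.append(src)
        node.inputs = new_inputs
        out.append(node)
    return out

weights = Node("dram.load")
matmul = Node("linalg.matmul", [weights])
g = dnc_pass([weights, matmul])
assert [n.op for n in g] == ["dram.load", "dnc.compress",
                             "dnc.decompress", "linalg.matmul"]
```

The consumer never sees a compressed tensor — which is the whole contract: the narrow bus carries compressed data, the compute tile sees full tensors.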
// 08 · TARGET MARKETS
Each variant of ExSLerate V2 is sized for a different shape of product. Same compiler, same operators, same model files. Four envelopes, four markets.
M4096 · Apex
Robotics & heavy edge AI.
→ Service and receptionist robots
→ Hospital and elderly-care assistants
→ Automotive HMIs and cockpit voice agents
→ Industrial control surfaces with NLU
→ Real-time speech-to-speech translation
M1024 · Surge
Drones & light edge AI.
→ Aerial inspection and survey drones
→ Delivery and logistics drones
→ IoT and security cameras
→ Industrial vision and anomaly detection
→ On-board SLM for autonomous platforms
M256 · Pulse
Smartwatch & smart speaker.
→ Smartwatches with on-device NLU
→ Smart speakers with local intent
→ Home hubs and voice appliances
→ In-ear and audio-first devices
→ Display-bearing wearables
M64 · Lite
Always-on wearables.
→ Hearables and earbuds
→ Fitness and health bands
→ Continuous biometric monitors
→ Always-listening keyword and wake detect
→ Low-power sensor fusion endpoints
// 09 · ROADMAP
The ExSLerate evolution.
ExSLerate V2 is the first generation in market. The roadmap pushes DNC from weight and KV-cache compression today to a second generation that makes 27B-class models fit on cost-effective SOHO server hardware, and a third that enters the data center.
Now available
ExSLerate V2
Endpoint & robotics
DNC Gen 1
Run 8B-class models on edge devices. Weight compression of 28%, memory traffic reduction of 50% at peak context. Ships in the Krsna SoC across four configurations.
Next gen
ExSLerate V3
SOHO server
DNC Gen 2
60% weight compression, 50% KV-cache compression. Targets local 27B-class inference for enterprise RAG on a single 24 GB GDDR6 card. Half the RAM and a third of the memory bus width vs the standard requirement.
Future
ExSLerate V4
Data center
DNC Gen 3
A100-class throughput envelope. 70% compression target. Built for full-rack deployment in sovereign and private clouds. The endgame of the silicon program.
// V3 · 27B-CLASS · SOHO SERVER
Architecture efficiency vs standard requirement.
| Specification | Standard requirement | SandLogic ExSLerate V3 |
| Required RAM | 48 GB GDDR6 | 24 GB GDDR6 (2× smaller) |
| Memory bus width | 384-bit (expensive) | 128-bit (optimized) |
| Target application | SOHO / local privacy | SOHO server · local RAG |
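In round numbers, the RAM claim checks out under one assumption of ours — FP8 storage at one byte per parameter, since V3's serving precision is not stated here. A 27B-class model is then ~27 GB raw, and DNC Gen 2's 60% weight compression brings it under half of a 24 GB card:

```python
# Our arithmetic on the figures above; FP8 (1 byte/param) is our assumption.
params = 27e9
raw_gb = params / 1e9                    # ~27 GB of FP8 weights
compressed_gb = raw_gb * (1 - 0.60)      # DNC Gen 2: 60% weight compression
print(f"weights on card: {compressed_gb:.1f} GB of 24 GB")

assert round(compressed_gb, 1) == 10.8   # fits the 24 GB card with headroom
assert 24 / 48 == 0.5                    # half the RAM of the baseline
assert 384 / 128 == 3.0                  # one third the bus width
```

The remaining ~13 GB is what holds KV-cache and working set for the enterprise RAG target.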
Notes. Detailed throughput, latency, power, and per-configuration benchmarks for the ExSLerate V2 engine and Krsna SoC are released under NDA on an engagement basis. For the full performance dossier, contact sales@sandlogic.com. Tensulator and Tensor Codec are the subject of a PCT international patent application filed with the Indian Patent Office.
// LET'S BUILD
License Krsna IP. Or run it on the simulator first.