// L01 — KRSNA SOC · EXSLERATE V2 / REV 3.0

Krsna SoC.
Made for the Talk-to-Chip era.

Krsna is the SoC. ExSLerate V2 is the neural engine inside it. Together they take aim at one thesis: for LLM inference at the edge, the memory wall is the only wall that matters. Dynamic Neural Compression cuts memory traffic in half, on the fly, with zero accuracy loss. Speech-to-speech runs on-device in real time. Time-to-first-token is measured in milliseconds, not seconds.

Memory traffic reduction · DNC peak
50%
Lossless weight compression
28%
Configurations · M64 → M4096
4
Native precision
INT4 · FP8

One engine. Four configurations.

ExSLerate V2 ships in four configurations of the same neural engine. Same compute tile, same scheduler, same DNC pipeline, same compiler target. What changes between variants is MAC count and on-die memory budget, sized to four different thermal and product envelopes.

Krsna is the SoC product. The underlying IP is licensable separately — see ExSLerate IP licensing →

Apex
M4096
Talk to chip.

Real-time conversational AI for robotics and heavy edge applications. Speech-to-text, text-to-text, and text-to-speech in one inference pipeline. Sized for service robots, automotive HMIs, and industrial control surfaces where latency is the contract.

MAC count
4096
Target
Robotics · Automotive · Industrial
Surge
M1024
Edge in flight.

Light edge AI for drones and platforms where every gram and milliwatt counts. Object detection, classification, and on-board SLMs in the same envelope. The variant that goes where a fan cannot.

MAC count
1024
Target
Drones · Aerial · Light edge
Pulse
M256
Pocket inference.

Tuned for the audio-and-display class of consumer devices. Smartwatches with on-device NLU, smart speakers, and any product where the model is a feature shipping in the BOM, not a fallback to the cloud.

MAC count
256
Target
Smartwatch · Smart speaker
Lite
M64
Always on.

The lowest-power inference target in the family. Built for wearables and hearables where the model never sleeps because the battery cannot afford the wake-up cost. Always-on is the feature.

MAC count
64
Target
Always-on wearables · Hearables
// MAC COUNT SCALING

From wearables to robotics — 64× MAC range, log-scaled.

M64 · Krsna Lite (Always-on wearables)
M256 · Krsna Pulse (Smartwatch · speaker)
M1024 · Krsna Surge (Drones · aerial)
M4096 · Krsna Apex (Robotics · auto)
Legend: Shipped · In flight
MAC count is the headline scale knob. Each variant carries the same ExSLerate V2 microarchitecture — only the cluster topology and memory budget change. First silicon FY26–27.

Two engines. One memory wall, defeated.

The Memory Wall is what gates LLM inference on every edge device shipping today. ExSLerate V2 attacks it from both sides: bandwidth, with DNC; and activation cost, with the Infinite Series Engine that keeps non-linear functions inside the datapath.

/ ENGINE 01

Dynamic Neural Compression

DNC · Tensulator + Tensor Codec

DNC compresses weights and KV-cache at line rate. The Tensulator accumulator bank and the Tensor Codec engine are patent-pending hardware blocks that sit in the data path between DRAM and the compute tiles. Compression happens during the read; decompression happens during execution. The compiler injects DNC operators automatically. Memory traffic drops by 50% at peak context length. There is no accuracy tradeoff to debate, because the operation is lossless.

Weight compression · KV-cache compression · Lossless · Line-rate · PCT filed
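As a software analogy for that lossless guarantee (the Tensulator and Tensor Codec algorithms themselves are not public), a general-purpose codec over a quantized weight blob shows the same contract: fewer bits across the link, bit-exact tensors after decode. A minimal Python sketch:

```python
import zlib
import numpy as np

# Stand-in weights: uniform 4-bit values stored in 8-bit containers,
# so roughly 4 bits of entropy per byte for a codec to reclaim.
w = np.random.randint(-8, 8, size=1 << 20, dtype=np.int8)
blob = w.tobytes()

comp = zlib.compress(blob, 6)            # software codec; DNC does this in hardware, at line rate
assert zlib.decompress(comp) == blob     # lossless: the round trip is bit-exact
print(f"compressed/raw ratio: {len(comp) / len(blob):.2f}")
```

Lossless means the accuracy question never arises: the compute tiles consume exactly the tensors the compiler produced.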
/ ENGINE 02

Infinite Series Engine

High-fidelity non-linear math

SiLU, GeLU, Softmax, Sigmoid, Tanh. The functions that bottleneck most NPU pipelines, which either ship them off-die to the CPU or burn area on a dedicated SFU. The Infinite Series Engine evaluates them in place, using polynomial coefficients that the compiler fits at build time. Output precision tracks BF16 to within rounding, on FP8 silicon, with none of the SFU area cost and none of the CPU round-trip latency.

SiLU · GeLU · Softmax · Sigmoid · Tanh · No CPU offload
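The build-time half of that claim is easy to make concrete. A minimal sketch of the principle, with an illustrative degree and input range rather than the engine's actual parameters: fit a polynomial to SiLU offline, then evaluate it with nothing but multiply-accumulates.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

# Build time: fit a fixed-degree polynomial over a bounded input range.
xs = np.linspace(-6.0, 6.0, 4096)
coeffs = np.polyfit(xs, silu(xs), deg=7)   # highest-degree coefficient first

# Run time: Horner's rule, i.e. a chain of multiply-accumulates, exactly
# the operation a MAC datapath already provides. The function never
# leaves the datapath.
def silu_poly(x, c):
    y = np.zeros_like(x)
    for a in c:
        y = y * x + a
    return y

err = np.max(np.abs(silu_poly(xs, coeffs) - silu(xs)))
print(f"max |error| on [-6, 6]: {err:.2e}")
```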

Built for what ships. Four model families, one stack.

ExSLerate V2 covers the four model families that show up in real products today. Language, speech, vision, state space. Same compiler, same operator set, same runtime, all the way down to the silicon.

LLM / SLM
Text generation
Llama · Shakti · Qwen · Gemma
Speech AI
STT & TTS
Sruthi · Svara · Moonshine · Whisper
Computer vision
CNN inference
ResNet · YOLO · VGG
State space
Linear recurrence
Mamba · Jamba

Performance data is shared separately. Throughput, latency, power, and per-configuration benchmarks are released under NDA on an engagement basis. For the full performance dossier and FPGA prototype numbers, write to sales@sandlogic.com.

// 05 · MEMORY WALL

How 8 GB of RAM
holds 128K tokens.

LLM serving fails on edge hardware not because compute is short, but because the weights and the KV-cache do not fit. DNC turns that math around. Below is what an 8 GB endpoint actually carries once the compiler places compressed tensors in RAM; a back-of-envelope sketch of the sizing math follows the context table.

// DNC TRAFFIC COMPARISON

The DRAM-to-compute bus is where edge LLMs lose.

STANDARD NPU PATH: DRAM (weights + KV-cache, full) → compute tiles. 100% traffic; full tensors cross the bus on every read.
KRSNA · DNC PATH: DRAM (compressed weights + KV-cache) → on-die DNC engine (Tensor Codec · Tensulator) → compute tiles. ~50% DRAM traffic; compressed across the bus, full tensors at compute.

Half the bandwidth in, full tensors out. The DRAM bus carries compressed weights and KV-cache. The Tensor Codec decompresses them on die before they reach the compute tiles. The expensive bus is narrowed; the compute side never sees a compromise.

28%

Lossless weight compression

Measured on Llama 3 8B. Static compression done once at compile time, no runtime tradeoff.

50%

Memory traffic reduction at peak context

Dynamic compression of KV-cache and activation reads, end to end. The bigger the context, the bigger the win.

// CONTEXT EXPANSION

Context window on an 8 GB endpoint

FP8 PRECISION · RAM + SSD swap on an 8 GB endpoint

Model         | Standard RAM (baseline) | ExSLerate V2 + DNC | SSD extension | Total max context
Llama 3 · 8B  | 0 (OOM)                 | 40k tokens         | +88k tokens   | 128k tokens
Shakti · 2.5B | 45.4k tokens            | 92k tokens         | +36k tokens   | 128k tokens
Shakti · 500M | 32k tokens              | 32k tokens         | Fits in RAM   | 32k tokens
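Where those row figures come from is ordinary sizing arithmetic. A back-of-envelope sketch in Python, using the public Llama 3 8B shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and the DNC figures quoted above; runtime and activation overheads, which the product numbers do account for, are left out here.

```python
GiB = 1 << 30

params = 8.03e9                                        # Llama 3 8B parameter count
layers, kv_heads, head_dim, fp8 = 32, 8, 128, 1        # FP8 (E4M3) = 1 byte per value

weights_raw = params * fp8                             # ~7.5 GiB: fills an 8 GB part on its own -> 0 (OOM)
weights_dnc = weights_raw * (1 - 0.28)                 # 28% lossless weight compression

kv_per_tok = 2 * layers * kv_heads * head_dim * fp8    # K and V, every layer, per token
kv_dnc = kv_per_tok * 0.5                              # 50% reduction at peak context

print(f"weights:  {weights_raw / GiB:.2f} GiB raw -> {weights_dnc / GiB:.2f} GiB with DNC")
print(f"KV-cache: {kv_per_tok / 1024:.0f} KiB/token raw -> {kv_dnc / 1024:.0f} KiB/token with DNC")
```

Halving the per-token KV footprint is what stretches the resident context; the SSD column then extends it further via the RAM + SSD swap noted above.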

FP8 accuracy. Within rounding of BF16.

Zero-shot, five-attempt evaluation. BF16 baseline on NVIDIA A100 versus FP8 (E4M3) on Krsna · ExSLerate V2. The deltas below are what they look like in practice, not in a marketing slide.

Config                                           | MMLU   | SST-2  | GSM8K  | COT    | PIQA   | HELLA  | WINO   | BoolQ  | Lamb   | ARC-C
Llama 3.1 8B · A100 · BF16 baseline              | 65.68% | 94.00% | 55.00% | 82.00% | 55.00% | 79.00% | 78.00% | 69.00% | 52.00% | 69.88%
Llama 3.1 8B · Krsna · ExSLerate V2 · FP8 (E4M3) | 62.91% | 93.00% | 44.00% | 84.00% | 53.00% | 78.00% | 80.00% | 65.00% | 51.00% | 67.87%
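For intuition on what "within rounding" costs at the format level, a quick round-trip through E4M3 in PyTorch (2.1 or newer); this exercises the number format only, not the Krsna datapath:

```python
import torch

x  = torch.randn(1 << 16, dtype=torch.bfloat16)
xq = x.to(torch.float8_e4m3fn).to(torch.bfloat16)      # cast down to E4M3 and back

# E4M3 keeps a 3-bit mantissa, so per-value relative error lands in the
# low single-digit percent range for values well inside its dynamic range.
rel = ((x - xq).abs() / x.abs().clamp_min(1e-2)).mean()
print(f"mean relative round-trip error: {rel.item():.4f}")
```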

Built on IREE. Open from frontend to silicon.

Hardware is half the product. The ExSLerate SDK is built on IREE, the open MLIR-based compiler and runtime. Standard dialects in, .vmfb out. No proprietary frontend, no vendor lock, no rewrite of your model.

No vendor lock-in

Models enter the toolchain through standard MLIR dialects, Linalg and TOSA. Anything that targets IREE today will target ExSLerate V2 tomorrow. Your existing toolchain stays put.

Broad frontend compatibility

PyTorch, TensorFlow, and JAX are first-class. The frontend you ship in is the frontend you stay in. No re-export, no rewrite, no parallel model branch.

Flexible deployment

IREE decouples the model graph from the hardware executable. Update one without rebuilding the other. The HAL handles scheduling and runtime; the FlatBuffer carries the deployable.

ExSLerate extensions

Three custom passes ride on top of stock IREE: graph optimization tuned to the tile, DNC injection at the right edges, and quantization for INT4 and FP8 native paths.

// COMPILATION FLOW
User Model
PyTorch · JAX · TensorFlow
↓  Import to MLIR (Linalg / TOSA)
IREE Compiler + ExSLerate Extensions
Graph optimization · DNC injection · Quantization
↓  .vmfb (Virtual Machine FlatBuffer)
IREE-Based ExSLerate Runtime
HAL (Hardware Abstraction Layer) · Scheduling
↓  PCIe / AXI
Krsna SoC · ExSLerate V2 Engine
M64 · M256 · M1024 · M4096
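In code, that flow reduces to a standard IREE invocation. A minimal sketch using IREE's Python compiler API; "llvm-cpu" is a stand-in target, since the actual ExSLerate V2 backend string ships with the SDK:

```python
from iree.compiler import compile_file   # pip install iree-compiler

# Importer-produced MLIR (Linalg / TOSA) in, deployable FlatBuffer out.
vmfb = compile_file(
    "model.mlir",
    target_backends=["llvm-cpu"],        # assumption: replace with the SDK's ExSLerate target
)

with open("model.vmfb", "wb") as f:
    f.write(vmfb)                        # consumed by the IREE-based runtime via the HAL
```

From there the runtime loads the FlatBuffer and the HAL schedules it onto the engine over PCIe or AXI, per the flow above.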

Operator coverage

Engine-accelerated activations

ReLU · PReLU · Squared ReLU · Batch Norm · Layer Norm · Block Norm · GeLU · SiLU / Swish · Sigmoid · Tanh · Per-Tensor Quant · Per-Channel Quant · Custom Non-Linear

Supported precision

Native datapath formats

INT4 · FP8 (E4M3)
// 08 · DEPLOYMENTS

Where Krsna lives.

Each variant of ExSLerate V2 is sized for a different shape of product. Same compiler, same operators, same model files. Four envelopes, four markets.

M4096 · Apex

Robotics & heavy edge AI.

  • Service and receptionist robots
  • Hospital and elderly-care assistants
  • Automotive HMIs and cockpit voice agents
  • Industrial control surfaces with NLU
  • Real-time speech-to-speech translation
M1024 · Surge

Drones & light edge AI.

  • Aerial inspection and survey drones
  • Delivery and logistics drones
  • IoT and security cameras
  • Industrial vision and anomaly detection
  • On-board SLM for autonomous platforms
M256 · Pulse

Smartwatch & smart speaker.

  • Smartwatches with on-device NLU
  • Smart speakers with local intent
  • Home hubs and voice appliances
  • In-ear and audio-first devices
  • Display-bearing wearables
M64 · Lite

Always-on wearables.

  • Hearables and earbuds
  • Fitness and health bands
  • Continuous biometric monitors
  • Always-listening keyword and wake detect
  • Low-power sensor fusion endpoints

The ExSLerate evolution.

ExSLerate V2 is the first generation in market. The roadmap pushes DNC from weight and KV-cache compression today to a second generation that makes 27B-class models fit on cost-effective SOHO server hardware, and a third that enters the data center.

Now available

ExSLerate V2

Endpoint & robotics

DNC Gen 1

Run 8B-class models on edge devices. Weight compression of 28%, memory traffic reduction of 50% at peak context. Ships in the Krsna SoC across four configurations.

Next gen

ExSLerate V3

SOHO server

DNC Gen 2

60% weight compression, 50% KV-cache compression. Targets local 27B-class inference for enterprise RAG on a single 24 GB GDDR6 card. Half the RAM and a third of the memory bus width vs the standard requirement.

Future

ExSLerate V4

Data center

DNC Gen 3

A100-class throughput envelope. 70% compression target. Built for full-rack deployment in sovereign and private clouds. The endgame of the silicon program.

// V3 · 27B-CLASS · SOHO SERVER

Architecture efficiency vs standard requirement.

Specification      | Standard requirement | SandLogic ExSLerate V3
Required RAM       | 48 GB GDDR6          | 24 GB GDDR6 (2× smaller)
Memory bus width   | 384-bit (expensive)  | 128-bit (optimized)
Target application | SOHO / local privacy | SOHO server · local RAG

Notes. Detailed throughput, latency, power, and per-configuration benchmarks for the ExSLerate V2 engine and Krsna SoC are released under NDA on an engagement basis. For the full performance dossier, contact sales@sandlogic.com. Tensulator and Tensor Codec are the subject of a PCT international patent application filed with the Indian Patent Office.

// LET'S BUILD

License Krsna IP. Or run it on the simulator first.