arXiv cs.AI June 26, 2026 · Papers

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

arXiv:2606.26587v1 Announce Type: cross Abstract: Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern accelerators, yet combining them for LLM activation compression remains challenging: activations contain input-dependent outliers that dominate block scales in FP4 quantizat
