Shuming Ma

马树铭

Research on LLM pretraining, model architecture, and reasoning.

I work on large language models with an emphasis on scalable pretraining, efficient architectures, and reasoning. Recent projects include BitNet, bitnet.cpp, Q-Sparse, TorchScale, LongNet, and DeepNet.


News

2025
Introduced LongReasonArena, a benchmark for long reasoning with tasks scaling up to 1 million reasoning tokens. [paper]
2025
Released the BitNet b1.58 2B4T technical report for an open-source native 1-bit LLM at the 2B scale trained on 4 trillion tokens. [tech report] [huggingface]
2025
Published Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, a study of more efficient test-time scaling. [paper]
2025
Released bitnet.cpp, an inference stack for 1-bit and ternary LLMs designed for efficient edge deployment. [paper] [github]

Selected Publications

For a full publication list, see Google Scholar.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
2024
Introduces BitNet b1.58, showing that LLMs with ternary weights (1.58 bits per weight) can match full-precision baselines at substantially lower cost.
BitNet: Scaling 1-bit Transformers for Large Language Models
2023
Introduces BitNet, a scalable and stable 1-bit transformer architecture for large language model pretraining.
BitNet b1.58 2B4T Technical Report
2025
Presents the open-source 2B native 1-bit LLM trained on 4 trillion tokens and released with model weights.
bitnet.cpp: Efficient Edge Inference for Ternary LLMs
2025
An inference system for ternary and 1-bit LLMs with optimized kernels for efficient, lossless edge deployment.
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
2024
Trains LLMs with fully sparse activations in their linear layers, enabling more efficient inference.
TorchScale: Transformers at Scale
2022
An open-source toolkit for scaling transformers, including architectures such as DeepNet and LongNet.
LongNet: Scaling Transformers to 1,000,000,000 Tokens
2023
Introduces dilated attention to scale transformer context length to more than 1 billion tokens.
DeepNet: Scaling Transformers to 1,000 Layers
2022
Introduces DeepNorm and initialization strategies that stabilize extremely deep transformer training.