Shuming Ma
马树铭
Research on LLM pretraining, model architecture, and reasoning.
I work on large language models with an emphasis on scalable pretraining, efficient architectures, and reasoning. Recent projects include BitNet, bitnet.cpp, Q-Sparse, TorchScale, LongNet, and DeepNet.
News
2025
Introduced LongReasonArena, a benchmark for long reasoning that scales tasks to as many as 1 million reasoning tokens.
[paper]
2025
Released the BitNet b1.58 2B4T technical report, describing an open-source native 1-bit LLM at the 2B scale trained on 4 trillion tokens.
[tech report]
[huggingface]
2025
Published Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, a study of more efficient test-time scaling.
[paper]
Selected Publications
For a full publication list, see Google Scholar.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
2024
Introduces BitNet b1.58, showing that transformers with ternary (1.58-bit) weights can match full-precision baselines at substantially lower cost.
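The core idea can be illustrated with the absmean quantization scheme from the BitNet b1.58 paper: scale a weight matrix by its mean absolute value, then round and clip to {-1, 0, +1}. A minimal NumPy sketch (the function name and epsilon handling are illustrative choices, not from the paper):

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Quantize a weight matrix to ternary values {-1, 0, +1}.

    A sketch of the absmean scheme described in the BitNet b1.58 paper:
    scale by the mean absolute value, round, then clip.
    """
    gamma = np.abs(w).mean() + eps           # per-tensor scale
    w_q = np.clip(np.round(w / gamma), -1, 1)
    return w_q, gamma                        # dequantize as w_q * gamma

# usage: quantize a random weight matrix
w = np.random.randn(4, 4)
w_q, gamma = absmean_ternary_quantize(w)
assert set(np.unique(w_q)).issubset({-1.0, 0.0, 1.0})
```

Because every quantized weight is -1, 0, or +1, matrix multiplication reduces to additions and subtractions, which is what enables the efficiency gains.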
BitNet: Scaling 1-bit Transformers for Large Language Models
2023
Introduces BitNet, a scalable and stable 1-bit transformer architecture for large language model pretraining.
BitNet b1.58 2B4T Technical Report
2025
Presents the open-source 2B native 1-bit LLM trained on 4 trillion tokens and released with model weights.
bitnet.cpp: Efficient Edge Inference for Ternary LLMs
2025
An inference system for ternary and 1-bit LLMs with optimized kernels for efficient, lossless edge deployment.
Q-Sparse
2024
Trains LLMs with fully sparsely-activated linear transformations for more efficient inference.
TorchScale: Transformers at Scale
2022
An open-source toolkit for scaling transformers, including architectures such as DeepNet and LongNet.
LongNet: Scaling Transformers to 1,000,000,000 Tokens
2023
Introduces dilated attention, scaling transformer context length to 1 billion tokens.
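A toy sketch of the sparsification idea behind dilated attention: the sequence is split into segments, and within each segment only every r-th position participates, so per-segment cost drops by a factor of r (the function below is an illustration of the index pattern only, not the paper's implementation):

```python
def dilated_indices(seq_len: int, segment_len: int, dilation: int):
    """Positions kept in each segment under one (segment, dilation) pair.

    Illustrates the sparsity pattern behind LongNet's dilated attention:
    within each segment of length `segment_len`, keep every
    `dilation`-th position.
    """
    kept = []
    for start in range(0, seq_len, segment_len):
        kept.extend(range(start, min(start + segment_len, seq_len), dilation))
    return kept

# 16 tokens, segments of 8, dilation 2 -> half the positions remain
print(dilated_indices(16, 8, 2))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

In the full method, multiple (segment length, dilation) pairs are mixed so that nearby tokens get dense attention while distant tokens get progressively sparser attention, keeping overall cost near-linear in sequence length.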
DeepNet: Scaling Transformers to 1,000 Layers
2022
Introduces DeepNorm and initialization strategies that stabilize extremely deep transformer training.
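DeepNorm modifies the post-LayerNorm residual connection by up-weighting the residual branch with a depth-dependent constant alpha (the paper also down-scales sublayer weights at initialization by a constant beta, omitted here). A minimal NumPy sketch, assuming the decoder-only setting where alpha = (2N)^(1/4) for an N-layer model:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Plain LayerNorm over the last axis (no learned scale/bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_residual(x: np.ndarray, sublayer_out: np.ndarray,
                      alpha: float) -> np.ndarray:
    """DeepNorm: LN(alpha * x + f(x)) instead of LN(x + f(x))."""
    return layer_norm(alpha * x + sublayer_out)

# For an N-layer decoder-only model the paper sets alpha = (2N)**0.25
N = 1000
alpha = (2 * N) ** 0.25
```

Up-weighting the residual bounds the update to each layer's output, which is what stabilizes training at extreme depths.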