Research & Papers

Mahjax GPU simulator runs 2M steps/second for RL training

JAX-based Riichi Mahjong environment hits 2 million steps per second on 8 A100 GPUs.

Deep Dive

Mahjax is a new GPU-accelerated Riichi Mahjong simulator built entirely in JAX and designed for reinforcement learning (RL) research. Riichi Mahjong is a multiplayer, imperfect-information game with a high-dimensional state space and substantial randomness, which makes it a challenging benchmark for RL algorithms. Previous work relied heavily on supervised learning from human play logs; Mahjax instead enables tabula rasa training, in which agents learn from scratch as in the AlphaZero lineage. The authors (Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, and Masashi Sugiyama) released the environment together with a visualization tool for debugging and for interacting with agents.

Performance is the standout feature: on eight NVIDIA A100 GPUs, Mahjax reaches up to 2 million steps per second under no-red rules and 1 million steps per second under red rules. This throughput comes from JAX's Just-In-Time (JIT) compilation and full vectorization, which let many games run in parallel on the GPU as a single batch and so speed up the rollout phase of RL training loops. The authors' experiments show that agents trained in Mahjax improve their rank against baseline policies, which supports the simulator's usefulness for reinforcement learning.
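
To make the JIT-plus-vectorization idea concrete, here is a minimal JAX sketch of a batched environment. It is not Mahjax's API or its game logic: the functions `init`, `step`, and `rollout`, the state layout, and the toy draw-and-discard rule are all assumptions made for illustration. The only point is the pattern of writing a single-game step function and letting `jax.vmap` and `jax.jit` turn it into a compiled, batched one.

```python
import jax
import jax.numpy as jnp

NUM_TILE_TYPES = 34   # distinct tile types in Riichi Mahjong
WALL_SIZE = 136       # four copies of each tile type


def init(key):
    """Create one toy game state (hypothetical layout, not Mahjax's)."""
    wall = jax.random.permutation(key, jnp.arange(WALL_SIZE) // 4)
    return {
        "wall": wall,                                   # shuffled tile types
        "pos": jnp.int32(0),                            # next draw position
        "hand": jnp.zeros(NUM_TILE_TYPES, jnp.int32),   # tile counts in hand
    }


def step(state, action):
    """Toy rule: draw one tile, then discard tile type `action` if held."""
    tile = state["wall"][state["pos"]]
    hand = state["hand"].at[tile].add(1)
    can_discard = hand[action] > 0
    hand = hand.at[action].add(-jnp.where(can_discard, 1, 0))
    return {"wall": state["wall"], "pos": state["pos"] + 1, "hand": hand}


# vmap maps the single-game functions over a leading batch axis;
# jit compiles the batched versions once for the given shapes.
batched_init = jax.jit(jax.vmap(init))
batched_step = jax.jit(jax.vmap(step))


def rollout(num_envs, num_steps, seed=0):
    """Advance `num_envs` toy games in lockstep under a random policy."""
    key = jax.random.PRNGKey(seed)
    key, init_key = jax.random.split(key)
    states = batched_init(jax.random.split(init_key, num_envs))
    for _ in range(num_steps):
        key, act_key = jax.random.split(key)
        actions = jax.random.randint(act_key, (num_envs,), 0, NUM_TILE_TYPES)
        states = batched_step(states, actions)
    return states


if __name__ == "__main__":
    final = rollout(num_envs=1024, num_steps=20)
    print(final["pos"].shape, final["hand"].shape)
```

Because every game in the batch executes the same compiled step, raising `num_envs` mostly increases the size of the arrays rather than the number of Python-level calls, which is the property a GPU simulator of this kind relies on.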

The paper is posted on arXiv in the AI and Machine Learning categories. By providing a high-throughput, GPU-optimized environment for a complex game that resembles real-world decision problems, Mahjax lowers the barrier for researchers who want to experiment with scalable RL algorithms. The combination of imperfect information, stochastic outcomes, and multi-agent dynamics makes Mahjong a rich testbed for decision-making AI, with methods that could eventually transfer to domains such as finance, logistics, and robotics.

Key Points
  • Mahjax achieves up to 2 million steps/second under no-red rules and 1 million steps/second under red rules on 8 A100 GPUs
  • Built entirely in JAX for full GPU vectorization and Just-In-Time compilation
  • Enables tabula rasa RL training, with no human gameplay data required

Why It Matters

Accelerates RL research on complex imperfect-information games, which mirror real-world decision-making challenges in areas such as finance and logistics.