Research & Papers

VectraYX-Nano: 42M-parameter Spanish cybersecurity LLM with native tool use

Trained from scratch on a corpus that cost about $25 USD to assemble, this nano model reaches sub-second time to first token (TTFT) on commodity hardware.

Deep Dive

Researchers from academia have released VectraYX-Nano, a 41.95M-parameter, decoder-only, Spanish-language model trained from scratch for cybersecurity, with a specific Latin-American focus. The model is built on a 170M-token corpus called VectraYX-Sec-ES, assembled from publicly available sources (including NVD, CVE mirrors, ExploitDB, HackTricks, and OWASP) using an eight-VM pipeline that cost only ~$25 USD. The architecture is a compact Transformer decoder with grouped-query attention (GQA), QK-Norm, RMSNorm, SwiGLU, and RoPE, trained with an auxiliary z-loss and paired with a 16,384-token byte-fallback BPE tokenizer.
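Of the components listed above, z-loss is probably the least self-explanatory. In its usual formulation it adds a small penalty on the squared log-partition function of the softmax, which discourages the logits from drifting to large magnitudes. The sketch below illustrates that general idea for a single token position in plain Python. The coefficient `z_coef` and the function names are assumptions for illustration and are not taken from the VectraYX-Nano release.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_coef=1e-4):
    """Return (total, ce, z) for one token position.

    z_coef is an assumed value chosen for illustration only.
    """
    log_z = log_sum_exp(logits)      # log of the softmax normalizer Z
    ce = log_z - logits[target]      # standard softmax cross-entropy
    z = z_coef * log_z ** 2          # grows as log Z moves away from 0
    return ce + z, ce, z
```

Adding the same constant to every logit leaves the cross-entropy term unchanged but changes the z term, which is why the penalty constrains the overall logit scale rather than the predicted distribution.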

A key design choice is curriculum learning with a replay buffer. Pre-training proceeds through three phases (conversational, cybersecurity, offensive tooling), followed by supervised fine-tuning (SFT) on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, and the loss descends monotonically from 9.80 to 2.16. The model scores 0.78 ± 0.05 on the conversational gate. Native tool invocation is implemented via the Model Context Protocol (MCP), and a LoRA study indicates that the floor on the B4 tool-selection benchmark is a corpus-density artifact rather than a capacity limit: a tool-dense corpus (2,801 examples) raises B4 to 0.145 on the nano model and 0.445 on a 260M mid-tier model. The final GGUF artifact is 81 MB (F16) and reaches sub-second TTFT on commodity hardware. The authors release corpus recipes, training scripts, weights, and the B1-B5 benchmark, marking what they claim is the first Spanish-native cybersecurity LLM with end-to-end MCP integration.
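To make the curriculum-with-replay idea concrete, here is a minimal sketch of a batch sampler that walks through phases in order and mixes a fraction of examples from already completed phases into each batch. The replay fraction, the sampling scheme, and all names are assumptions for illustration; the summary above does not describe how the authors' replay buffer is actually configured.

```python
import random


def curriculum_batches(phases, steps_per_phase, batch_size,
                       replay_frac=0.2, seed=0):
    """Yield (phase_name, batch) pairs in curriculum order.

    phases: list of (name, examples) tuples, visited in the given order.
    replay_frac: assumed share of each batch drawn from earlier phases.
    """
    rng = random.Random(seed)
    replay = []  # examples from phases that have already finished
    for name, examples in phases:
        for _ in range(steps_per_phase):
            # No replay is possible during the first phase.
            n_replay = round(batch_size * replay_frac) if replay else 0
            batch = rng.choices(replay, k=n_replay) if n_replay else []
            batch += rng.choices(examples, k=batch_size - n_replay)
            yield name, batch
        replay.extend(examples)  # make this phase available for replay


phases = [
    ("conversational", ["c1", "c2"]),
    ("cybersecurity", ["s1", "s2"]),
    ("offensive tooling", ["t1", "t2"]),
]
batches = list(curriculum_batches(phases, steps_per_phase=2, batch_size=5))
```

Replaying earlier material is a common way to limit forgetting when the data distribution shifts between phases, which is one plausible reading of why the reported loss curve stays monotonic.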

Key Points
  • Trained from scratch on a 170M-token Spanish cybersecurity corpus built for ~$25 USD, covering NVD, CVE, ExploitDB, HackTricks, and OWASP.
  • Claimed by the authors to be the first Spanish-native cybersecurity LLM with native tool invocation via the Model Context Protocol (MCP); the 81 MB F16 GGUF artifact reaches sub-second TTFT on commodity hardware.
  • Curriculum learning with replay buffer yields monotonic loss descent (9.80→2.16); LoRA study shows tool-selection floor is a corpus-density artifact, not a capacity limit.
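
For readers unfamiliar with MCP, a tool invocation is a JSON-RPC 2.0 request using the `tools/call` method, with the tool name and its arguments carried in `params`. The sketch below only builds such a message. The tool name `cve_lookup` and its `cve_id` argument are hypothetical, and how VectraYX-Nano itself emits or parses tool calls is not described in this summary.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and argument names, used only for illustration.
message = build_tool_call(1, "cve_lookup", {"cve_id": "CVE-2021-44228"})
wire = json.dumps(message)  # string sent to the MCP server over the transport
```

In this framing, tool selection amounts to choosing the right `name` for a given prompt, so a model that rarely sees such traces during training has little signal to learn from.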

Why It Matters

If the authors' first-of-its-kind claim holds, an openly released Spanish cybersecurity LLM with native tool use could put automated threat analysis within reach of Latin American security teams working on cheap hardware.