What I Learned About Small Language Models
How It Started
In my first semester at UFMG's Computer Science program, I joined FutureLab — a research lab inside the DCC department — as an Undergraduate Researcher. The lab was running a partnership with Morada AI, ranked Brazil's #1 AI startup in the Top 100 Startups 2025.
My focus: LLM optimization. Making language models faster, cheaper, and more capable through techniques like QLoRA, PEFT, Knowledge Distillation, and Fine-Tuning.
Then I read an Nvidia article titled "Small Language Models are the Future of Agentic AI" — and something clicked.
The Realization
The dominant narrative in AI has been about scale. Bigger models, more parameters, more compute. GPT-4, Gemini Ultra, Claude — these are impressive, but they come with a real cost: energy, latency, accessibility, and dependency on cloud infrastructure.
The Nvidia article made a different argument: the future of AI agents — systems that reason, plan, and act autonomously — isn't necessarily about the biggest model. It's about the right model for the task, running where it needs to run.
SLMs (Small Language Models) — models roughly in the 1B to 7B parameter range — are becoming surprisingly capable, especially when fine-tuned on a specific domain. A well-tuned 3B model can match or beat much larger general-purpose models on narrow medical tasks, and a 1.5B model fine-tuned on financial data can classify transactions faster, and often more accurately, than a generic 70B model.
The implications are significant: models that run on a laptop, a phone, or an edge device. No API costs. No data leaving the device. No internet dependency.
What I Did With That Insight
I didn't just read about it — I started applying it.
At FutureLab, I worked on fine-tuning pipelines using QLoRA (Quantized Low-Rank Adaptation), a technique that lets you fine-tune a large model in a fraction of the memory: the base weights are quantized to 4-bit precision and frozen, and only small low-rank adapter matrices are trained on top of them.
The workflow:
- Start with a base model (Qwen, LLaMA, Mistral)
- Quantize it to 4-bit precision to fit in GPU memory
- Train LoRA adapters on domain-specific data
- Merge adapters back into the base model or use them at inference time
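The memory saving in the quantization step is easy to see with back-of-the-envelope arithmetic, and a toy absmax quantizer shows the mechanism. This is a deliberate simplification — real 4-bit schemes like NF4 quantize block-wise with non-uniform levels — but the shape of the idea is the same:

```python
import numpy as np

def quantize_absmax_4bit(x):
    """Map floats to signed 4-bit integers in [-7, 7] (toy uniform scheme)."""
    scale = np.abs(x).max() / 7
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 0.9], dtype=np.float32)
q, scale = quantize_absmax_4bit(x)
x_hat = dequantize(q, scale)  # lossy round-trip, error bounded by the quantization step

# Why this matters for a 7B base model: fp16 vs 4-bit storage of the weights
gb_fp16 = 7e9 * 2 / 1e9    # 2 bytes per weight -> 14 GB
gb_4bit = 7e9 * 0.5 / 1e9  # 0.5 bytes per weight -> 3.5 GB
print(q, np.round(x_hat, 3), gb_fp16, gb_4bit)
```

That 4x reduction is what lets a 7B base model sit in a single consumer GPU's memory while the adapters train.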
The result: a model that behaves like a domain expert, at a fraction of the cost of full fine-tuning.
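The "fraction of the cost" claim follows directly from the shapes involved. In plain NumPy: instead of updating a full d x d weight matrix W, LoRA learns two small matrices A and B whose product is a low-rank update, scaled by alpha/r and added to the frozen base at merge time. The dimensions, rank, and scaling factor below are illustrative values, not the lab's actual config:

```python
import numpy as np

d, r = 4096, 16  # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight (quantized in real QLoRA)
A = rng.standard_normal((r, d)) * 0.01  # trainable adapter, r x d
B = np.zeros((d, r))                    # trainable adapter, d x r (zero-initialized,
                                        # so training starts from the base model's behavior)
alpha = 32                              # LoRA scaling factor

# "Merge adapters back into the base model":
W_merged = W + (alpha / r) * (B @ A)

# Trainable parameters: two thin matrices instead of the full d x d weight
full_params = d * d
lora_params = A.size + B.size
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.4%}")
```

For this single layer, the adapters are under 1% of the full weight's parameter count — which is why the optimizer state and gradients stay tiny even when the base model is large.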
I took this directly into Junto — building a QLoRA-finetuned model specialized in Brazilian personal finance, running entirely on-device via MLX-Swift. The same research that was theoretical at FutureLab became a production feature in an iOS app.
The Architecture Shift: MoE
The release of Kimi K2 using a Mixture of Experts (MoE) architecture reinforced something I had been thinking about: the efficiency gains in AI aren't just coming from smaller models — they're coming from smarter architectures.
MoE models route each token through only a small subset of the model's parameters — a learned gate picks a few "expert" sub-networks per token — so you get the capacity of a large model with the inference cost of a small one. Combined with quantization and fine-tuning techniques, this points toward a future where capable AI is genuinely cheap to run.
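The routing idea fits in a few lines of NumPy: a gate scores every expert for the incoming token, and only the top-k experts actually execute. The sizes, expert count, and k below are illustrative; production MoE layers like Kimi K2's add load-balancing losses and operate on batched tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2  # hidden size, total experts, experts active per token

W_gate = rng.standard_normal((d, n_experts)) * 0.1
experts = [rng.standard_normal((d, d)) * 0.05 for _ in range(n_experts)]  # toy expert FFNs

def moe_forward(x):
    logits = x @ W_gate            # one gate score per expert
    top = np.argsort(logits)[-k:]  # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()       # softmax over only the selected experts
    # Only k of n_experts weight matrices are touched for this token:
    # large total capacity, small per-token inference cost.
    return sum(w * (x @ experts[i]) for i, w in zip(top, weights))

x = rng.standard_normal(d)  # one token's hidden state
y = moe_forward(x)
print(y.shape, f"{k}/{n_experts} experts active")
```

Here only 2 of 8 experts run per token, so roughly a quarter of the layer's parameters are exercised on any given forward pass — the "capacity of a large model, cost of a small one" trade in miniature.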
Why This Matters
Routing a simple question through a trillion-parameter model is a failure of architecture, not a feature. The energy cost, the latency, the infrastructure dependency — none of it is necessary for most real-world tasks.
SLMs done right — fine-tuned, quantized, and deployed at the edge — are more private, more efficient, more accessible, and in many domains, more accurate than generic large models.
I think we're at the beginning of a fundamental shift: from AI that lives in the cloud and costs money per query, to AI that lives on your device and costs nothing to run.
That's the bet I'm making with Junto. And the more I research, the more I think it's the right one.
Gustavo Barra Felizardo
CS Student at UFMG · Researcher @ FutureLab · Founder of Solitus & Junto