What I Learned About Small Language Models
How It Started
In my first semester at UFMG's Computer Science program, I joined FutureLab — a research lab inside the DCC department — as an Undergraduate Researcher. The lab was running a partnership with Morada AI, ranked Brazil's #1 AI startup in the Top 100 Startups 2025.
My focus: LLM optimization. Making language models faster, cheaper, and more capable through techniques like QLoRA, PEFT, Knowledge Distillation, and Fine-Tuning.
Then I read an Nvidia article titled "Small Language Models are the Future of Agentic AI" — and something clicked.
The Realization
The dominant narrative in AI has been about scale. Bigger models, more parameters, more compute. GPT-4, Gemini Ultra, Claude — these are impressive, but they come with a real cost: energy, latency, accessibility, and dependency on cloud infrastructure.
The Nvidia article made a different argument: the future of AI agents — systems that reason, plan, and act autonomously — isn't necessarily about the biggest model. It's about the right model for the task, running where it needs to run.
SLMs (Small Language Models) — models roughly in the 1B to 7B parameter range — are becoming surprisingly capable, especially when fine-tuned on a specific domain. A well-tuned 3B model can match or beat much larger general-purpose models on narrow medical tasks, and a 1.5B model fine-tuned on financial data can classify transactions faster, and often more accurately, than a generic 70B model.
The implications are significant: models that run on a laptop, a phone, or an edge device. No API costs. No data leaving the device. No internet dependency.
What I Did With That Insight
I didn't just read about it — I started applying it.
At FutureLab, I worked on fine-tuning pipelines using QLoRA (Quantized Low-Rank Adaptation), a technique that lets you fine-tune a large model in a fraction of the memory: the base weights are quantized to 4-bit precision and frozen, and only small low-rank adapter matrices are trained on top of them.
The workflow:
- Start with a base model (Qwen, LLaMA, Mistral)
- Quantize it to 4-bit precision to fit in GPU memory
- Train LoRA adapters on domain-specific data
- Merge adapters back into the base model or use them at inference time
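The memory saving in the quantization step is easy to see with back-of-the-envelope arithmetic, and a toy absmax quantizer shows the mechanism. This is a deliberate simplification — real 4-bit schemes like NF4 quantize block-wise with non-uniform levels — but the shape of the idea is the same:

```python
import numpy as np

def quantize_absmax_4bit(x):
    """Map floats to signed 4-bit integers in [-7, 7] (toy uniform scheme)."""
    scale = np.abs(x).max() / 7
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 0.9], dtype=np.float32)
q, scale = quantize_absmax_4bit(x)
x_hat = dequantize(q, scale)  # lossy round-trip, error bounded by the quantization step

# Why this matters for a 7B base model: fp16 vs 4-bit storage of the weights
gb_fp16 = 7e9 * 2 / 1e9    # 2 bytes per weight -> 14 GB
gb_4bit = 7e9 * 0.5 / 1e9  # 0.5 bytes per weight -> 3.5 GB
print(q, np.round(x_hat, 3), gb_fp16, gb_4bit)
```

That 4x reduction is what lets a 7B base model sit in a single consumer GPU's memory while the adapters train.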
The result: a model that behaves like a domain expert, at a fraction of the cost of full fine-tuning.
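The "fraction of the cost" claim follows directly from the shapes involved. In plain NumPy: instead of updating a full d x d weight matrix W, LoRA learns two small matrices A and B whose product is a low-rank update, scaled by alpha/r and added to the frozen base at merge time. The dimensions, rank, and scaling factor below are illustrative values, not the lab's actual config:

```python
import numpy as np

d, r = 4096, 16  # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight (quantized in real QLoRA)
A = rng.standard_normal((r, d)) * 0.01  # trainable adapter, r x d
B = np.zeros((d, r))                    # trainable adapter, d x r (zero-initialized,
                                        # so training starts from the base model's behavior)
alpha = 32                              # LoRA scaling factor

# "Merge adapters back into the base model":
W_merged = W + (alpha / r) * (B @ A)

# Trainable parameters: two thin matrices instead of the full d x d weight
full_params = d * d
lora_params = A.size + B.size
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.4%}")
```

For this single layer, the adapters are under 1% of the full weight's parameter count — which is why the optimizer state and gradients stay tiny even when the base model is large.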
I took this directly into Junto — building a QLoRA-finetuned model specialized in Brazilian personal finance, running entirely on-device via MLX-Swift. The same research that was theoretical at FutureLab became a production feature in an iOS app.
The Architecture Shift: MoE
The release of Kimi K2 using a Mixture of Experts (MoE) architecture reinforced something I had been thinking about: the efficiency gains in AI aren't just coming from smaller models — they're coming from smarter architectures.
MoE models route each token through only a small subset of the model's parameters — a learned gate picks a few "expert" sub-networks per token — so you get the capacity of a large model with the inference cost of a small one. Combined with quantization and fine-tuning techniques, this points toward a future where capable AI is genuinely cheap to run.
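The routing idea fits in a few lines of NumPy: a gate scores every expert for the incoming token, and only the top-k experts actually execute. The sizes, expert count, and k below are illustrative; production MoE layers like Kimi K2's add load-balancing losses and operate on batched tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2  # hidden size, total experts, experts active per token

W_gate = rng.standard_normal((d, n_experts)) * 0.1
experts = [rng.standard_normal((d, d)) * 0.05 for _ in range(n_experts)]  # toy expert FFNs

def moe_forward(x):
    logits = x @ W_gate            # one gate score per expert
    top = np.argsort(logits)[-k:]  # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()       # softmax over only the selected experts
    # Only k of n_experts weight matrices are touched for this token:
    # large total capacity, small per-token inference cost.
    return sum(w * (x @ experts[i]) for i, w in zip(top, weights))

x = rng.standard_normal(d)  # one token's hidden state
y = moe_forward(x)
print(y.shape, f"{k}/{n_experts} experts active")
```

Here only 2 of 8 experts run per token, so roughly a quarter of the layer's parameters are exercised on any given forward pass — the "capacity of a large model, cost of a small one" trade in miniature.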
Why This Matters
Routing a simple question through a trillion-parameter model is a failure of architecture, not a feature. The energy cost, the latency, the infrastructure dependency — none of it is necessary for most real-world tasks.
SLMs done right — fine-tuned, quantized, and deployed at the edge — are more private, more efficient, more accessible, and in many domains, more accurate than generic large models.
I think we're at the beginning of a fundamental shift: from AI that lives in the cloud and costs money per query, to AI that lives on your device and costs nothing to run.
That's the bet I'm making with Junto. And the more I research, the more I think it's the right one.
Gustavo Barra Felizardo
CS Student at UFMG · Researcher @ FutureLab · Founder of Solitus & Junto