/posts/mlx-lm-local-llms-apple-silicon
MLX-LM: Five Commands for Local LLMs on Apple Silicon
TL;DR
MLX trades GPU/CPU memory separation for a unified address space, eliminating copies that kill throughput on consumer hardware. The mlx-lm CLI wraps this into generate, convert, lora, fuse, and manage -- five operations covering the full local LLM workflow from download to fine-tuned inference.
Thoughts
MLX is Apple's array framework for machine learning on Apple silicon. The central idea is unified memory: the CPU and GPU share the same physical address space, so data never copies between them. On a MacBook with 32GB of RAM, you can load a 7B parameter model quantized to 4-bit (~4GB), run inference, and fine-tune with LoRA -- all without the memory bottleneck that forces datacenter ML onto expensive, segmented hardware. The mlx-lm package wraps this into a CLI with five commands: generate, convert, lora, fuse, and manage. Add server for an OpenAI-compatible API endpoint. Everything else -- gradients, vectorization, compilation, neural network layers -- lives in the core mlx library and follows NumPy/PyTorch conventions closely enough that the learning curve is shallow.
Apple Silicon Has One Memory Pool
Every major ML framework assumes a hard wall between CPU memory and GPU memory. PyTorch moves tensors explicitly with .to("cuda"). The CUDA runtime manages pinned buffers and DMA transfers. On datacenter hardware with PCIe bandwidth of 32-64 GB/s, that wall is survivable. On Apple silicon, the wall does not exist -- the M-series chips use a single DRAM pool shared by the CPU cores, the GPU cores, and the Neural Engine (ANE). MLX is built around that physical fact. When you create an mx.array, it lives in unified memory and any compute unit can operate on it without a copy.
The consequence is that memory you see in Activity Monitor is the budget for your model, your activations, and your OS simultaneously. A MacBook Pro M3 Max with 128GB is a serious inference machine for models up to 70B parameters. A MacBook Air M2 with 8GB runs 3B models comfortably and 7B models slowly.
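What the no-copy model looks like in code, as a minimal sketch: the same arrays can be handed to either compute unit by picking a stream per operation, with no transfer step in between. The stream keyword and the mx.cpu / mx.gpu devices come from mlx.core; the shapes here are arbitrary.
import mlx.core as mx
# Both arrays live in unified memory; there is no .to("cuda") equivalent.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))
# Pick a compute unit per op via the stream argument -- no copy in between.
c_gpu = mx.matmul(a, b, stream=mx.gpu)
c_cpu = mx.add(a, b, stream=mx.cpu)
mx.eval(c_gpu, c_cpu)  # materialize both results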
The second bet is lazy evaluation. MLX does not execute operations immediately. It builds a computation graph and materializes results only when you call mx.eval(). This lets the runtime fuse operations, eliminate redundant passes, and schedule compute across the available engines. The tradeoff: bugs that would surface immediately in eager frameworks show up at mx.eval() instead.
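A minimal illustration of the lazy model -- nothing below executes until mx.eval() (or an implicit force like print or .item()) asks for a result:
import mlx.core as mx
x = mx.ones((1024, 1024))
# y and z are graph nodes at this point; no kernels have run.
y = x @ x
z = (y + 1).sum()
mx.eval(z)        # explicit materialization: the fused graph runs here
print(z.item())   # .item() and print also force evaluation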
Five CLI Commands, Two Python Imports
- mlx.core -- the array primitive and all transforms (grad, vmap, compile)
- mlx.nn -- modules, layers, losses; follows PyTorch's nn.Module pattern
- mlx.optimizers -- SGD, Adam, and variants; stateful, updated via optimizer.update(model, grads)
- mlx-lm -- CLI and Python API for language models: generate, convert, fine-tune, fuse, serve (Python example after this list)
- Swift bindings (mlx-swift) -- same primitives accessible from Swift/Xcode for on-device apps
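The mlx-lm piece is importable too. A minimal sketch of the Python generate path using its load/generate helpers -- the model name is one of the pre-converted mlx-community uploads, and max_tokens is the only generation knob shown; treat the exact keyword set as version-dependent:
from mlx_lm import load, generate
# Downloads into the Hugging Face cache (or reuses it) and loads the weights.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
prompt = "Explain unified memory in one sentence."
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
print(text)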
The function transform API is the part that surprises PyTorch users:
import mlx.core as mx
def f(x):
    return mx.sum(x ** 2)
# First-order gradient
grad_f = mx.grad(f)
# Vectorize over a batch dimension
vector_f = mx.vmap(f)
# Compose: vectorized gradient
grad_vector_f = mx.grad(mx.vmap(f))
x = mx.array([1.0, 2.0, 3.0])
mx.eval(grad_f(x)) # [2.0, 4.0, 6.0]
mx.grad returns a function, not a tape object. mx.vmap returns a function. You compose them. This matches JAX's transform model more than PyTorch's autograd.
Compilation is opt-in and transparent:
@mx.compile
def optimized_fn(x):
    return mx.sum(x ** 2)
@mx.compile traces the function on first call and caches the compiled graph. Subsequent calls skip re-tracing unless the shape changes -- and because MLX supports dynamic shapes, shape changes do not invalidate the cache catastrophically the way they do in some static-graph frameworks.
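A quick way to see the trace-and-cache behavior is to time the first call against a repeat call with the same input shape -- a rough sketch, and absolute numbers vary by machine:
import time
import mlx.core as mx
@mx.compile
def optimized_fn(x):
    return mx.sum(x ** 2)
x = mx.random.normal((1024, 1024))
t0 = time.perf_counter()
mx.eval(optimized_fn(x))   # first call: trace + compile + execute
t1 = time.perf_counter()
mx.eval(optimized_fn(x))   # same shape: cached graph, no re-trace
t2 = time.perf_counter()
print(f"first call:  {t1 - t0:.4f}s")
print(f"cached call: {t2 - t1:.4f}s")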
No Copy Between CPU and GPU
In a standard PyTorch + CUDA workflow, the forward pass moves data from CPU to GPU, computes activations in GPU memory, and moves gradients back for the optimizer step. For large models, that round-trip is measured in seconds per batch. MLX eliminates it. The optimizer runs in the same address space as the forward pass.
The practical effect for fine-tuning: LoRA on a 7B model with mlx-lm.lora uses 8-12GB of unified memory and trains at 300-500 tokens/second on an M2 Pro. The equivalent CUDA workflow on a single consumer GPU (RTX 4090, 24GB VRAM) runs at similar speeds but requires the model to fit entirely in VRAM -- there is no overflow to system RAM. The MLX model has no separate VRAM boundary to overflow; it simply consumes more of the unified pool, at the cost of bandwidth.
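mlx_lm.lora reads its --data folder as JSONL files. Here is a minimal sketch that writes a toy train/valid split in the plain {"text": ...} format -- the file names and field follow the mlx-lm LoRA examples, but check the data documentation of your mlx-lm version for the formats it accepts, and note that a real fine-tune needs far more than two examples:
import json
from pathlib import Path
# Toy dataset; each JSONL line is one training example.
examples = [
    {"text": "Q: What is unified memory?\nA: CPU and GPU share one address space."},
    {"text": "Q: What does mx.eval do?\nA: It materializes the lazy computation graph."},
]
data_dir = Path("my_data_folder")
data_dir.mkdir(exist_ok=True)
for name in ("train.jsonl", "valid.jsonl"):
    with open(data_dir / name, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")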
Generate, Convert, LoRA, Fuse, Manage
- generate -- run inference against a local or Hugging Face model; supports streaming, temperature, top-p, and adapter paths for fine-tuned models
- convert -- download a Hugging Face model, convert weights to MLX format, optionally quantize to 4-bit or 8-bit; can upload the result back to Hugging Face
- lora -- fine-tune a model using LoRA or QLoRA; writes adapter weights to a checkpoint directory; a separate --test flag evaluates a fine-tuned adapter without training
- fuse -- merge LoRA adapter weights into the base model weights, producing a standalone model; can export to GGUF format for use with llama.cpp or Ollama
- manage -- scan the local model cache and delete by name pattern; the cache lives in ~/.cache/huggingface/hub/ by default
CLI examples: generate, convert, fine-tune, fuse, serve
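# Run inference (model name is an example from the mlx-community mirror)
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain unified memory in one sentence."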
# Convert and quantize a model to 4-bit
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
# Fine-tune with LoRA
mlx_lm.lora \
--model mistralai/Mistral-7B-v0.1 \
--train \
--data ./my_data_folder \
--batch-size 1 \
--num-layers 4 \
--iters 500
# Fuse adapters and export to GGUF
mlx_lm.fuse \
--model ./path/to/model \
--adapter-path ./adapters \
--save-path ./fused_model \
--export-gguf
# Run OpenAI-compatible server
mlx_lm.server
curl localhost:8080/v1/chat/completions -d '{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"max_completion_tokens": 2000,
"messages": [{"role": "user", "content": "Hello"}]
}'
The server command is not listed as one of the five because it is a wrapper around generate exposed over HTTP -- same parameters, same model loading path.
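Because the endpoint speaks the OpenAI wire format, the standard openai Python client can point at it. A sketch, assuming the server is running on the default port 8080; the API key is a placeholder the local server ignores, and the model name is an example:
from openai import OpenAI
# Point the client at the local mlx_lm.server endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)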
Past the CLI: Direct mx.array Operations
The mlx.nn module follows PyTorch's nn.Module pattern. Subclass it, define __call__, compose layers:
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
class MLP(nn.Module):
    def __init__(self, in_dims, hidden_dims, out_dims):
        super().__init__()
        self.layers = [
            nn.Linear(in_dims, hidden_dims),
            nn.Linear(hidden_dims, out_dims),
        ]

    def __call__(self, x):
        for layer in self.layers[:-1]:
            x = mx.maximum(layer(x), 0)  # ReLU
        return self.layers[-1](x)

model = MLP(10, 128, 1)
mx.eval(model.parameters())  # force lazy parameter initialization

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

loss_and_grad_fn = nn.value_and_grad(model, loss_fn)
optimizer = optim.Adam(learning_rate=0.001)

for x_batch, y_batch in data_loader:
    loss, grads = loss_and_grad_fn(model, x_batch, y_batch)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
The mx.eval() call at the end of the training step is the materialization point. Without it, the computation graph grows unboundedly across iterations.
Always call mx.eval() at the end of each training step. Without it, the computation graph grows across iterations and memory climbs until the process is killed.
Available layers: Linear, Conv2d, LayerNorm, Dropout, MultiHeadAttention, Embedding. The attention primitive is used directly by the transformer architectures in mlx-lm.
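A sketch of the attention layer in isolation, wired for causal self-attention -- the (batch, sequence, dims) layout and the create_additive_causal_mask helper are my reading of the mlx.nn API, so verify against the docs for your version:
import mlx.core as mx
import mlx.nn as nn
dims, heads, seq_len = 64, 4, 16
attn = nn.MultiHeadAttention(dims, heads)
x = mx.random.normal((1, seq_len, dims))  # (batch, sequence, features)
mask = nn.MultiHeadAttention.create_additive_causal_mask(seq_len)
# Self-attention: queries, keys, and values are the same tensor.
out = attn(x, x, x, mask=mask)
mx.eval(out)
print(out.shape)  # (1, 16, 64)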
No Multi-GPU, No QAT, Lazy Eval Hides Errors
Lazy evaluation means shape mismatches and type errors surface at mx.eval(), not at the line that caused them. Print intermediate shapes during development.
- No multi-GPU distribution. MLX targets a single unified-memory chip. For multi-node training, you need a different framework.
- No Windows or Linux support for GPU acceleration. The Metal backend is macOS-only. MLX runs on Linux CPU-only, which defeats the purpose.
- No gradient checkpointing in the standard API. For very long sequences, activation memory can grow large; the escape hatch is manual mx.eval() calls mid-graph to force materialization and release intermediate activations (see the sketch after this list).
- Lazy evaluation hides errors. A shape mismatch in a complex graph may not surface until mx.eval() is called, making debugging harder than in eager frameworks.
- The model zoo is smaller than PyTorch's. Most Hugging Face models need conversion via mlx_lm.convert. Models with unusual architectures may fail conversion.
- Quantization is post-training only. There is no quantization-aware training in the standard mlx-lm workflow.
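The sketch referenced in the gradient-checkpointing bullet: chunk the work and force evaluation per chunk so earlier intermediates can be freed. The layer, chunk size, and shapes are placeholders, and this trades peak memory for extra synchronization rather than being a true checkpointing scheme:
import mlx.core as mx
import mlx.nn as nn
layer = nn.Linear(512, 512)
long_input = mx.random.normal((32768, 512))  # a long sequence
outputs = []
for start in range(0, long_input.shape[0], 4096):
    chunk = layer(long_input[start:start + 4096])
    mx.eval(chunk)  # materialize now so earlier graph nodes can be released
    outputs.append(chunk)
result = mx.concatenate(outputs, axis=0)
mx.eval(result)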
MLX, llama.cpp, PyTorch MPS, Ollama
| Axis | MLX | llama.cpp | PyTorch (MPS) | Ollama |
|---|---|---|---|---|
| Throughput (7B, 4-bit, M2 Pro) | 300-500 tok/s | 80-120 tok/s | 60-100 tok/s | 80-120 tok/s (llama.cpp backend) |
| Fine-tuning on-device | LoRA/QLoRA native | Not supported | Possible, slow | Not supported |
| Model formats | MLX native, HF auto-convert | GGUF | HF native | GGUF via llama.cpp |
| API surface | Python + CLI + OpenAI server | C/C++ + bindings | Python | REST API only |
| Debugging | Lazy eval complicates | C stack traces | Eager, debuggable | Abstracted away |
| When it wins | Training + fast inference on Apple silicon | Max compatibility, GGUF ecosystem | Existing PyTorch code, research | Zero-config chat UI |
If you only need inference and you want broad model compatibility without conversion steps, llama.cpp or Ollama is the right choice. MLX wins when you need fine-tuning or when raw inference speed on Apple silicon is the constraint.
mlx-community, LM Studio, Jan.ai
- mlx-community on Hugging Face -- a community mirror that pre-converts popular models to MLX format; Llama 3, Mistral, Phi, Gemma, Qwen variants all available in 4-bit quantized form; models download directly with mlx_lm.generate --model mlx-community/<model-name>
- LM Studio -- the macOS GUI for local LLMs uses MLX as its backend on Apple silicon; the speed difference vs GGUF backends is visible in the tokens-per-second display
- mlx-swift -- the Swift binding allows embedding MLX inference directly in iOS/macOS apps; the same unified memory advantage applies; Apple's on-device model APIs (Core ML) are the alternative with a smaller model ecosystem
- Jan.ai -- another local LLM frontend that added MLX backend support in 2024; benchmarks on M3 Max show ~2x throughput vs llama.cpp GGUF for the same model
The install footprint is two packages:
pip install mlx # Core array framework
pip install mlx-lm # LLM CLI and Python API
Swift integration via Package.swift:
dependencies: [
.package(url: "https://github.com/ml-explore/mlx-swift", from: "0.10.0")
]
Reference: MLX docs -- mlx-lm repo -- mlx-community models