
A Taskfile for MLX-LM Workflows

TL;DR

Taskfile turns the mlx-lm CLI surface into a reproducible local workflow: one file, consistent variable defaults, and composable pipeline tasks that chain from data prep through LoRA fine-tuning to GGUF export.


Running MLX-LM on Apple Silicon is mostly a CLI game -- a handful of mlx_lm.* subcommands covering download, generate, convert, fine-tune, and fuse. The problem is that each invocation carries five or six flags, and the defaults you settled on for temperature, quantization bits, and adapter paths live only in your shell history. A Taskfile wraps that entire surface into a single declarative file: named tasks, shared variable defaults, and pipeline tasks that chain operations end-to-end. Once it's in place, task generate and task complete-finetune do what they say, every time, with the same parameters.


What MLX-LM Gives You

MLX-LM is Apple's library for running and fine-tuning language models on top of MLX, the array framework built for Apple Silicon's Metal compute stack. It ships as a Python package (mlx-lm) that installs a set of CLI entry points:

  • mlx_lm.generate -- inference with configurable temperature, top-p, max tokens, and optional streaming
  • mlx_lm.convert -- converts HuggingFace weights to MLX format, with optional quantization and direct HF upload
  • mlx_lm.lora -- LoRA, DoRA, and full fine-tuning on local JSONL data
  • mlx_lm.fuse -- merges LoRA adapters back into base weights, with optional GGUF export and HF upload
  • mlx_lm.chat -- interactive chat REPL against any local model
  • mlx_lm.server -- OpenAI-compatible HTTP server

Each command has its own flag set. None of them share state. If you want to run inference against the model you just fine-tuned with the adapter path you used during training, you have to pass the same paths twice. A Taskfile solves this by declaring variables once at the top and threading them through every task that needs them.

Variables That Carry Across the Whole File

The Taskfile declares its defaults in a top-level vars block. These are not environment variables -- they live in the Taskfile and can be overridden per invocation with task <name> VAR=value.

vars:
  DEFAULT_MODEL: mistralai/Mistral-7B-Instruct-v0.3
  DEFAULT_QUANT_MODEL: mlx-community/Llama-3.2-3B-Instruct-4bit
  OUTPUT_DIR: ./outputs
  DATA_DIR: ./data
  ADAPTERS_DIR: ./adapters
  MAX_TOKENS: 500
  TEMP: 0.7
  TOP_P: 0.9
  BATCH_SIZE: 1
  LORA_LAYERS: 4
  ITERS: 1000
  HF_USERNAME: your-username

DEFAULT_MODEL is the full HuggingFace model ID used for raw inference and conversion. DEFAULT_QUANT_MODEL is a pre-quantized 4-bit model used in the quickstart pipeline, where you want something that fits in 8 GB of unified memory without a conversion step. ADAPTERS_DIR appears in every fine-tuning task and the fuse task -- change it once and it propagates everywhere.

The env block enables the HuggingFace transfer accelerator (hf_transfer), a Rust-based downloader that replaces the default Python download path and meaningfully speeds up large model downloads. The hf_transfer package has to be installed for the flag to take effect:

env:
  HF_HUB_ENABLE_HF_TRANSFER: 1

The Task Surface

Tasks are organized into five functional groups.

Environment setup handles installation and repo cloning. task install runs pip install mlx mlx-lm. task install-dev installs the development build of mlx in editable mode. task clone-repos clones all three MLX repositories from GitHub.
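
The setup tasks are the simplest in the file. A minimal sketch of the first two, assuming the development build comes from a local mlx checkout at ./mlx:

install:
  desc: Install MLX and MLX-LM from PyPI
  cmds:
    - pip install mlx mlx-lm

install-dev:
  desc: Install a local mlx checkout in editable mode (path is an assumption)
  dir: ./mlx
  cmds:
    - pip install -e .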

Model management wraps mlx_lm.manage. task download-model calls huggingface-cli download and accepts MODEL and LOCAL_DIR overrides. task list-models runs mlx_lm.manage --scan to show all locally cached model directories. task delete-model takes a PATTERN variable and removes matching caches.
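
A sketch of how download-model can thread the MODEL and LOCAL_DIR overrides through to huggingface-cli; the ./models default for LOCAL_DIR is an assumption, and the {{.VAR | default ...}} idiom gives the task a local fallback that still yields to command-line overrides:

download-model:
  desc: Download a model from the HuggingFace Hub
  vars:
    MODEL: '{{.MODEL | default .DEFAULT_MODEL}}'
    LOCAL_DIR: '{{.LOCAL_DIR | default "./models"}}'
  cmds:
    - huggingface-cli download {{.MODEL}} --local-dir {{.LOCAL_DIR}}

Run it as task download-model MODEL=mlx-community/Llama-3.2-3B-Instruct-4bit to pull the quantized model instead of the Mistral default.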

Text generation has four tasks. task generate runs a single inference pass. task generate-stream adds --stream for token-by-token output. task chat opens the interactive REPL. task server starts the OpenAI-compatible API server on the default port (8080). All four respect MODEL, MAX_TOKENS, TEMP, and TOP_P from the vars block or from the command line.
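
The generate task is where the shared variables earn their keep. A sketch, with flag names that should be confirmed against mlx_lm.generate --help for your installed version:

generate:
  desc: Run a single inference pass
  vars:
    MODEL: '{{.MODEL | default .DEFAULT_MODEL}}'
    PROMPT: '{{.PROMPT | default "Hello, how are you?"}}'
  cmds:
    # sampling flag names can differ between mlx-lm versions
    - >
      mlx_lm.generate
      --model {{.MODEL}}
      --prompt "{{.PROMPT}}"
      --max-tokens {{.MAX_TOKENS}}
      --temp {{.TEMP}}
      --top-p {{.TOP_P}}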

Model conversion handles the HuggingFace-to-MLX conversion pipeline. task convert is the base task. task convert-quantize calls convert with QUANTIZE=true. task upload-model chains convert, quantize, and HF upload in one shot, using HF_USERNAME and a required REPO_NAME variable.
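
The convert/convert-quantize pair can be expressed as one parameterized task plus a thin wrapper. A sketch, with the output path under OUTPUT_DIR as an assumption:

convert:
  desc: Convert HuggingFace weights to MLX format
  vars:
    MODEL: '{{.MODEL | default .DEFAULT_MODEL}}'
    QUANTIZE: '{{.QUANTIZE | default "false"}}'
  cmds:
    # -q turns on quantization in mlx_lm.convert
    - >
      mlx_lm.convert
      --hf-path {{.MODEL}}
      --mlx-path {{.OUTPUT_DIR}}/mlx_model
      {{if eq .QUANTIZE "true"}}-q{{end}}

convert-quantize:
  desc: Convert with quantization enabled
  cmds:
    - task: convert
      vars:
        QUANTIZE: "true"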

Fine-tuning is the densest group. task finetune-lora runs standard LoRA training. task finetune-dora switches to DoRA (weight-decomposed LoRA, which tends to generalize better on smaller datasets). task finetune-full fine-tunes all weights without adapters, which needs considerably more unified memory but avoids the adapter overhead at inference time. All three share the same MODEL, DATA_DIR, ADAPTERS_DIR, BATCH_SIZE, LORA_LAYERS, and ITERS variables.
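
A sketch of finetune-lora wired to those variables; the exact mlx_lm.lora flag names vary by release (--num-layers was --lora-layers in earlier versions):

finetune-lora:
  desc: LoRA fine-tune on local JSONL data
  vars:
    MODEL: '{{.MODEL | default .DEFAULT_MODEL}}'
  cmds:
    - >
      mlx_lm.lora
      --model {{.MODEL}}
      --train
      --data {{.DATA_DIR}}
      --batch-size {{.BATCH_SIZE}}
      --num-layers {{.LORA_LAYERS}}
      --iters {{.ITERS}}
      --adapter-path {{.ADAPTERS_DIR}}

In recent mlx-lm releases, finetune-dora and finetune-full can reuse the same invocation with --fine-tune-type dora or --fine-tune-type full.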

The Pipeline Tasks

The most useful tasks are the two pipeline composites at the bottom of the file.

task quickstart chains install, download-model (using the 4-bit quantized model), and generate with an explanatory prompt:

quickstart:
  desc: Quick setup and test of MLX
  cmds:
    - task: install
    - task: download-model
      vars:
        MODEL: "{{.DEFAULT_QUANT_MODEL}}"
    - task: generate
      vars:
        MODEL: "{{.DEFAULT_QUANT_MODEL}}"
        PROMPT: "Explain what the MLX framework is in one paragraph."

This is the zero-to-inference path: three chained tasks, one invocation.

task complete-finetune runs the full training pipeline: data directory setup, LoRA training, evaluation on the test split, and adapter fusion into a single model directory:

complete-finetune:
  desc: Complete pipeline for fine-tuning a model
  cmds:
    - task: prepare-data
    - task: finetune-lora
    - task: test-finetune
    - task: fuse-model
    - echo "Fine-tuning complete. Fused model saved to {{.OUTPUT_DIR}}/fused_model"

The echo at the end prints the output path, which matters when you immediately want to run generate against the fused weights.
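
For reference, a sketch of what fuse-model might look like, reusing the adapter and output paths from the vars block (flag names again depend on the mlx-lm release):

fuse-model:
  desc: Merge LoRA adapters into the base weights
  vars:
    MODEL: '{{.MODEL | default .DEFAULT_MODEL}}'
  cmds:
    - >
      mlx_lm.fuse
      --model {{.MODEL}}
      --adapter-path {{.ADAPTERS_DIR}}
      --save-path {{.OUTPUT_DIR}}/fused_model

From there, task generate MODEL=./outputs/fused_model picks up the fused weights while keeping every other default from the vars block.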

The Two Gaps Worth Filling

The Taskfile as written leaves two things unhandled that bite quickly in practice.

First, PROMPT defaults to "Hello, how are you?" -- fine for smoke testing, awkward for anything else. Real workflows keep prompts in files and pass them with --prompt "$(cat prompt.txt)". Add a PROMPT_FILE variable and a generate-from-file task if you run more than a handful of different prompts.
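
One possible shape for that task, with PROMPT_FILE and its ./prompt.txt default as assumptions:

generate-from-file:
  desc: Run inference with a prompt read from a file
  vars:
    MODEL: '{{.MODEL | default .DEFAULT_MODEL}}'
    PROMPT_FILE: '{{.PROMPT_FILE | default "./prompt.txt"}}'
  cmds:
    - >
      mlx_lm.generate
      --model {{.MODEL}}
      --prompt "$(cat {{.PROMPT_FILE}})"
      --max-tokens {{.MAX_TOKENS}}
      --temp {{.TEMP}}
      --top-p {{.TOP_P}}

Invoke it as task generate-from-file PROMPT_FILE=./prompts/summarize.txt (a hypothetical path) and the rest of the generation defaults carry over unchanged.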

Second, the fine-tuning tasks hard-code --batch-size 1. On an M2 Ultra with 192 GB unified memory you can run batch sizes of 4 or 8 without OOM. On an M1 Air with 8 GB, batch size 1 is correct. The current default is safe but slow on larger machines. Override it at the command line -- task finetune-lora BATCH_SIZE=4 -- or set it once in vars when you know your target hardware.

Adapting the File to Your Setup

Three substitutions get you from the generic defaults to your actual environment:

  1. Set HF_USERNAME to your HuggingFace handle if you plan to use upload-model or fuse-upload.
  2. Set DEFAULT_MODEL to the base model you're working with. Mistral-7B is a sensible default, but smaller models such as meta-llama/Llama-3.2-3B-Instruct or microsoft/Phi-3.5-mini-instruct run faster on constrained hardware.
  3. Point DATA_DIR at your JSONL training data. The expected format is one JSON object per line, either {"text": "..."} or the instruction-tuning format {"messages": [...]}, depending on the model's chat template; see the example lines below.
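
For instance, a train.jsonl could contain lines like these (the content is illustrative only, and a dataset should stick to one format rather than mix them); mlx_lm.lora looks for train.jsonl and valid.jsonl, plus test.jsonl for the test split, inside DATA_DIR:

{"text": "MLX is an array framework for machine learning on Apple Silicon."}
{"messages": [{"role": "user", "content": "What is MLX?"}, {"role": "assistant", "content": "MLX is Apple's array framework for machine learning on Apple Silicon."}]}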

Everything else -- output directories, adapter paths, quantization -- flows from the variables you've already set.

The full Taskfile is available as a gist at gary.info.