This is the point where things stop being “I vaguely understand the theory” and become “why is my workflow full of angry red boxes”.
Because even if you already know you want SDXL, Flux, or HiDream, the moment you start downloading files you usually run straight into a wall of unfamiliar labels:
- what is a checkpoint?
- what is CLIP, and what on earth is clip_l?
- what does a VAE do?
- where do LoRAs go?
- why does Flux suddenly want T5XXL as well?
- what is GGUF and why does it come with extra loader nodes?
- what is an FP8 checkpoint, and is it actually simpler?
- why does opening a workflow sometimes feel like detonating a small node grenade?
The goal of this article is to turn that pile into something you can actually work with.
ComfyUI itself is relatively small. The real bulk comes from the models. Installing the app can feel tidy enough. The moment you start collecting checkpoints, VAEs, LoRAs, and text encoders, your storage and memory stop being abstract concepts.
First, understand the four folders you will see all the time
ComfyUI’s own documentation is clear on this: model files usually live under ComfyUI/models/, sorted by type. The four folders beginners run into most often are:
- models/checkpoints/
- models/clip/
- models/loras/
- models/vae/
If you only remember one sentence from this section, make it this:
A checkpoint is the main course, CLIP reads the prompt, a VAE handles image encoding and decoding, and a LoRA adds flavour.
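If you want to see what is already sitting in those four folders, a few lines of Python will do it. This is a minimal sketch, assuming a standard layout under ComfyUI/models/; the function name inventory and the "ComfyUI" path are placeholders for illustration, not part of ComfyUI itself.

```python
from pathlib import Path

# The four folders named above, relative to <ComfyUI>/models/.
COMMON_FOLDERS = ("checkpoints", "clip", "loras", "vae")


def inventory(comfy_root):
    """Map each common model folder to the sorted file names inside it.

    A folder that does not exist yet maps to None so it stands out.
    """
    models_dir = Path(comfy_root) / "models"
    result = {}
    for name in COMMON_FOLDERS:
        folder = models_dir / name
        if folder.is_dir():
            result[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            result[name] = None
    return result


# "ComfyUI" is a placeholder; point it at your own install directory.
for folder, files in inventory("ComfyUI").items():
    print(folder, "->", files)
```

An empty list means the folder exists but holds nothing yet. None means the folder itself is missing, which usually points to a wrong root path.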
What is a checkpoint?
A checkpoint is the main weight package for the core model.
With SD 1.5, SDXL, and many community finetunes, you often download a single .safetensors file, place it in models/checkpoints/, load it with CheckpointLoaderSimple, and you are off.
What is CLIP, and what is clip_l?
CLIP is a text encoder.
Its job is not to draw pixels directly. Its job is to turn your prompt into something the model can actually work with.
With newer workflows, especially Flux-style ones, you will often see separate assets such as:
- clip_l
- t5xxl
- dedicated text encoder nodes
What is a VAE?
VAE stands for Variational Autoencoder.
In practice, it handles image encoding and decoding between the model’s latent representation and the actual image you see.
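To make "latent representation" a little less abstract, here is the arithmetic for how much smaller the latent is than the image. This is a sketch under stated assumptions: an 8x spatial downscale with 4 latent channels is the commonly cited setup for SD 1.5 and SDXL VAEs, and 16 channels for Flux. Treat those defaults as assumptions and check the model card for the VAE you actually use.

```python
def latent_shape(width, height, channels=4, downscale=8):
    """Return (channels, latent_height, latent_width) for a pixel size.

    The defaults (4 channels, 8x downscale) are assumptions that match
    commonly cited SD 1.5 / SDXL VAEs, not values read from any file.
    """
    if width % downscale or height % downscale:
        raise ValueError("width and height must be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)


c, h, w = latent_shape(1024, 1024)
pixel_values = 3 * 1024 * 1024   # RGB image
latent_values = c * h * w        # what the sampler actually works on
print((c, h, w), pixel_values // latent_values)
```

Under these assumptions a 1024x1024 image becomes a 4x128x128 latent, about 48 times fewer values. That is why sampling happens in latent space and the VAE only runs at the edges of the graph.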
What is a LoRA?
A LoRA is a lightweight finetuning weight.
Instead of replacing the whole model, it nudges an existing one in a specific direction.
One table: what goes where?
| Asset type | What it does | Common file type | Usual location | Typical node |
|---|---|---|---|---|
| checkpoint | main model package | .safetensors | models/checkpoints/ | CheckpointLoaderSimple |
| clip / clip_l | text encoder | .safetensors | models/clip/ | CLIP Loader or text encoder nodes |
| T5XXL | large text encoder | .safetensors, GGUF quantised files | models/clip/ or GGUF-related path | T5 / dual text encoder nodes |
| vae | image encoder / decoder | .safetensors | models/vae/ | VAE Loader |
| lora | finetune / style weight | .safetensors | models/loras/ | Load LoRA |
| unet / diffusion model | core generative module | .safetensors, GGUF, etc. | workflow-dependent, often models/unet/ | dedicated loaders |
| custom node | feature extension, not a model | folder / Python package | custom_nodes/ | extra nodes rather than weights |
The basic installation SOP
Step 1: identify what you actually downloaded
Do not assume every .safetensors file belongs in checkpoints/.
The same file extension can represent:
- a checkpoint,
- a VAE,
- a LoRA,
- a CLIP encoder,
- a UNet,
- a T5XXL weight.
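When the file name gives nothing away, the tensor names inside the file often do. A .safetensors file starts with an 8-byte little-endian length followed by a JSON header, so you can list tensor names without loading any weights. The sketch below reads that header; the guess_asset_type heuristic and its key patterns are assumptions based on common community files, so treat "unknown" as "go and read the model page".

```python
import json
import struct


def read_tensor_names(path):
    """Return tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def guess_asset_type(names):
    """Very rough heuristic; the patterns below are assumptions, not a spec."""
    if any("lora" in name.lower() for name in names):
        return "lora"
    if any(name.startswith("model.diffusion_model.") for name in names):
        return "checkpoint"
    return "unknown"
```

For example, guess_asset_type(read_tensor_names("some-file.safetensors")) returns "lora", "checkpoint", or "unknown", where some-file.safetensors is a placeholder name.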
Step 2: read the model page or workflow notes
The best source is usually:
- the official model page,
- a ComfyUI template workflow,
- the workflow author’s README.
Step 3: place the file in the correct folder
The common pattern is:
- checkpoint → models/checkpoints/
- VAE → models/vae/
- LoRA → models/loras/
- CLIP / T5 → models/clip/
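If you would rather script the move than drag files around, the pattern above fits in a small helper. This is a sketch only: install_asset and the DESTINATIONS table are hypothetical names, and the mapping simply mirrors the common pattern listed here. A workflow that asks for a different folder wins.

```python
import shutil
from pathlib import Path

# Asset type -> subfolder of <ComfyUI>/models/, mirroring the list above.
DESTINATIONS = {
    "checkpoint": "checkpoints",
    "vae": "vae",
    "lora": "loras",
    "clip": "clip",
    "t5": "clip",
}


def install_asset(src, comfy_root, asset_type):
    """Move a downloaded file into the matching models/ subfolder."""
    if asset_type not in DESTINATIONS:
        raise ValueError(f"unknown asset type: {asset_type!r}")
    dest_dir = Path(comfy_root) / "models" / DESTINATIONS[asset_type]
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    if dest.exists():
        # Refuse to overwrite an existing model silently.
        raise FileExistsError(dest)
    shutil.move(str(src), str(dest))
    return dest
```

The helper deliberately refuses unknown asset types and existing destinations. Guessing or overwriting is exactly how files end up in the wrong place.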
Step 4: restart or refresh ComfyUI
Sometimes ComfyUI picks up new assets automatically. Sometimes restarting is quicker.
Step 5: validate with a simple or official workflow first
ComfyUI’s troubleshooting docs explicitly recommend starting from template workflows or official examples when trying a new model.
Why do some models feel like one file, while others feel like assembling a machine?
SD 1.5 / SDXL-style workflows
These often look relatively straightforward:
- one checkpoint,
- maybe a VAE,
- maybe a LoRA.
Flux / HiDream-style workflows
These more modern and heavier workflows often involve:
- extra text encoders,
- T5XXL,
- CLIP,
- split UNet or diffusion model components,
- GGUF quantised variants,
- dedicated loader nodes,
- custom nodes.
What is LCM, and what is an LCM adapter?
LCM stands for Latent Consistency Model.
In practice, what many ComfyUI users install is an LCM LoRA or LCM adapter.
The basic idea is:
- you already have SDXL or SD 1.5,
- you add an LCM adapter,
- the workflow can then produce images in fewer steps.
Why do you need an LCM adapter?
Because it does not replace the base model. It gives the base model a lower-step acceleration path.
What is Hugging Face, and why does everything eventually lead there?
Because it is one of the closest things the ML world has to a serious public model warehouse.
People keep sending you to Hugging Face because model pages there usually contain:
- model cards,
- file listings,
- licensing,
- usage notes,
- discussions,
- gated-access notices where relevant,
- and often safetensors versions.
Why is Flux so often a source of pain?
Because the trouble is not only that the main model is large. The whole workflow stack is heavier.
The usual failure modes look like this:
- you are missing a loader node
- you are missing a custom node
- you downloaded GGUF files but the workflow expects a normal loader
- you downloaded safetensors, but the workflow is built around quantised assets
- you are missing T5XXL or CLIP
- the files are in the wrong folder
- the scheduler / sampler path does not match the workflow version
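The first two failure modes can be checked before you even open the graph. The sketch below assumes the workflow was exported in ComfyUI's API-style JSON, where each node is an entry with a class_type field; if your file uses a different layout, the lookup needs adjusting. HypotheticalGGUFLoader is a made-up node name for illustration, and the set of available nodes is something you would supply yourself.

```python
def missing_node_types(workflow, available):
    """Return node class names used by the workflow but not in `available`.

    Assumes API-style workflow JSON: {node_id: {"class_type": ..., "inputs": {...}}}.
    """
    used = {
        node["class_type"]
        for node in workflow.values()
        if isinstance(node, dict) and "class_type" in node
    }
    return sorted(used - set(available))


# A tiny stand-in workflow; file names and the GGUF loader name are placeholders.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": "example.safetensors"}},
    "2": {"class_type": "HypotheticalGGUFLoader", "inputs": {"unet_name": "example.gguf"}},
    "3": {"class_type": "KSampler", "inputs": {}},
}

print(missing_node_types(workflow, {"CheckpointLoaderSimple", "KSampler"}))
```

Anything the function returns is a node your install does not provide, which in practice usually means a custom node pack still needs installing.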
What is a loader node?
A loader node is the node responsible for reading a particular type of model file into ComfyUI.
What is a custom node?
A custom node is not a model. It is a ComfyUI extension package.
What is GGUF, and why does it keep appearing?
A lot of people first met GGUF in the local LLM world. It is a file format for model weights, most often quantised ones, and is closely associated with llama.cpp-style tooling.
The appeal is straightforward:
- smaller or more manageable assets,
- less memory pressure,
- a fighting chance of running larger components on local machines.
The trade-offs are just as straightforward:
- you often need extra loader nodes,
- compatibility is more fiddly than with ordinary checkpoints,
- the workflow becomes more complicated.
Why do GGUF workflows need extra loaders or custom nodes?
Because ComfyUI’s built-in nodes do not necessarily know how to read these quantised formats.
That is why projects such as city96/ComfyUI-GGUF exist.
Why do some people convert GGUF back to safetensors?
Usually for:
- better compatibility,
- simpler workflow setup,
- fewer custom node dependencies.
What is safetensors, and why do people keep recommending it?
safetensors is a weight format popularised across the Hugging Face ecosystem.
Its attractions are simple:
- it is widely supported,
- it avoids the arbitrary-code-execution risks associated with pickle-style model loading,
- and it usually makes sharing weights less nerve-wracking.
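The pickle risk is easy to demonstrate with a harmless payload. In the sketch below, the Sneaky class is a toy invented for illustration: it tells pickle to call print when the data is loaded. A hostile file could name any other callable instead.

```python
import pickle


class Sneaky:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call print(...)".
        # A malicious file could point at a far less friendly callable.
        return (print, ("this line ran during pickle.loads",))


payload = pickle.dumps(Sneaky())

# Loading the bytes calls print; no Sneaky object ever comes back.
loaded = pickle.loads(payload)
print(type(loaded).__name__)
```

Loading a pickle can run code chosen by whoever made the file. A safetensors header, by contrast, is parsed as JSON, and the rest of the file is raw tensor bytes.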
What is a UNet, and why do some workflows want it in models/unet/?
In many diffusion architectures, the UNet is one of the core denoising components.
Some ComfyUI workflows package everything as a single checkpoint. Others split pieces apart. When that happens, you may see:
- a UNet stored separately,
- CLIP stored separately,
- VAE stored separately.
What is T5XXL, and why does Flux want it?
T5XXL is a large text encoder.
In Flux-style workflows, it is there to process prompt text in a richer way before that information reaches the image model.
Where do you get it?
Common sources include Google’s t5-v1_1-xxl and ComfyUI-adapted versions or quantised packs shared by the community.
The original Hugging Face repo is enormous. The main model.safetensors is listed at 44.5GB.
Q4 vs Q5
If you are using a GGUF quantised version, you will often see names like:
- Q4
- Q5
The practical distinction is:
- Q4 usually means lower resource use,
- Q5 usually means a bit more fidelity but more pressure on your machine.
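The size difference is mostly plain arithmetic: parameters times bits per weight, divided by eight. The sketch below assumes roughly 11 billion parameters, chosen only because 11 billion weights at 32 bits comes out near the 44.5GB file size quoted above. It also treats Q4 and Q5 as exactly 4 and 5 bits, which understates real GGUF files a little because quantised blocks also store scale data.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 11e9  # assumed parameter count, for illustration only

for label, bits in [("FP32", 32), ("FP16", 16), ("FP8", 8), ("5-bit", 5), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_size_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit figure is an eighth of the FP32 one. Whether that is small enough still depends on everything else your graph has to keep in memory at the same time.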
What is a CLIP text encoder, and why do you need it?
CLIP’s job is to place text and image concepts into a shared semantic space.
Is there an OOM risk?
A CLIP file on its own is not usually the main villain. The trouble appears when it works alongside a large T5XXL, a large main model, MPS, and the rest of the graph.
OOM means out of memory.
On Apple Silicon, it does not always present itself politely. Sometimes it means:
- the workflow becomes painfully slow,
- ComfyUI freezes,
- a generation step crashes,
- or the OS quietly cleans house.
What about Apple Silicon throttling?
Apple’s thermal and power management is generally quite good. Rather than exploding dramatically, the machine may simply become slower under sustained load.
What is a LoRA, and why would you use one?
A LoRA is a low-rank adaptation weight.
It lets you push a base model towards a specific style or use case without replacing the whole thing.
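"Low-rank" has a concrete meaning: instead of storing a full replacement for a weight matrix, a LoRA stores two thin matrices whose product is added to the original. The sketch below uses plain Python lists and tiny matrices. The names up, down, and strength are illustrative, and any extra scaling factor a real LoRA file may carry is left out.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, down, up, strength=1.0):
    """Return weight + strength * (up @ down) without modifying the inputs."""
    delta = matmul(up, down)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def lora_param_count(d_out, d_in, rank):
    """Values stored by a rank-`rank` update for a d_out x d_in matrix."""
    return rank * (d_out + d_in)


base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
up = [[1.0], [0.0], [2.0]]   # 3 x 1
down = [[0.5, 0.0, 0.0]]     # 1 x 3, so the update has rank 1

print(apply_lora(base, down, up))
print(lora_param_count(1024, 1024, 8), "vs", 1024 * 1024)
```

For a 1024x1024 matrix at rank 8, the update stores 16,384 values instead of 1,048,576. That is why LoRA files are small and why you can stack several on one base model.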
Why do people keep mentioning RealisticVision?
Because many users do not merely want “a face”. They want skin, lighting, lens feel, and texture that read more like photography.
Why does SDXL LCM sometimes look plasticky on its own?
Usually because:
- it prioritises speed,
- low-step generation tends to flatten texture,
- and a base workflow without realism-oriented finetuning can look overly smooth.
What is clip_l, and where do you get it?
clip_l usually refers to a CLIP-L / ViT-L style text encoder asset.
In some Flux workflows, it is explicitly required.
The common sources are:
- OpenAI’s CLIP ViT-L/14 Hugging Face repository,
- or ComfyUI-adapted packs provided alongside a workflow.
What is a VAE, and where do you get one?
The VAE handles image encoding and decoding.
In Flux and some community workflows, you may see specific VAE files referenced by name, such as ae.safetensors or other released assets.
Common sources are:
- official model repositories,
- workflow-specific releases,
- ComfyUI-oriented community packs.
What is KSampler? What is Flux2Scheduler? And why do sigma bugs happen?
What is KSampler?
KSampler is one of ComfyUI’s core sampling nodes.
It is effectively the stage that controls how noise turns into an image over a sequence of denoising steps.
What is Flux2Scheduler?
This is a scheduler node used in Flux-oriented workflows. Its job is to produce the noise schedule, the sequence of sigma values, that this model family expects.
What is a sigma mismatch, and why does it cause bugs?
Think of sigma as part of the denoising rhythm.
If the sampler, scheduler, model implementation, and loader path are not aligned, you can end up with:
- strange images,
- hard failures,
- or workflows that technically run but produce nonsense.
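To see why a mismatch matters, compare two schedules for the same step count. The sketch below is not what Flux2Scheduler or KSampler computes internally. It contrasts a Karras-style schedule, using illustrative sigma_min and sigma_max values, with a simple linear schedule running from 1 to 0, of the kind associated with flow-style models. All names and defaults here are assumptions for illustration.

```python
def karras_sigmas(n, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Karras-style schedule: n sigmas from sigma_max down to sigma_min."""
    if n < 2:
        raise ValueError("need at least two steps")
    lo = sigma_min ** (1 / rho)
    hi = sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]


def linear_flow_sigmas(n):
    """Simple linear schedule from 1.0 down to 0.0."""
    if n < 2:
        raise ValueError("need at least two steps")
    return [1 - i / (n - 1) for i in range(n)]


for name, sigmas in [("karras", karras_sigmas(5)), ("linear", linear_flow_sigmas(5))]:
    print(name, [round(s, 3) for s in sigmas])
```

Both lists describe "five steps of denoising", but they live on different scales. If a model expects one rhythm and is handed the other, you get the strange or nonsensical results described above.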
Why is GGUF + MPS more bug-prone?
Usually because several tricky pieces overlap:
- GGUF introduces a quantised path,
- MPS (Metal Performance Shaders) is the PyTorch backend for Apple GPUs,
- custom nodes may handle tensors in their own way,
- Flux-family schedulers can be more sensitive.
Why do some people switch to FP8 checkpoints? And why can CheckpointLoaderSimple replace a GGUF workflow?
What is an FP8 Flux checkpoint?
It is a checkpoint whose weights have been quantised to FP8 precision.
But do not assume “FP8” means “tiny”. Community FP8 Flux files are still often in the 10GB to 17GB range.
What is CheckpointLoaderSimple?
It is ComfyUI’s built-in loader for checkpoint files.
If you have a checkpoint format ComfyUI can read directly, and the matching text encoders / VAE are in place, then the workflow can become a lot simpler.
Why can that replace a GGUF workflow?
Because GGUF workflows often become complicated due to:
- extra quantised formats,
- extra loaders,
- extra custom node dependencies.
A practical installation order for beginners
Path A: SDXL / SD 1.5 beginner route
- install the checkpoint
- add the VAE if required
- add LoRAs afterwards
- test with a simple workflow
- only then try an LCM adapter
Path B: Flux-style workflow
- confirm the exact main model version
- confirm whether it needs clip_l, t5xxl, and a VAE
- confirm whether the graph expects safetensors or GGUF
- if GGUF is involved, install the custom node / loader first
- place every file in the correct folder
- if the graph goes red, check whether you are missing nodes or weights before blaming fate itself
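The "correct folder" and "missing weights" checks in Path B can be turned into a small preflight script. This is a sketch with placeholder names: check_assets, the manifest layout, and the file names in EXAMPLE_MANIFEST are all invented for illustration, so swap in whatever your workflow's notes actually list.

```python
from pathlib import Path


def check_assets(comfy_root, manifest):
    """Check that each expected file sits in its expected models/ subfolder.

    `manifest` maps a subfolder name to the file names expected there.
    Returns (folder, filename, status) tuples, where status is "ok",
    "missing", or "misplaced in <folder>" if it turned up elsewhere.
    """
    models = Path(comfy_root) / "models"
    report = []
    for folder, names in manifest.items():
        for name in names:
            if (models / folder / name).is_file():
                report.append((folder, name, "ok"))
                continue
            elsewhere = []
            if models.is_dir():
                elsewhere = sorted(
                    d.name for d in models.iterdir()
                    if d.is_dir() and (d / name).is_file()
                )
            status = f"misplaced in {', '.join(elsewhere)}" if elsewhere else "missing"
            report.append((folder, name, status))
    return report


# Placeholder file names; replace them with the ones your workflow asks for.
EXAMPLE_MANIFEST = {
    "vae": ["ae.safetensors"],
    "clip": ["clip_l.safetensors", "t5xxl.safetensors"],
}

for row in check_assets("ComfyUI", EXAMPLE_MANIFEST):
    print(row)
```

A "misplaced" result is the cheap fix: the download worked and the file only needs moving. A "missing" result means going back to the model page.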
One-sentence takeaway
The hardest part of installing models in ComfyUI is not downloading them. It is understanding the role each file plays inside the workflow.
Image Asset Plan
- filename: comfyui-model-file-structure.svg
  purpose: show the relationship between checkpoints, clip, loras, vae, unet, and custom_nodes in a single easy diagram
  placement: after the file-type table
  alt: Diagram of common ComfyUI model folders and file types
  prompt: Create a clean blog-friendly SVG diagram showing the ComfyUI model folder structure. Include models/checkpoints, models/clip, models/loras, models/vae, models/unet, and custom_nodes. Show what each folder is for with concise English labels, rounded rectangles, soft colours, and clear arrows from file type to loader node.
- filename: flux-workflow-assets-map.svg
  purpose: explain why a Flux workflow in ComfyUI may need a main model, CLIP, T5XXL, VAE, GGUF loader, and custom nodes
  placement: after the section on Flux pain points
  alt: Dependency map for a Flux workflow in ComfyUI
  prompt: Create a clean blog-friendly SVG architecture diagram for a Flux workflow in ComfyUI. Show the main model, CLIP, T5XXL, VAE, optional GGUF file, custom node / loader dependency, and KSampler / scheduler path. Keep it clean, minimal, and suitable for a technical blog.