If the first six articles in this series were mostly about sharpening concepts, setting boundaries, and clearing technical debris off the road, this one is about putting those ideas back into a real workflow and seeing how they actually hold together.
This is not meant to be a full tutorial on how to clone my job agent from scratch.
It is better thought of as a case study that answers questions like:
- why did I end up treating the JD, the CV, and the rubric as three different kinds of data?
- why should the CV be chunked finely while the rubric should not be chopped too aggressively?
- why can’t fields like source_type and memory_set just sit in payload as decorative metadata?
- why does the evidence pack need to be kept deliberately lean?
- why do some issues that look like model failures turn out to be retrieval-boundary problems instead?
If you have read the earlier pieces, this is the point where those rules stop sounding theoretical and start looking like practical ways of preventing very ordinary system failures.
The core claim
If I had to start with one line, it would be this:
What made this job agent more stable was never a magical model or a single clever tool. It was turning the JD, the CV, and the rubric into a retrieval system with a usable evidence pipeline.
The most important phrase there is not “agent”. It is evidence pipeline.
Because once the system actually runs for a while, many of the problems that appear to be model problems turn out to be something else entirely: the evidence is shaped wrongly, assembled in the wrong order, mixed across roles, or simply too bloated for the model to use cleanly.
If you treat it as “just give the JD and CV to the model”, you hit a wall quite quickly
The first naive version of this workflow is easy to imagine:
- pass the full JD
- pass the full CV
- add some rubric or scoring instructions
- ask the model to score, justify, and perhaps draft a cover letter
That is not completely useless.
For small documents, simple jobs, and one-off tasks, it may even look acceptable at first.
But once you try to turn it into a repeatable workflow, the weaknesses start surfacing:
- the CV is too long, so the model grabs the wrong details
- a hard gate in the rubric gets overlooked
- the same JD gets a slightly different score on different days
- the relevance rationale begins smuggling in narrative that was not really grounded in the JD
- the evidence pack gets fatter and fatter until you hit token ceilings or MAX_TOKENS trouble
At that point, this no longer feels like a prompt-tuning issue. It starts to look like what it really is:
the evidence units, data roles, and retrieval boundaries were never properly designed in the first place.
In this system, the JD, the CV, and the rubric are not the same kind of data
At first it is easy to think, “they are all just text, so surely I can embed them, index them, and retrieve them in roughly the same way.”
In practice, that turns out to be a poor mental model. These three sources do different jobs inside the system.
The JD: task context and factual requirements
The JD acts as the task frame. It tells the system:
- what the job is looking for
- what kinds of deliverables matter
- which requirements are explicit
- which constraints are non-negotiable
In many cases, the JD does not even need to be chunked immediately. Passing the full text first is often the steadier move, because the model needs the overall shape of the role rather than one isolated bullet point.
The CV: an evidence pool
The CV is not the rulebook, and it is not the task definition.
It is an evidence pool from which the system needs to pull the most relevant experience for the JD at hand.
That is why fat CV chunks are so dangerous.
If one chunk contains too many unrelated signals, then a single match on AI or growth language may bring back an oversized block, and the model may start dragging unrelated material into the reasoning.
The rubric: a rule system
The rubric is closer to a rulebook.
Its job is not to inspire the model. Its job is to constrain the scoring logic, preserve output shape, and keep important gates from drifting out of view.
That is why it should not be sliced like a CV.
You want rule groups that are self-contained. Otherwise top-1 or top-2 retrieval may bring back half a rule and leave the rest behind.
Chunking did not just change retrieval. It changed downstream output stability.
Why the CV needs finer cuts
Because the task is not “summarise this person”.
It is more like:
- which parts of this CV support this JD?
- which bullets are the best evidence for a given requirement?
- which angle is strongest for the final writing step?
That is an evidence-mapping problem, not a document-summary problem. So the CV benefits from being split by evidence-bearing themes or retrievable signals.
Why the rubric should not be too fragmented
If the rubric becomes too fragmented, retrieval may pull criteria without gates, format without constraints, or only half the scoring frame.
The resulting instability is especially annoying because it often does not look like a dramatic failure. The system simply becomes a little more inconsistent over time, which is far harder to spot and far more corrosive.
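The contrast between the two chunking styles can be sketched in a few lines. This is a minimal illustration, not the chunker this workflow actually used: it assumes the CV is written as "- " bullets and that rubric rule groups are separated by blank lines, and the names `Chunk`, `chunk_cv`, `chunk_rubric`, and the sample texts are all made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source_type: str  # "cv" or "rubric"
    text: str


def chunk_cv(cv_text: str) -> list[Chunk]:
    """Fine cuts: one chunk per bullet, so each chunk carries one evidence signal."""
    bullets = [
        line.strip()[2:].strip()
        for line in cv_text.splitlines()
        if line.strip().startswith("- ")
    ]
    return [Chunk("cv", b) for b in bullets if b]


def chunk_rubric(rubric_text: str) -> list[Chunk]:
    """Coarse cuts: split on blank lines only, so each rule group stays whole
    and a gate is never separated from the rules listed with it."""
    groups = re.split(r"\n\s*\n", rubric_text.strip())
    return [Chunk("rubric", g.strip()) for g in groups if g.strip()]


# Illustrative inputs only.
cv = """
- Led migration of search to a vector index
- Grew newsletter from 2k to 15k subscribers
"""
rubric = """Hard gates:
- must have right to work
- reject if role is unpaid

Scoring:
- 0-5 for requirement match
- 0-5 for evidence strength"""

cv_chunks = chunk_cv(cv)              # two small, single-signal chunks
rubric_chunks = chunk_rubric(rubric)  # two self-contained rule groups
```

With this split, a top-1 rubric hit returns the whole "Hard gates" group rather than one orphaned line of it, while a CV hit returns a single bullet rather than a whole career section.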
The evidence pack has to be kept deliberately lean
Many people react to unstable outputs by adding more evidence:
- more chunks
- higher top-k
- extra profile context
- more rules
But the evidence pack is not better simply because it is bigger.
A bloated pack does not necessarily make the answer more grounded. Often it just makes the model noisier and more expensive.
In this workflow, a more practical recipe looked roughly like:
- rubric: topK = 2
- profile: topK = 1
- CV: topK = 2 to 3
- JD: pass full text first, revisit JD chunk retrieval later if needed
That is useful not because those numbers are sacred, but because they force one important discipline:
evidence-pack design is part of retrieval design, not an afterthought.
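One way to make that discipline concrete is to encode the per-lane limits and assemble the pack in code instead of by hand. The sketch below is an assumption-laden illustration: `Hit`, `build_evidence_pack`, the lane priority order, and the character budget (a crude stand-in for a token budget) are hypothetical and not taken from the original system. Only the topK values come from the recipe above, with CV set to the upper end of its range.

```python
from dataclasses import dataclass

# Per-lane limits from the recipe above; the JD bypasses retrieval and is passed whole.
TOP_K = {"rubric": 2, "profile": 1, "cv": 3}


@dataclass
class Hit:
    source_type: str
    text: str
    score: float


def build_evidence_pack(jd_text: str, hits: list[Hit], max_chars: int = 6000) -> dict:
    """Keep the best few hits per lane and stop adding once the budget is spent."""
    pack = {"jd": jd_text, "rubric": [], "profile": [], "cv": []}
    used = len(jd_text)
    # Assumed priority: rubric first, so rules are the last thing to be squeezed out.
    for lane in ("rubric", "profile", "cv"):
        lane_hits = sorted(
            (h for h in hits if h.source_type == lane),
            key=lambda h: h.score,
            reverse=True,
        )
        for hit in lane_hits[: TOP_K[lane]]:
            if used + len(hit.text) > max_chars:
                break  # budget exhausted for this lane
            pack[lane].append(hit.text)
            used += len(hit.text)
    return pack
```

The point of writing it this way is that raising a limit becomes a visible change to `TOP_K` or `max_chars`, not a quiet habit of pasting in one more chunk.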
Why source_type and memory_set cannot remain decorative payload fields
As soon as the system starts storing several kinds of data in the same collection, metadata stops being decorative.
Fields such as:
- source_type
- memory_set
- rubric_id
- profile_id
- job_id
may look boring, but they define the retrieval boundaries.
If those fields are not treated as proper query dimensions, retrieval starts to become semantically plausible but structurally muddy.
This is why I ended up thinking about payload indexes as schema migration rather than as some annoying Qdrant requirement. The system is not just doing vector similarity. It is doing vector similarity plus structured boundaries plus role-aware evidence lanes. That means schema has entered the picture whether you like it or not.
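To show what "metadata as a query dimension" can look like, here is a hedged sketch that only builds request bodies as plain dictionaries and sends nothing. The shapes follow what Qdrant's REST documentation describes for creating a payload index and for a filtered points search, but treat them as an assumption to verify against the Qdrant version you run. The helper names and the example values "cv" and "job_agent_v1" are hypothetical.

```python
def payload_index_body(field_name: str) -> dict:
    """Body for a payload-index request: a keyword index on one payload field."""
    return {"field_name": field_name, "field_schema": "keyword"}


def lane_filter(source_type: str, memory_set: str) -> dict:
    """Filter clause restricting a vector search to a single evidence lane."""
    return {
        "must": [
            {"key": "source_type", "match": {"value": source_type}},
            {"key": "memory_set", "match": {"value": memory_set}},
        ]
    }


def search_body(vector: list[float], source_type: str, memory_set: str, limit: int) -> dict:
    """Search body combining vector similarity with the structured boundary."""
    return {
        "vector": vector,
        "filter": lane_filter(source_type, memory_set),
        "limit": limit,
        "with_payload": True,
    }


# The "migration": one index request body per boundary field.
index_bodies = [payload_index_body(f) for f in ("source_type", "memory_set")]

# A CV-lane query: similarity is only ever computed inside the lane.
cv_query = search_body([0.1, 0.2, 0.3], source_type="cv", memory_set="job_agent_v1", limit=3)
```

Treating the index bodies as a list that lives in version control is what makes this feel like a schema migration: adding a new boundary field becomes an explicit, reviewable change.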
Many things that look like model hallucination are really retrieval-boundary failures
This case also forces a useful habit: learning to ask whether the real failure happened in R (retrieval) or in G (generation).
A weak relevance rationale may look like a model problem.
But once you inspect the evidence pack, you often find something else:
- the right CV chunk never made it into retrieval
- the chunk that came back was too fat
- the rubric came back only partially
- source_type boundaries leaked
- top-k was high enough for noise to drown the evidence that actually mattered
At that point, the model is not really improvising from nowhere. It is often just explaining a messy context with alarming fluency.
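Inspecting the evidence before blaming the model can be partly automated. The following is a rough sketch under stated assumptions: the pack is a dict mapping each lane name to a list of chunk dicts with `source_type` and `text` keys, and `inspect_evidence_pack` and the 800-character threshold are invented for illustration.

```python
def inspect_evidence_pack(pack: dict[str, list[dict]], max_chunk_chars: int = 800) -> list[str]:
    """Return retrieval-side (R) warnings to review before suspecting generation (G)."""
    warnings = []
    for lane, chunks in pack.items():
        if not chunks:
            # The right evidence may simply never have been retrieved.
            warnings.append(f"{lane}: nothing retrieved")
        for chunk in chunks:
            if chunk.get("source_type") != lane:
                # A chunk from another role slipped across the boundary.
                warnings.append(
                    f"{lane}: leaked chunk with source_type={chunk.get('source_type')!r}"
                )
            if len(chunk.get("text", "")) > max_chunk_chars:
                # A fat chunk drags unrelated material into the context.
                warnings.append(f"{lane}: oversized chunk ({len(chunk['text'])} chars)")
    return warnings
```

An empty list does not prove the retrieval step was right, but a non-empty one is a strong hint that the weak rationale started in R rather than in G.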
The system became more stable not because it became more complex, but because the boundaries became clearer
The things that helped most were roughly these:
- separating the roles of JD, CV, and rubric
- using different chunking logic for CV and rubric
- keeping the evidence pack deliberately lean
- using fields like source_type and memory_set as real boundaries
- adopting the habit of inspecting evidence before blaming the model
- treating output stability as a retrieval-design concern rather than a lucky prompt side effect
When the lessons from this case do not generalise cleanly
Not every RAG system resembles job matching.
These rules work here because the task is fundamentally about:
- requirement matching
- evidence-backed scoring
- constrained writing
If the task were:
- FAQ assistance
- long-form summarisation
- contract Q&A
- code retrieval
- multimodal document reasoning
then the ideal chunking, retrieval, and schema patterns might look quite different.
So this article should be read as a case study with reusable lessons, not as a universal template.
The rules I actually trust after building this workflow
- separate data roles before you optimise retrieval
- the CV behaves like an evidence pool; the rubric behaves like a rulebook
- the evidence pack needs to stay intentionally lean
- metadata fields are not decoration; they define retrieval boundaries
- many apparent model problems are really evidence-boundary problems
- production RAG stability usually depends less on model power than on evidence-pipeline design
What I wanted this whole series to leave behind
The thing I most want to leave behind is probably not a single tooling conclusion. It is this feeling:
the hard part of RAG is often not the embeddings, not the model, and not even the vector database by itself. It is whether you have actually thought through the role, boundary, granularity, and flow of evidence.
Once those things become clear, the system starts to feel much less like magic and much more like engineering.