MCP Engineering Deep Dive 04: Skills Are Not the Same Thing as Runtime Tool Selection

Subtitle: Having skills in your repo does not mean the runtime will actually route by them. In practice, the model is far more influenced by the tool surface you expose, the live inventory a session can see, and whether your descriptions, schemas, and instructions line up.

This is not another “skills are important” article.

What I want to write down is something much more operational:

Writing a skill package does not guarantee that the runtime will behave like a skill-first system.

I learned that the hard way.

The request that exposed the gap was almost embarrassingly simple:

Score all the jobs that still do not have a score.

What I expected was:

route to job_scoring
stay on a score-only path
do not fetch fresh jobs again

What actually happened looked much closer to an ingestion path. That was the moment I stopped treating skills, tool metadata, live inventory, and runtime tool selection as a single thing.

Skill assets are not the same thing as runtime selection

The short version: a skill is an asset layer, not an automatic routing gate

FastMCP does provide skills utilities. You can discover, download, and sync skills, and it can identify skill resources using URIs such as skill://{name}/SKILL.md. That matters. But what it describes is how skill assets are discovered and retrieved, not “the model will definitely read those skills first and obey them”.

At the protocol level, the first-class object the model actually gets to invoke is still the tool. MCP’s tools spec is very plain about this: tools are exposed with a unique name and schema metadata. OpenAI’s tool design guidance is equally plain: write concise, explicit descriptions, because the model chooses what to send based on your description.

So the mental model I trust now looks like this:

skill: strategy asset, boundary asset, operating documentation
tool: the runtime capability surface the model actually sees
server instructions: the server-level operating manual
provider / inventory: what this specific session can list and call right now

If those four layers are not aligned, your system may look skill-first on disk while behaving quite differently at runtime.

The problem only became clear after I split it into four layers

This is the debug map I use now.

Layer 1: skill assets in the repo

This is the GitHub layer:

manifest.yaml
instructions.md
examples.md
eval-notes.md
router policy / fallback policy

This layer matters because it captures your design intent.

Layer 2: the tool surface actually exposed by the server

This is what the model really sees:

tool name
description / docstring
input schema
output shape
which tools are listed at all

This is the interface the runtime touches every single turn.

Layer 3: live deployment and session inventory

This is the layer many teams forget:

did you actually deploy the new code
did the inventory refresh
is the client session seeing the latest version
are you still effectively serving old metadata

Layer 4: runtime tool selection

This is the final behaviour:

which tools were listed
which one looked most plausible for the request
which schema seemed to fit the intent best
whether server instructions actually helped

The four layers that shape runtime selection

Once you separate those layers, a lot of “why is the model being stupid?” complaints turn into more useful engineering questions:

was the skill boundary poorly written?
or was the tool description sending the opposite signal?
or was the session still stale?
or were two tools simply too similar from the model’s point of view?

In practice, the decisive signal is often not the document you wrote for humans

OpenAI’s guidance for tools contains one sentence I would happily pin to a wall:

Write concise, explicit tool descriptions. The model chooses what to send based on your description.

That sounds like a writing tip. It is really a runtime systems tip.

If the model sees:

a tool called job_ingestion
a parameter called force_rescore
and no clear description saying “this is not the score-only path”

then routing a backfill-scoring request into ingestion is not bizarre at all.

That is exactly why a skill file can be correct while the runtime still drifts.

A surface that tends to mislead

name: job_ingestion
input_schema:
  type: object
  properties:
    source_site:
      type: string
    days:
      type: integer
    page_from:
      type: integer
    page_to:
      type: integer
    force_rescore:
      type: boolean

From a model’s point of view, that looks like a very capable Swiss Army knife.

A healthier separation

name: job_scoring
public: true
allowed_backend_tools:
  - v2_tool_bulk_score_new_jobs
guardrails:
  - Do not fetch recent jobs in this skill.
  - Do not claim a fetch or refresh happened in this path.

That is essentially the spirit of the job_scoring manifest in your repo: score-only really means score-only. Once the manifest, server docstring, schema, and examples all say the same thing, runtime drift becomes much less likely.

A deterministic policy on disk is still not the same thing as a live routing gate

Your router/skill-selection-policy.md is already doing an important piece of architectural work:

choose exactly one public skill first
prefer job_decision_support over job_scoring
prefer job_scoring over job_ingestion
keep job_querying as the broad read-oriented path
avoid automatic multi-skill chaining in v1

That is valuable because it defines how the system should think.

But I want to be very direct here:

Having that document does not mean the live runtime is enforcing it.

Why not? Because:

the policy file is design intent
runtime selection is the model reacting to the current visible tool surface

Unless you actually turn that policy into one or more of the following:

visibility rules
deterministic gating before exposure
tool subsets through allowed_tools
server-side filtering of which backend tools are reachable

then the policy remains closer to architecture guidance than an active traffic light.

The real fix is usually not more prompt text. It is a cleaner capability boundary

This is the lesson I trust most now.

If a request keeps drifting into the wrong tool, do not start by blaming the model. Start by asking:

Are two tools too similar from the model’s perspective?
Did we leak low-level capability into the public surface?
Are the manifest, description, schema, and examples out of sync?

The design moves that helped most

Move 1: one public tool, one primary intent

job_ingestion handles refresh / fetch
job_scoring handles score-only / backfill
job_querying handles shortlist / filter / rank
job_decision_support handles deep output for one job

Move 2: internal helpers should stay internal

A helper like resolve_job_reference can be critical without belonging on the public tool surface.

Move 3: change the manifest, docstring, schema, and examples together

If you only update one layer, the system still behaves like an instrument with one string out of tune.

The fix is boundary redesign, not just prompt tweaking

A more practical development loop

If you want a skill-based MCP server to behave like one in production, this is the workflow I now recommend.

Step 1: write the skill manifest before the server docstring

Be explicit about:

the public skill name
the user intent it owns
the backend tools it may call
what it must not do
which missing context should fail hard

Step 2: align the server tool surface with that manifest

the tool name should match the boundary
the description should say what it does and does not do
the schema should not keep ambiguous “maybe useful” parameters
hidden helpers should remain hidden

Step 3: restart the server and reconnect the client

Do not assume the session is automatically seeing your new inventory.

Step 4: verify the inventory with a minimal client

I strongly recommend checking list_tools() and list_skills() before asking ChatGPT to act as an oracle.

from fastmcp import Client
from fastmcp.utilities.skills import list_skills

async with Client("https://your-server.example.com/mcp") as client:
    tools = await client.list_tools()
    skills = await list_skills(client)
    print([t.name for t in tools])
    print([s.name for s in skills])

Step 5: only then test natural language requests

And do not only test the happy path.

Test at least:

score backfill
refresh + score
shortlist query
single-job deep analysis
mixed-intent requests

Server instructions are worth using, but not worth worshipping

The MCP maintainers have published a very good note on server instructions. I agree with the premise: server instructions are a kind of user manual for the model, and they can materially improve tool use.

But I would add one sentence of caution:

Server instructions help, but they do not replace a clean tool surface.

I treat them as:

a navigation layer above skills
a place to capture selection heuristics and prohibitions
a way to explain the operating style of the server

not as a magic patch for messy public tools.

The checklist I keep now

Before I re-test routing after changing skills or server metadata, I run this list:

does this request map to exactly one public tool?
is the repo manifest aligned with the docstring?
does the schema leak any misleading parameter?
are internal helpers actually hidden?
do the server instructions include the most important operating hints?
was the live deployment restarted?
was the client session refreshed?
do list_tools() and list_skills() show the inventory I think I shipped?

A practical testing loop for skills and runtime selection

The final sentence

The sentence I trust least now is this one:

The skill is written, so the model should know what to do.

No. At best, that means your strategy assets are in better shape.

Actual runtime behaviour still depends on:

what the public tool surface looks like
what inventory the current session can see
what server instructions say
which version the host has really loaded

If you want a skill-first MCP server to behave like one, do not stop at writing skills. Treat manifest design, docstrings, schemas, visibility rules, session refresh, and runtime tests as one engineering job.