Alex Ewerlöf Notes


RAG vs SKILL vs MCP vs RLM

Comparing various techniques to make the models more reliable while working around context window limitation

Alex Ewerlöf
Feb 25, 2026

LLMs are generalists. Regardless of whether they’re foundation models, instruct models or thinking models, there’s a limit to how much specialized work they can do.

Related post: Foundation vs. Instruct vs. Thinking Models (Alex Ewerlöf, December 24, 2025)

That’s where RAG, SKILL, MCP and RLM come in. These are methods that give generalist LLMs knowledge, tools and interfaces to do specialized tasks with more reliability.

This post briefly describes each technique: its implementation and usage mechanics, its pros and cons, and tips on when to use or avoid it.

Disclosure: some AI was used in the early research and draft stages of this page, but I’ve gone through everything multiple times and edited heavily to ensure that it represents my own thoughts and experience.

1. RAG

Retrieval-Augmented Generation

At its core, RAG is the AI equivalent of Just-In-Time (JIT) dependency injection. An LLM’s weights are static after training, so what if we want to add proprietary or up-to-date information?

RAG introduces an external lookup mechanism that executes before the user prompt is submitted to the model. The goal is to dynamically append highly relevant, specialized knowledge directly into the execution context.

Implementing RAG

Before RAG can be used, the knowledge base must be prepared and indexed into a searchable format.

  1. Ingestion: Raw domain data (documents, wikis, logs) is collected and parsed.

  2. Chunking: Text is split into smaller, semantically meaningful segments to fit within embedding and context window limits.

  3. Embedding: An embedding model converts the text chunks into high-dimensional vector representations (basically a numerical array).

  4. Storage: The vectors and their corresponding text chunks are saved in a vector database for rapid similarity search: SQLite-Vector, Postgres pgvector, Pinecone, or even a simple array loaded from JSON.
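The four indexing steps above can be sketched end to end. The bag-of-words `embed` function below is a runnable stand-in for a real embedding model, and the in-memory list stands in for a vector database; all names here are illustrative, not a real library’s API.

```python
import math

DIM = 64  # toy embedding dimension

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: a normalized bag-of-words hash vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, max_words: int = 50) -> list[str]:
    """Step 2: split text into word-bounded segments small enough to embed."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(documents: list[str]) -> list[dict]:
    """Steps 1-4: ingestion -> chunking -> embedding -> storage (here, a plain list)."""
    index = []
    for doc in documents:
        for piece in chunk(doc):
            index.append({"text": piece, "vector": embed(piece)})
    return index

index = build_index(["Deploys run on every merge to main. Rollbacks need approval."])
print(len(index), len(index[0]["vector"]))
```

A production pipeline would swap `embed` for a model call and the list for one of the vector stores mentioned above; the shape of the flow stays the same.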

Using RAG

When a user interacts with the system, the pre-built vector database is queried to inject context dynamically.

  1. Retrieval: The user’s query is intercepted and converted into a vector using the same embedding model.

  2. Search: The system performs a similarity search (e.g., cosine similarity) in the vector database to find the most relevant chunks.

  3. Injection: The retrieved text is prepended or appended to the user’s prompt as context.

  4. Generation: The LLM processes the augmented prompt to generate an informed response.
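The query path mirrors the indexing path. This sketch repeats the same toy bag-of-words embedding so the snippet is self-contained; a real system would query a vector DB rather than sort a Python list, and every name here is illustrative.

```python
import math

DIM = 64

def embed(text: str) -> list[float]:
    """Same toy embedding as at indexing time (the two MUST match)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already unit length

def retrieve(query: str, index: list[dict], k: int = 2) -> list[str]:
    """Steps 1-2: embed the query, rank chunks by cosine similarity, return top k."""
    qv = embed(query)
    ranked = sorted(index, key=lambda e: cosine(qv, e["vector"]), reverse=True)
    return [e["text"] for e in ranked[:k]]

index = [{"text": t, "vector": embed(t)} for t in (
    "Deploys run on every merge to main",
    "Lunch is served at noon",
)]
chunks = retrieve("deploys run on merge to main", index, k=1)
# Step 3: inject the retrieved text into the prompt before generation.
prompt = "Context:\n" + "\n".join(chunks) + "\n\nQuestion: when do deploys run?"
print(chunks[0])
```

Step 4 is then just sending `prompt` to the model; the model never knows a search happened.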

Pros and Cons of RAG

  • Pros: Conceptually simple. Decoupled from model implementation details. Strictly bounds the LLM to provided facts (reducing hallucinations). Heavily adopted with mature tooling. Requires zero model fine-tuning. RAG data and vector DB can be updated without touching LLM.

  • Cons: Highly dependent on the quality of the embedding model and chunking strategy. Lexical or semantic mismatch can cause silent retrieval failures. The vector DB introduces additional infrastructure overhead and state-management complexity.

RAG Use case

  • Use RAG when you need to query static or slowly changing knowledge bases (like corporate wikis, documentation, or historical logs) where the volume of data exceeds the LLM context window but fits well within a search paradigm.

  • Don’t use RAG for real-time transactional data, tasks requiring complex multi-step reasoning over the entire dataset, or when the specialized logic is behavioral rather than informational.

Note: if the dataset is small, you can skip the embedding and vector DB by including it directly in the system prompt. This method is called CAG (Cache-Augmented Generation), but due to its simplicity and limited application I didn’t give it its own section.
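For contrast, this is all CAG boils down to: no retrieval step, the whole (small) knowledge base rides along in the system prompt. The `build_messages` helper and the facts are hypothetical.

```python
# CAG sketch: no embeddings, no vector DB; the entire small knowledge base is
# inlined into the system prompt on every request.
KNOWLEDGE = [
    "Deploys run on every merge to main.",
    "Rollbacks require on-call approval.",
]

def build_messages(question: str) -> list[dict]:
    system = "Answer using only these facts:\n" + "\n".join(f"- {k}" for k in KNOWLEDGE)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_messages("Who approves rollbacks?")
print(messages[0]["role"], len(messages))
```

The obvious trade-off: every token of `KNOWLEDGE` is paid for on every call, which is why this only works for small datasets.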

See also

  • PageIndex: a different approach to RAG which skips the embedding mechanism altogether and instead provides a table of contents for the agent to navigate

  • GraphRAG: from Microsoft Research, another approach where a knowledge graph maps the relations between different chunks of information (as opposed to the simplest form of RAG, which we discussed).

2. SKILL

Dynamic Capability Loading

If RAG is like Just-In-Time dependency injection, SKILL operates like Dynamic Link Libraries (DLLs).

SKILL reverses the RAG flow: instead of a rigid vector search blindly injecting data, the LLM itself decides what capabilities it needs to acquire based on the context of the conversation.

This also eliminates the need for the embedding model and vector DB, making it much easier to use.

Implementing SKILL

SMEs (subject matter experts) must first define the skills, write the deterministic code, and iteratively evaluate the LLM’s routing behavior before deployment.

  1. Definition: Engineers write clear, concise descriptions of specific capabilities (e.g., “Financial Calculator”, “User Authentication Manager”).

  2. Scripting: Deterministic code scripts (e.g., Node.js or Python functions) are written to handle tasks that LLMs are bad at, like math or precise string formatting.

  3. Evaluation (Eval Loop): The skill is tested against a “golden dataset” of test queries. The evaluation framework checks if the LLM correctly routes to the new skill and if the tool returns the expected output. Failures trigger refinements to the skill description or the underlying scripts.

  4. Registration: Once the skill passes the evaluation threshold, the skill descriptions, manuals, and executable scripts are registered in a central Skill Registry (e.g. Anthropic’s) accessible to the AI applications.

Using SKILL

Each SKILL has a name and a description field.

  1. Capability Broadcasting: The system prompt is injected with a lightweight list of available skill summaries (their name and description).

  2. Evaluation: The LLM evaluates the user prompt against its known capabilities.

  3. Retrieval: If needed, the LLM requests to load the specific full skill manual or script it requires to solve the problem.

  4. Augmentation & Tooling: The system loads the requested skill. Optionally, the LLM makes tool calls based on the deterministic scripts offered by the skill (the orchestrator executes those calls).

  5. Execution: The LLM uses the results of the tool execution and the newly loaded context to formulate a highly specialized answer.
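The loop above can be sketched in a few lines. Everything here is illustrative and is not Anthropic’s skill format: `pick_skill` stands in for the LLM’s routing decision, and the lambdas stand in for deterministic skill scripts.

```python
# Step 1: only lightweight summaries (name + description) are broadcast; the
# full manual/script is loaded only after the model asks for it.
SKILLS = {
    "financial-calculator": {
        "description": "Compound interest and other precise math.",
        "manual": "Call run(principal, rate, years) for compound growth.",
        "run": lambda principal, rate, years: principal * (1 + rate) ** years,
    },
    "string-formatter": {
        "description": "Strict formatting of IDs and reference numbers.",
        "manual": "Call run(raw_id) to normalize an ID.",
        "run": lambda raw_id: raw_id.strip().upper(),
    },
}

def broadcast() -> str:
    """Build the lightweight capability list injected into the system prompt."""
    return "\n".join(f"{name}: {s['description']}" for name, s in SKILLS.items())

def pick_skill(user_prompt: str) -> str:
    # Stand-in for step 2: a real system asks the LLM to route using broadcast().
    return "financial-calculator" if "interest" in user_prompt else "string-formatter"

name = pick_skill("What is the compound interest on 1000 at 5% over 3 years?")
skill = SKILLS[name]                  # step 3: load the full skill on demand
result = skill["run"](1000, 0.05, 3)  # step 4: deterministic tool execution
print(name, result)
```

The key design choice is that the math runs in a script, not in the model: the LLM only routes and interprets, which is exactly the stochastic/deterministic split the Pros section below describes.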

Pros and Cons of SKILL

  • Pros: Drastically reduces token usage by loading only what is needed. Uses the LLM’s superior reasoning for routing rather than relying on a dumb embedding model and vector similarity. Allows mixing stochastic reasoning with deterministic script execution (excellent for math or rigid logic).

  • Cons: Introduces multi-turn latency before the user gets an answer. Requires a highly capable reasoning model to correctly identify which skill to load.

SKILL Use case

  • Use SKILL for agentic workflows where the LLM has access to hundreds of potential tools, but loading all tool definitions would bloat the context window or confuse the model (beyond roughly 50 tools, LLMs have difficulty picking the relevant one). It is particularly effective for offloading math or deterministic routing to simple scripts.

  • Don’t use SKILL for simple Q&A bots, low-latency synchronous APIs, or when using smaller, less capable models that struggle with multi-step gradual capability enhancement.

See also

  • npx skills: a CLI that pairs with skills.sh to find and download skills off the internet directly to your machine.

  • Skill specification: from Anthropic. They also have a “marketplace” for skills called Awesome Skills. Here’s another listing. And yet another one. Or this one or that one. As you can see there’s no shortage of these skill directories. 😄

3. MCP

Model Context Protocol

MCP is to AI what POSIX or an API gateway is to traditional software. Originally created to standardize how LLMs interact with external software (browsers, IDEs, databases, SaaS), MCP defines a strict client-server architecture. It exposes three core primitives:

  • Prompts: reusable prompt templates✳️

  • Tools: executable functions✳️

  • Resources: contextual data and files

✳️ You may notice that MCP has prompts and tools in common with SKILLs. Although MCP was originally introduced as a translation layer, many MCP servers are self-contained, composed of just a prompt and a tool. For those cases it’s better to use SKILLs.

Implementing MCP

The MCP server acts as an integration layer that must be configured to talk to external systems and rigorously tested for translation accuracy.

  1. Server Setup: An MCP server instance is provisioned on the network or locally. This can be as simple as a docker container or even an npx command.

  2. Configuration: Engineers define the Resources (e.g., file paths, database schemas) and Tools (e.g., API POST requests) the server will expose.

  3. Authentication & Routing: The server is configured with the necessary credentials and network routes to securely communicate with the target external systems. MCP server acts as an OAuth 2.1 resource server and MCP client acts as an OAuth 2.1 client.

  4. Integration Evaluation (Eval Loop): Automated test suites prompt an LLM to interact with the newly configured MCP server. The Eval framework validates that the LLM correctly discovers tools, forms valid JSON-RPC requests, and that the target API responds accurately without unintended state changes.

Using MCP

The LLM client establishes a standardized connection to interact with the environment.

  1. Discovery: The LLM client (not the raw LLM) connects to the MCP server and queries its available Prompts, Resources, and Tools via standardized JSON-RPC.

  2. Integration: The LLM reads a Resource (e.g., pulling a GitHub issue) or invokes a Tool (e.g., triggering a build pipeline). The LLM client makes the call.

  3. Translation: The MCP server translates the standardized LLM JSON-RPC request into the proprietary API calls of the target software.

  4. Callback: The external software executes the action and returns the result through the MCP server back to the LLM Client which will be handed to the LLM.

Pros and Cons of MCP

  • Pros: Decouples the LLM from the target API. Allows write-once, use-anywhere tool creation (an MCP server works with Claude, local tools, or custom interfaces alike).

  • Cons: The architecture can be heavy and rigid. As noted, self-contained MCPs often bundle too much context upfront, making them less efficient than the dynamic loading of SKILLs. The MCP client and server (which sit between the LLM and an API) add to the system complexity, and more complexity generally means more risk of things going wrong and less reliability.

MCP Use case

  • Use MCP for connecting LLMs to complex, stateful external systems (databases, SaaS platforms, local filesystems) where standardization, security, and reusability across different AI clients are a hard requirement.

  • Don’t use MCP for internal, tightly-coupled micro-agent interactions or self-contained tasks where the dynamic, lightweight nature of SKILL is more performant and cost-effective.

See also

  • WebMCP proposes two new APIs that allow browser agents to take action on behalf of the user:

    • Declarative API: Perform standard actions that can be defined directly in HTML forms.

    • Imperative API: Perform complex, more dynamic interactions that require JavaScript execution.

  • Chrome DevTools MCP is one of my favorites because it gives the agent “eyes” to see the result of its code and debug front-end issues.

  • Just like skills, there’s no shortage of MCP marketplaces and directories like this one, that one or this other one.

4. RLM

Recursive Language Models

If SKILL operates like DLLs, RLM is like FFI (foreign function interface).

RLM is the newest architectural evolution, functioning similarly to MapReduce combined with a recursive REPL (Read-Eval-Print Loop) like the one you find in Python or Node.js.

Its primary goal is to bypass the physical constraints of the LLM context window entirely. Instead of trying to stuff a massive prompt into the model, RLM treats the long prompt as a variable in an external environment that the model inspects programmatically.
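A toy version of that idea, with a hypothetical `llm_summarize` standing in for a real model call: the long text never enters a single model call; instead it is windowed, each window is summarized (map), and the joined summaries are recursed on (reduce) until the result fits in one call.

```python
WINDOW = 100  # max words per model call in this toy

def llm_summarize(text: str) -> str:
    # Stand-in for a real LLM call: keep the first sentence of the chunk.
    return text.split(". ")[0] + "."

def rlm(long_text: str) -> str:
    """Recursive map-reduce over text that exceeds the context window."""
    words = long_text.split()
    if len(words) <= WINDOW:
        return llm_summarize(long_text)  # base case: fits in one call
    # Map: summarize each window independently.
    parts = [
        llm_summarize(" ".join(words[i:i + WINDOW]))
        for i in range(0, len(words), WINDOW)
    ]
    # Reduce: recurse on the concatenated summaries.
    return rlm(" ".join(parts))

doc = ". ".join(f"Fact {i} is about subsystem {i % 3}" for i in range(200)) + "."
summary = rlm(doc)  # 1200 words in, but no single call ever sees more than 100
print(summary)
```

A real RLM additionally lets the model write code against the environment (grep, slice, re-query the text) rather than just summarize, but the recursion-until-it-fits shape is the same.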

Implementing RLM
