17. AI Inference Course: How Modern AI Systems Think, Retrieve, Call Tools, and Serve Users

Course Positioning

This is a technical course on what happens after a model exists: prompting, tokenization, decoding, context management, retrieval, tool use, structured outputs, evaluation, serving, latency, cost, safety, and inference-time optimization. It is ideal for people building AI products without training frontier models.

Learning outcomes

Explain the inference-time behavior of transformer language models: tokens, logits, sampling, context windows, and decoding.
Build reliable LLM applications using prompt engineering, RAG, function calling, structured outputs, and guardrails.
Evaluate model outputs with task-specific metrics, golden datasets, human review, and automated checks.
Optimize for latency, cost, reliability, safety, and user experience.
Prototype an inference-time system that improves model usefulness without retraining.

Expanded Topic-by-Topic Coverage

Module 1. The inference stack

Module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer. Primary live activity or lab: Trace a user request through an LLM app architecture. Expected take-home output: Inference system diagram.

Topics and coverage

Model

What it means: place Model inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.

tokenizer

What it means: define tokenizer clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

prompt

What it means: explain how prompt changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

context

What it means: define context clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

decoding

What it means: define decoding clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

output parser

What it means: define output parser clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

tools

What it means: define tools clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

retrieval

What it means: explain how retrieval changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

application layer

What it means: define application layer clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Trace a user request through an LLM app architecture.
Learners produce: Inference system diagram.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 2. Tokenization and context

Module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes. Primary live activity or lab: Inspect tokenization and design context budget. Expected take-home output: Context budget plan.

Topics and coverage

Tokens

What it means: define Tokens clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

context windows

What it means: define context windows clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

truncation

What it means: define truncation clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

prompt packing

What it means: explain how prompt packing changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

long-context failure modes

What it means: define long-context failure modes clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Inspect tokenization and design context budget.
Learners produce: Context budget plan.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 3. Decoding and sampling

Module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity. Primary live activity or lab: Compare outputs across decoding settings. Expected take-home output: Decoding experiment log.

Topics and coverage

Greedy decoding

What it means: define Greedy decoding clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

temperature

What it means: define temperature clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

top-p

What it means: define top-p clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

top-k

What it means: define top-k clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

beams

What it means: define beams clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

repetition

What it means: define repetition clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

determinism

What it means: define determinism clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

diversity

What it means: define diversity clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Compare outputs across decoding settings.
Learners produce: Decoding experiment log.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 4. Prompt programs and structured outputs

Module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation. Primary live activity or lab: Build a structured extraction prompt with validation. Expected take-home output: Schema-based extractor.

Topics and coverage

System/developer/user prompts

What it means: explain how System/developer/user prompts changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

schemas

What it means: define schemas clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

JSON

What it means: define JSON clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

constrained decoding

What it means: define constrained decoding clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

output validation

What it means: define output validation clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Build a structured extraction prompt with validation.
Learners produce: Schema-based extractor.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 5. Retrieval augmented generation

Module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes. Primary live activity or lab: Build a mini RAG pipeline over sample documents. Expected take-home output: RAG prototype.

Topics and coverage

Embeddings

What it means: explain how Embeddings changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

chunking

What it means: define chunking clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

vector search

What it means: define vector search clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

hybrid search

What it means: define hybrid search clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

reranking

What it means: define reranking clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

citations

What it means: define citations clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

grounding

What it means: define grounding clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

failure modes

What it means: define failure modes clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Build a mini RAG pipeline over sample documents.
Learners produce: RAG prototype.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 6. Tool use and function calling

Module focus: APIs, calculators, search, databases, tool selection, error handling, permissions. Primary live activity or lab: Create a function-calling workflow for a simple task. Expected take-home output: Tool-using assistant.

Topics and coverage

APIs

What it means: define APIs clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

calculators

What it means: define calculators clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

search

What it means: define search clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

databases

What it means: connect databases to the data lifecycle from source and structure through analysis, interpretation, and decision-making.
What to cover: source reliability, missing or biased data, leakage, assumptions, calculations, and the difference between correlation and decision-ready evidence.
Demonstration: walk through a small dataset or example table and mark the checks required before trusting the result.
Evidence of learning: learners produce a short analysis note that includes assumptions, limitations, and verification steps.

tool selection

What it means: define tool selection clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

error handling

What it means: define error handling clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

permissions

What it means: define permissions clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Create a function-calling workflow for a simple task.
Learners produce: Tool-using assistant.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 7. Evaluation and observability

Module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces. Primary live activity or lab: Design an evaluation set and scoring rubric. Expected take-home output: Eval harness.

Topics and coverage

Gold sets

What it means: define Gold sets clearly and connect it to the module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

unit tests for prompts

What it means: explain how unit tests for prompts changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

LLM-as-judge caveats

What it means: explain how LLM-as-judge caveats changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

regression testing

What it means: place regression testing inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.

logging

What it means: define logging clearly and connect it to the module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

traces

What it means: define traces clearly and connect it to the module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Design an evaluation set and scoring rubric.
Learners produce: Eval harness.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 8. Latency, cost, and reliability

Module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits. Primary live activity or lab: Estimate cost/latency for three architectures. Expected take-home output: Serving plan.

Topics and coverage

Caching

What it means: define Caching clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

batching

What it means: define batching clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

streaming

What it means: define streaming clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

model routing

What it means: place model routing inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.

smaller models

What it means: place smaller models inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.

fallbacks

What it means: define fallbacks clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

rate limits

What it means: define rate limits clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Estimate cost/latency for three architectures.
Learners produce: Serving plan.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 9. Safety and guardrails

Module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases. Primary live activity or lab: Red-team a RAG/tool app. Expected take-home output: Safety test report.

Topics and coverage

Input filtering

What it means: define Input filtering clearly and connect it to the module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

output moderation

What it means: define output moderation clearly and connect it to the module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

prompt injection

What it means: explain how prompt injection changes the interaction between human intent, model behavior, external information, and final output.
What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.

data leakage

What it means: connect data leakage to the data lifecycle from source and structure through analysis, interpretation, and decision-making.
What to cover: source reliability, missing or biased data, leakage, assumptions, calculations, and the difference between correlation and decision-ready evidence.
Demonstration: walk through a small dataset or example table and mark the checks required before trusting the result.
Evidence of learning: learners produce a short analysis note that includes assumptions, limitations, and verification steps.

privacy

What it means in this course: define privacy in operational terms, not as an abstract principle.
What to cover: sensitive data boundaries, affected stakeholders, approval paths, documentation, and what engineers, technical PMs, researchers, advanced students, founders must never delegate blindly to AI.
Use case: present one acceptable use, one borderline use, and one prohibited use, then ask learners to justify the classification.
Evidence of learning: learners add a risk control, review step, or escalation rule to their course project.

abuse cases

What it means: define abuse cases clearly and connect it to the module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Red-team a RAG/tool app.
Learners produce: Safety test report.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Module 10. Inference-time optimization project

Module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning. Primary live activity or lab: Improve a baseline assistant on a task. Expected take-home output: Final technical prototype.

Topics and coverage

Self-consistency

What it means: define Self-consistency clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

critique loops

What it means: define critique loops clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

verifier models

What it means: place verifier models inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.

search

What it means: define search clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

reranking

What it means: define reranking clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

memory

What it means: define memory clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

planning

What it means: define planning clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
Demonstration: give one simple example, one realistic example, and one failure or limitation example.
Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.

Practice and evidence of learning

Learners complete or discuss: Improve a baseline assistant on a task.
Learners produce: Final technical prototype.
Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.

Minimum coverage before moving on

Learners can explain the module vocabulary without relying on tool-generated text.
Learners have seen one worked example, one hands-on application, and one limitation or failure case.
Learners know what must be verified, what data must be protected, and who remains accountable for the output.

Labs, projects, and assessments

Lab 1: Token/context budgeting and decoding experiments.
Lab 2: Structured output extractor with schema validation and failure handling.
Lab 3: Mini RAG system with evaluation set and citation checks.
Lab 4: Tool-calling assistant with logs and retries.
Capstone: Inference-time application with eval harness, cost/latency notes, safety tests, and deployment plan.

Evaluation approach

15% conceptual quizzes.
20% structured output and decoding labs.
25% RAG/tool-use implementation.
20% evaluation harness.
20% final prototype and technical report.

Recommended tools and materials

Python, notebooks, LLM APIs or local models, vector database, embedding model, LangChain/LlamaIndex or lightweight custom harness, FastAPI/Streamlit optional.
Optional: OpenTelemetry-style tracing, promptfoo/evals tools, Docker.

Safety, ethics, and governance emphasis

Teach prompt injection and data exfiltration as first-class risks.
Use synthetic or public datasets for labs.
Do not connect tools with real-world side effects until permissions, logging, and human approval are implemented.

Delivery notes

This course should be code-heavy and lab-driven.
A lighter technical-PM version can remove implementation details and focus on architecture, tradeoffs, and vendor evaluation.

Instructor Build Checklist

Prepare one short demo for each module and one learner activity that creates a saved artifact.
Prepare examples that match the audience, local context, and likely tools learners can access.
Add a verification step to every AI-generated output: factual check, source check, data sensitivity check, and quality review.
Keep a running portfolio folder so each module contributes to the final project or learner playbook.
Reserve time for reflection on what the learner did, what AI did, what was checked, and what remains uncertain.