17. AI Inference Course: How Modern AI Systems Think, Retrieve, Call Tools, and Serve Users
Course Positioning
This is a technical course on what happens after a model exists: prompting, tokenization, decoding, context management, retrieval, tool use, structured outputs, evaluation, serving, latency, cost, safety, and inference-time optimization. It is ideal for people building AI products without training frontier models.
Learning outcomes
- Explain the inference-time behavior of transformer language models: tokens, logits, sampling, context windows, and decoding.
- Build reliable LLM applications using prompt engineering, RAG, function calling, structured outputs, and guardrails.
- Evaluate model outputs with task-specific metrics, golden datasets, human review, and automated checks.
- Optimize for latency, cost, reliability, safety, and user experience.
- Prototype an inference-time system that improves model usefulness without retraining.
Expanded Topic-by-Topic Coverage
Module 1. The inference stack
Module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer. Primary live activity or lab: Trace a user request through an LLM app architecture. Expected take-home output: Inference system diagram.
Topics and coverage
Model
- What it means: place Model inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
- What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
- Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
- Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.
tokenizer
- What it means: define tokenizer clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
prompt
- What it means: explain how prompt changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
context
- What it means: define context clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
decoding
- What it means: define decoding clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
output parser
- What it means: define output parser clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
tools
- What it means: define tools clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
retrieval
- What it means: explain how retrieval changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
application layer
- What it means: define application layer clearly and connect it to the module focus: Model, tokenizer, prompt, context, decoding, output parser, tools, retrieval, application layer.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Trace a user request through an LLM app architecture.
- Learners produce: Inference system diagram.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 2. Tokenization and context
Module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes. Primary live activity or lab: Inspect tokenization and design context budget. Expected take-home output: Context budget plan.
Topics and coverage
Tokens
- What it means: define Tokens clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
context windows
- What it means: define context windows clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
truncation
- What it means: define truncation clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
prompt packing
- What it means: explain how prompt packing changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
long-context failure modes
- What it means: define long-context failure modes clearly and connect it to the module focus: Tokens, context windows, truncation, prompt packing, long-context failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Inspect tokenization and design context budget.
- Learners produce: Context budget plan.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 3. Decoding and sampling
Module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity. Primary live activity or lab: Compare outputs across decoding settings. Expected take-home output: Decoding experiment log.
Topics and coverage
Greedy decoding
- What it means: define Greedy decoding clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
temperature
- What it means: define temperature clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
top-p
- What it means: define top-p clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
top-k
- What it means: define top-k clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
beams
- What it means: define beams clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
repetition
- What it means: define repetition clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
determinism
- What it means: define determinism clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
diversity
- What it means: define diversity clearly and connect it to the module focus: Greedy decoding, temperature, top-p, top-k, beams, repetition, determinism, diversity.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Compare outputs across decoding settings.
- Learners produce: Decoding experiment log.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 4. Prompt programs and structured outputs
Module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation. Primary live activity or lab: Build a structured extraction prompt with validation. Expected take-home output: Schema-based extractor.
Topics and coverage
System/developer/user prompts
- What it means: explain how System/developer/user prompts changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
schemas
- What it means: define schemas clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
JSON
- What it means: define JSON clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
constrained decoding
- What it means: define constrained decoding clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
output validation
- What it means: define output validation clearly and connect it to the module focus: System/developer/user prompts, schemas, JSON, constrained decoding, output validation.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Build a structured extraction prompt with validation.
- Learners produce: Schema-based extractor.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 5. Retrieval augmented generation
Module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes. Primary live activity or lab: Build a mini RAG pipeline over sample documents. Expected take-home output: RAG prototype.
Topics and coverage
Embeddings
- What it means: explain how Embeddings changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
chunking
- What it means: define chunking clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
vector search
- What it means: define vector search clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
hybrid search
- What it means: define hybrid search clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
reranking
- What it means: define reranking clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
citations
- What it means: define citations clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
grounding
- What it means: define grounding clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
failure modes
- What it means: define failure modes clearly and connect it to the module focus: Embeddings, chunking, vector search, hybrid search, reranking, citations, grounding, failure modes.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Build a mini RAG pipeline over sample documents.
- Learners produce: RAG prototype.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 6. Tool use and function calling
Module focus: APIs, calculators, search, databases, tool selection, error handling, permissions. Primary live activity or lab: Create a function-calling workflow for a simple task. Expected take-home output: Tool-using assistant.
Topics and coverage
APIs
- What it means: define APIs clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
calculators
- What it means: define calculators clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
search
- What it means: define search clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
databases
- What it means: connect databases to the data lifecycle from source and structure through analysis, interpretation, and decision-making.
- What to cover: source reliability, missing or biased data, leakage, assumptions, calculations, and the difference between correlation and decision-ready evidence.
- Demonstration: walk through a small dataset or example table and mark the checks required before trusting the result.
- Evidence of learning: learners produce a short analysis note that includes assumptions, limitations, and verification steps.
tool selection
- What it means: define tool selection clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
error handling
- What it means: define error handling clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
permissions
- What it means: define permissions clearly and connect it to the module focus: APIs, calculators, search, databases, tool selection, error handling, permissions.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Create a function-calling workflow for a simple task.
- Learners produce: Tool-using assistant.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 7. Evaluation and observability
Module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces. Primary live activity or lab: Design an evaluation set and scoring rubric. Expected take-home output: Eval harness.
Topics and coverage
Gold sets
- What it means: define Gold sets clearly and connect it to the module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
unit tests for prompts
- What it means: explain how unit tests for prompts changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
LLM-as-judge caveats
- What it means: explain how LLM-as-judge caveats changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
regression testing
- What it means: place regression testing inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
- What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
- Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
- Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.
logging
- What it means: define logging clearly and connect it to the module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
traces
- What it means: define traces clearly and connect it to the module focus: Gold sets, unit tests for prompts, LLM-as-judge caveats, regression testing, logging, traces.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Design an evaluation set and scoring rubric.
- Learners produce: Eval harness.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 8. Latency, cost, and reliability
Module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits. Primary live activity or lab: Estimate cost/latency for three architectures. Expected take-home output: Serving plan.
Topics and coverage
Caching
- What it means: define Caching clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
batching
- What it means: define batching clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
streaming
- What it means: define streaming clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
model routing
- What it means: place model routing inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
- What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
- Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
- Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.
smaller models
- What it means: place smaller models inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
- What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
- Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
- Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.
fallbacks
- What it means: define fallbacks clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
rate limits
- What it means: define rate limits clearly and connect it to the module focus: Caching, batching, streaming, model routing, smaller models, fallbacks, rate limits.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Estimate cost/latency for three architectures.
- Learners produce: Serving plan.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 9. Safety and guardrails
Module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases. Primary live activity or lab: Red-team a RAG/tool app. Expected take-home output: Safety test report.
Topics and coverage
Input filtering
- What it means: define Input filtering clearly and connect it to the module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
output moderation
- What it means: define output moderation clearly and connect it to the module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
prompt injection
- What it means: explain how prompt injection changes the interaction between human intent, model behavior, external information, and final output.
- What to cover: inputs, constraints, examples, output format, grounding, iteration, failure modes, and when a human must intervene.
- Demonstration: show a weak attempt, a stronger structured attempt, and a reviewed final version with explicit checks.
- Evidence of learning: learners create a reusable prompt, schema, retrieval note, or workflow pattern and test it on at least two examples.
data leakage
- What it means: connect data leakage to the data lifecycle from source and structure through analysis, interpretation, and decision-making.
- What to cover: source reliability, missing or biased data, leakage, assumptions, calculations, and the difference between correlation and decision-ready evidence.
- Demonstration: walk through a small dataset or example table and mark the checks required before trusting the result.
- Evidence of learning: learners produce a short analysis note that includes assumptions, limitations, and verification steps.
privacy
- What it means in this course: define privacy in operational terms, not as an abstract principle.
- What to cover: sensitive data boundaries, affected stakeholders, approval paths, documentation, and what engineers, technical PMs, researchers, advanced students, founders must never delegate blindly to AI.
- Use case: present one acceptable use, one borderline use, and one prohibited use, then ask learners to justify the classification.
- Evidence of learning: learners add a risk control, review step, or escalation rule to their course project.
abuse cases
- What it means: define abuse cases clearly and connect it to the module focus: Input filtering, output moderation, prompt injection, data leakage, privacy, abuse cases.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Red-team a RAG/tool app.
- Learners produce: Safety test report.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Module 10. Inference-time optimization project
Module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning. Primary live activity or lab: Improve a baseline assistant on a task. Expected take-home output: Final technical prototype.
Topics and coverage
Self-consistency
- What it means: define Self-consistency clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
critique loops
- What it means: define critique loops clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
verifier models
- What it means: place verifier models inside the AI system stack so learners know what problem it solves and what tradeoffs it introduces.
- What to cover: inputs, outputs, system boundaries, evaluation criteria, cost or latency implications, and common failure cases.
- Demonstration: use a diagram, small code sample, worksheet, or tool trace to make the mechanism visible.
- Evidence of learning: learners compare two approaches and explain which one they would choose for a realistic constraint.
search
- What it means: define search clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
reranking
- What it means: define reranking clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
memory
- What it means: define memory clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
planning
- What it means: define planning clearly and connect it to the module focus: Self-consistency, critique loops, verifier models, search, reranking, memory, planning.
- What to cover: the core concept, why it matters, what good usage looks like, and where learners are likely to misunderstand it.
- Demonstration: give one simple example, one realistic example, and one failure or limitation example.
- Evidence of learning: learners explain the topic in their own words and apply it to a small artifact or decision.
Practice and evidence of learning
- Learners complete or discuss: Improve a baseline assistant on a task.
- Learners produce: Final technical prototype.
- Instructor checks for accuracy, practical usefulness, clear assumptions, appropriate human review, and fit with the course audience.
- Learners revise once after feedback so the module contributes to the final project, portfolio, or playbook.
Minimum coverage before moving on
- Learners can explain the module vocabulary without relying on tool-generated text.
- Learners have seen one worked example, one hands-on application, and one limitation or failure case.
- Learners know what must be verified, what data must be protected, and who remains accountable for the output.
Labs, projects, and assessments
- Lab 1: Token/context budgeting and decoding experiments.
- Lab 2: Structured output extractor with schema validation and failure handling.
- Lab 3: Mini RAG system with evaluation set and citation checks.
- Lab 4: Tool-calling assistant with logs and retries.
- Capstone: Inference-time application with eval harness, cost/latency notes, safety tests, and deployment plan.
Evaluation approach
- 15% conceptual quizzes.
- 20% structured output and decoding labs.
- 25% RAG/tool-use implementation.
- 20% evaluation harness.
- 20% final prototype and technical report.
Recommended tools and materials
- Python, notebooks, LLM APIs or local models, vector database, embedding model, LangChain/LlamaIndex or lightweight custom harness, FastAPI/Streamlit optional.
- Optional: OpenTelemetry-style tracing, promptfoo/evals tools, Docker.
Safety, ethics, and governance emphasis
- Teach prompt injection and data exfiltration as first-class risks.
- Use synthetic or public datasets for labs.
- Do not connect tools with real-world side effects until permissions, logging, and human approval are implemented.
Delivery notes
- This course should be code-heavy and lab-driven.
- A lighter technical-PM version can remove implementation details and focus on architecture, tradeoffs, and vendor evaluation.
Instructor Build Checklist
- Prepare one short demo for each module and one learner activity that creates a saved artifact.
- Prepare examples that match the audience, local context, and likely tools learners can access.
- Add a verification step to every AI-generated output: factual check, source check, data sensitivity check, and quality review.
- Keep a running portfolio folder so each module contributes to the final project or learner playbook.
- Reserve time for reflection on what the learner did, what AI did, what was checked, and what remains uncertain.