---
name: ai-llm
description: >
  Use this skill for any AI or LLM-related development task. Triggers include:
  building apps that call LLM APIs (OpenAI, Anthropic, Gemini, etc.), designing
  prompt templates, chaining prompts, building agents or tool-use systems,
  setting up RAG pipelines, embeddings, vector search, LLM evaluation,
  fine-tuning workflows, or any task where Claude is helping build something
  that itself uses a language model. Use this skill even for partial AI tasks
  like "add an AI feature to my app" or "help me write a better system prompt".
  If the user mentions GPT, Claude API, LangChain, LlamaIndex, embeddings, RAG,
  agents, or prompt engineering — always trigger this skill.
---

# AI / LLM Development Skill

This skill guides Claude to produce high-quality, production-grade AI/LLM code, prompts, and architecture — avoiding common pitfalls and following best practices across the full LLM stack.

---

## 1. Identify the Task Type

Before writing anything, classify what the user needs:

| Task | Go To |
|---|---|
| Prompt design / system prompt | § Prompt Engineering |
| API integration (any LLM provider) | § API Integration |
| Agent / tool-use system | § Agents & Tool Use |
| RAG / embeddings / vector search | § RAG Pipeline |
| LLM evaluation / testing | § Evaluation |
| Fine-tuning | § Fine-Tuning |

---

## 2. Prompt Engineering

### Principles

- **Be explicit**: State the role, task, constraints, and output format clearly.
- **Use delimiters**: Separate instructions from content with XML-style tags, triple backticks, or `---`.
- **Output format first**: Specify JSON, markdown, plain text, etc. upfront.
- **Few-shot when needed**: Include 2–3 examples for non-obvious tasks.
- **Chain of thought**: Add "think step by step" for reasoning tasks.

### System Prompt Template

```
You are a [ROLE] that [PRIMARY PURPOSE].

## Task
[Clear description of what the model should do]

## Rules
- [Constraint 1]
- [Constraint 2]

## Output Format
[Exact format, schema, or example]
```

### Common Prompt Patterns

- **Classification**: Include all possible labels and definitions.
- **Extraction**: Provide the JSON schema in the prompt.
- **Summarization**: Specify length, style, and what to preserve.
- **Code generation**: Specify language, libraries, and style guide.

---

## 3. API Integration

### Anthropic (Claude)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # recommended default
    max_tokens=1024,
    system="Your system prompt here",
    messages=[{"role": "user", "content": "User message"}],
)
print(response.content[0].text)
```

### OpenAI

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "System prompt"},
        {"role": "user", "content": "User message"},
    ],
)
print(response.choices[0].message.content)
```

### Best Practices

- Always set `max_tokens` to avoid runaway costs.
- Use **structured outputs** (JSON mode / `response_format`) for parseable responses.
- Implement **retry logic** with exponential backoff for rate limits.
- Store API keys in environment variables; never hardcode them.
- Log token usage for cost tracking.

```python
# Retry with exponential backoff plus jitter
import random
import time

from anthropic import RateLimitError  # or the equivalent from your provider's SDK

def call_with_retry(fn, retries=3):
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s + jitter
    raise Exception("Max retries exceeded")
```

---

## 4. Agents & Tool Use

### When to use agents

- Multi-step tasks that require dynamic decisions
- Tasks that need external data (search, DB, APIs)
- Tasks where the number of steps isn't known upfront

### Tool Definition Pattern (Anthropic)

```python
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)
```

### Agent Loop Pattern

```python
messages = [{"role": "user", "content": user_input}]

while True:
    response = client.messages.create(
        model=MODEL, max_tokens=1024, tools=tools, messages=messages
    )
    if response.stop_reason != "tool_use":
        break  # model finished its turn; no tools left to run

    # Execute every tool call in the response and feed the results back
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)
            tool_results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": result}
            )

    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
```

---

## 5. RAG Pipeline

### Architecture

```
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embed Query → Similarity Search → Top-K Chunks
                                        ↓
                         LLM (chunks + query) → Answer
```

### Chunking Strategy

- **Fixed size**: 512–1024 tokens with 10–20% overlap — good default.
- **Semantic**: Split on paragraphs/sections — better for structured docs.
- **Recursive**: LangChain's `RecursiveCharacterTextSplitter` — best general choice.
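
A minimal sketch of the fixed-size strategy above, using characters as a stand-in for tokens (the `chunk_size` and `overlap` values are illustrative; a real pipeline would count tokens with the model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks, with `overlap` characters shared
    between neighboring chunks so context isn't cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(text), step):
        chunk = text[start : start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```

The same shape works for token-based chunking: replace string slicing with slicing over a token list and decode each slice back to text.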
### Embedding Models

| Provider | Model | Dims | Use Case |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | General, cost-effective |
| OpenAI | text-embedding-3-large | 3072 | High accuracy |
| Cohere | embed-v3 | 1024 | Multilingual |
| Local | nomic-embed-text | 768 | Privacy / offline |

### Vector Stores

- **Pinecone** — managed, production-ready
- **Weaviate** — open-source, hybrid search
- **pgvector** — if already using Postgres
- **ChromaDB** — local dev / prototyping
- **FAISS** — in-memory, no server needed

### Retrieval Tips

- Use **hybrid search** (vector + keyword) for better recall.
- Add **metadata filters** to narrow the search space.
- Re-rank results with a cross-encoder for precision.
- Always include source attribution in the prompt.

---

## 6. Evaluation

### What to measure

| Metric | Tool |
|---|---|
| Correctness (vs ground truth) | LLM-as-judge, RAGAS |
| Faithfulness (RAG) | RAGAS |
| Latency / cost | Custom logging |
| Hallucination rate | LLM-as-judge |

### LLM-as-Judge Pattern

```python
# Doubled braces survive .format(); single-brace fields get filled in.
JUDGE_PROMPT = """
You are an evaluator. Given a question, a reference answer, and a model answer,
rate the model answer on a scale of 1–5 for correctness.
Return JSON: {{"score": int, "reason": str}}

Question: {question}
Reference: {reference}
Model Answer: {answer}
"""
```

### Regression Testing

- Store (input, expected_output) pairs in a JSON file.
- Run evals on every prompt change.
- Track score trends over time.

---

## 7. Fine-Tuning

### When to fine-tune (vs prompt engineering)

- Fine-tune when: a consistent style/format is needed, the prompt is too long, or latency matters.
- Prompt first: fine-tuning is expensive and slow to iterate on.

### Data Requirements

- Minimum ~50–100 examples; 500–1000 for reliable improvement.
- Format: JSONL with a `{"messages": [...]}` structure.
- Balance classes for classification tasks.

---

## 8. Cost & Latency Optimization

- **Cache** repeated prompts (Anthropic prompt caching, Redis).
- **Stream** responses for better UX (`stream=True`).
- **Batch** requests when real-time isn't needed.
- Use **smaller models** for classification/routing; larger ones for generation.
- Set **temperature=0** for deterministic/factual tasks.

---

## 9. Safety & Production Checklist

- [ ] API keys in environment variables
- [ ] Rate-limit handling with retry
- [ ] Max token limits set
- [ ] Input validation before sending to the LLM
- [ ] Output validation / parsing with fallback
- [ ] Logging (inputs, outputs, latency, cost)
- [ ] Prompt injection protection for user-facing apps
- [ ] PII scrubbing if handling sensitive data
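
The regression-testing bullets in § Evaluation can be sketched as a small harness. `score_answer` here is a hypothetical stand-in for whatever grader you plug in (exact match, as shown, or an LLM-as-judge call):

```python
import json

def score_answer(expected: str, actual: str) -> float:
    """Hypothetical grader: normalized exact match. Swap in an LLM-as-judge
    call here for free-form answers."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_regression(cases: list[dict], generate) -> float:
    """Run every (input, expected_output) pair through `generate` and
    return the mean score, so prompt changes can be gated on a threshold."""
    scores = [
        score_answer(case["expected_output"], generate(case["input"]))
        for case in cases
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Usage sketch: load cases from a JSON file and fail CI on regression.
# cases = json.load(open("eval_cases.json"))
# assert run_regression(cases, my_model_fn) >= 0.9
```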
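
The "output validation / parsing with fallback" checklist item can be sketched as: try strict JSON first, then fall back to extracting an embedded JSON object (e.g. inside a markdown fence) before giving up:

```python
import json
import re

def parse_json_output(raw: str):
    """Parse an LLM response that should be JSON, tolerating extra prose
    or markdown fences around the object. Returns None if nothing parses."""
    try:
        return json.loads(raw)  # happy path: the model returned bare JSON
    except json.JSONDecodeError:
        pass
    # Fallback: grab the outermost {...} span (covers fenced blocks and preambles)
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None  # caller decides: retry the call, use a default, or raise
```

Returning `None` rather than raising keeps the retry/default decision with the caller, which pairs naturally with the retry pattern in § API Integration.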