---
name: ai-llm
description: >
  Use this skill for any AI or LLM-related development task. Triggers include:
  building apps that call LLM APIs (OpenAI, Anthropic, Gemini, etc.), designing
  prompt templates, chaining prompts, building agents or tool-use systems,
  setting up RAG pipelines, embeddings, vector search, LLM evaluation,
  fine-tuning workflows, or any task where Claude is helping build something
  that itself uses a language model. Use this skill even for partial AI tasks
  like "add an AI feature to my app" or "help me write a better system prompt".
  If the user mentions GPT, Claude API, LangChain, LlamaIndex, embeddings, RAG,
  agents, or prompt engineering — always trigger this skill.
---

# AI / LLM Development Skill

This skill guides Claude to produce high-quality, production-grade AI/LLM code, prompts, and architecture — avoiding common pitfalls and following best practices across the full LLM stack.

---

## 1. Identify the Task Type

Before writing anything, classify what the user needs:

| Task | Go To |
|---|---|
| Prompt design / system prompt | § Prompt Engineering |
| API integration (any LLM provider) | § API Integration |
| Agent / tool-use system | § Agents & Tool Use |
| RAG / embeddings / vector search | § RAG Pipeline |
| LLM evaluation / testing | § Evaluation |
| Fine-tuning | § Fine-Tuning |

---

## 2. Prompt Engineering

### Principles

- **Be explicit**: State the role, task, constraints, and output format clearly.
- **Use delimiters**: Separate instructions from content with XML-style tags, triple backticks, or `---`.
- **Output format first**: Specify JSON, markdown, plain text, etc. upfront.
- **Few-shot when needed**: Include 2–3 examples for non-obvious tasks.
- **Chain of thought**: Add "think step by step" for reasoning tasks.

### System Prompt Template

```
You are a [ROLE] that [PRIMARY PURPOSE].

## Task
[Clear description of what the model should do]

## Rules
- [Constraint 1]
- [Constraint 2]

## Output Format
[Exact format, schema, or example]
```

### Common Prompt Patterns

- **Classification**: Include all possible labels and definitions.
- **Extraction**: Provide the JSON schema in the prompt.
- **Summarization**: Specify length, style, and what to preserve.
- **Code generation**: Specify language, libraries, and style guide.

---

## 3. API Integration

### Anthropic (Claude)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # recommended default
    max_tokens=1024,
    system="Your system prompt here",
    messages=[{"role": "user", "content": "User message"}],
)
print(response.content[0].text)
```

### OpenAI

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "System prompt"},
        {"role": "user", "content": "User message"},
    ],
)
print(response.choices[0].message.content)
```

### Best Practices

- Always set `max_tokens` to avoid runaway costs.
- Use **structured outputs** (JSON mode / `response_format`) for parseable responses.
- Implement **retry logic** with exponential backoff for rate limits.
- Store API keys in environment variables; never hardcode them.
- Log token usage for cost tracking.

```python
# Retry with exponential backoff plus jitter
import random
import time

from anthropic import RateLimitError  # or the equivalent from your provider's SDK

def call_with_retry(fn, retries=3):
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s + jitter
    raise Exception("Max retries exceeded")
```

---

## 4. Agents & Tool Use

### When to use agents

- Multi-step tasks that require dynamic decisions
- Tasks that need external data (search, DB, APIs)
- Tasks where the number of steps isn't known upfront

### Tool Definition Pattern (Anthropic)

```python
tools = [
    {
        "name": "search_web",
        "description": "Search the web for current information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "..."}],
)
```

### Agent Loop Pattern

```python
messages = [{"role": "user", "content": user_input}]

while True:
    response = client.messages.create(
        model=MODEL, max_tokens=1024, tools=tools, messages=messages
    )
    if response.stop_reason != "tool_use":
        break  # model finished its turn; no tools left to run

    # Execute every tool call in the response and feed the results back
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)
            tool_results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": result}
            )

    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
```

---

## 5. RAG Pipeline

### Architecture

```
Documents → Chunking → Embedding → Vector Store
                                        ↓
User Query → Embed Query → Similarity Search → Top-K Chunks
                                        ↓
                         LLM (chunks + query) → Answer
```

### Chunking Strategy

- **Fixed size**: 512–1024 tokens with 10–20% overlap — good default.
- **Semantic**: Split on paragraphs/sections — better for structured docs.
- **Recursive**: LangChain's `RecursiveCharacterTextSplitter` — best general choice.
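
A minimal sketch of the fixed-size strategy above, using characters as a stand-in for tokens (the `chunk_size` and `overlap` values are illustrative; a real pipeline would count tokens with the model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks, with `overlap` characters shared
    between neighboring chunks so context isn't cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(text), step):
        chunk = text[start : start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```

The same shape works for token-based chunking: replace string slicing with slicing over a token list and decode each slice back to text.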
### Embedding Models

| Provider | Model | Dims | Use Case |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | General, cost-effective |
| OpenAI | text-embedding-3-large | 3072 | High accuracy |
| Cohere | embed-v3 | 1024 | Multilingual |
| Local | nomic-embed-text | 768 | Privacy / offline |

### Vector Stores

- **Pinecone** — managed, production-ready
- **Weaviate** — open-source, hybrid search
- **pgvector** — if already using Postgres
- **ChromaDB** — local dev / prototyping
- **FAISS** — in-memory, no server needed

### Retrieval Tips

- Use **hybrid search** (vector + keyword) for better recall.
- Add **metadata filters** to narrow the search space.
- Re-rank results with a cross-encoder for precision.
- Always include source attribution in the prompt.

---

## 6. Evaluation

### What to measure

| Metric | Tool |
|---|---|
| Correctness (vs ground truth) | LLM-as-judge, RAGAS |
| Faithfulness (RAG) | RAGAS |
| Latency / cost | Custom logging |
| Hallucination rate | LLM-as-judge |

### LLM-as-Judge Pattern

```python
# Doubled braces survive .format(); single-brace fields get filled in.
JUDGE_PROMPT = """
You are an evaluator. Given a question, a reference answer, and a model answer,
rate the model answer on a scale of 1–5 for correctness.
Return JSON: {{"score": int, "reason": str}}

Question: {question}
Reference: {reference}
Model Answer: {answer}
"""
```

### Regression Testing

- Store (input, expected_output) pairs in a JSON file.
- Run evals on every prompt change.
- Track score trends over time.

---

## 7. Fine-Tuning

### When to fine-tune (vs prompt engineering)

- Fine-tune when: a consistent style/format is needed, the prompt is too long, or latency matters.
- Prompt first: fine-tuning is expensive and slow to iterate on.

### Data Requirements

- Minimum ~50–100 examples; 500–1000 for reliable improvement.
- Format: JSONL with a `{"messages": [...]}` structure.
- Balance classes for classification tasks.

---

## 8. Cost & Latency Optimization

- **Cache** repeated prompts (Anthropic prompt caching, Redis).
- **Stream** responses for better UX (`stream=True`).
- **Batch** requests when real-time isn't needed.
- Use **smaller models** for classification/routing; larger ones for generation.
- Set **temperature=0** for deterministic/factual tasks.

---

## 9. Safety & Production Checklist

- [ ] API keys in environment variables
- [ ] Rate-limit handling with retry
- [ ] Max token limits set
- [ ] Input validation before sending to the LLM
- [ ] Output validation / parsing with fallback
- [ ] Logging (inputs, outputs, latency, cost)
- [ ] Prompt injection protection for user-facing apps
- [ ] PII scrubbing if handling sensitive data
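
The regression-testing bullets in § Evaluation can be sketched as a small harness. `score_answer` here is a hypothetical stand-in for whatever grader you plug in (exact match, as shown, or an LLM-as-judge call):

```python
import json

def score_answer(expected: str, actual: str) -> float:
    """Hypothetical grader: normalized exact match. Swap in an LLM-as-judge
    call here for free-form answers."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_regression(cases: list[dict], generate) -> float:
    """Run every (input, expected_output) pair through `generate` and
    return the mean score, so prompt changes can be gated on a threshold."""
    scores = [
        score_answer(case["expected_output"], generate(case["input"]))
        for case in cases
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Usage sketch: load cases from a JSON file and fail CI on regression.
# cases = json.load(open("eval_cases.json"))
# assert run_regression(cases, my_model_fn) >= 0.9
```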
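
The "output validation / parsing with fallback" checklist item can be sketched as: try strict JSON first, then fall back to extracting an embedded JSON object (e.g. inside a markdown fence) before giving up:

```python
import json
import re

def parse_json_output(raw: str):
    """Parse an LLM response that should be JSON, tolerating extra prose
    or markdown fences around the object. Returns None if nothing parses."""
    try:
        return json.loads(raw)  # happy path: the model returned bare JSON
    except json.JSONDecodeError:
        pass
    # Fallback: grab the outermost {...} span (covers fenced blocks and preambles)
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None  # caller decides: retry the call, use a default, or raise
```

Returning `None` rather than raising keeps the retry/default decision with the caller, which pairs naturally with the retry pattern in § API Integration.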