Every production LLM application eventually hits the same wall: you've got unstructured text and you need to turn it into structured data your code can actually work with. Customer support tickets need to become categorized records. Invoices need to become line items in a database. Research papers need to become searchable metadata.
And if you've ever tried doing this by parsing raw LLM output with regex or hand-rolled JSON cleanup, you know how brittle that gets at scale. I've been there — it works fine until it doesn't, usually on the weirdest edge case at the worst possible time.
Instructor solves this problem. It's a Python library — with over 3 million monthly downloads as of early 2026 — that bridges LLM output and Pydantic models. You define a schema, Instructor sends it to the LLM, validates the response, and retries automatically if validation fails. No JSON parsing code, no retry loops, no string manipulation. Just define what you want and get it back as a typed Python object.
This guide walks you through building production-grade data extraction pipelines with Instructor and Pydantic. We'll cover everything from basic schema design through streaming extraction, batch processing, multimodal document handling, and the operational patterns that keep these pipelines reliable at scale. Every example uses Instructor's current API as of April 2026.
Why Instructor for Data Extraction
Before Instructor, the typical approach to extracting structured data from an LLM involved writing a prompt that asked for JSON output, parsing the response string, handling malformed JSON with repair libraries, wrapping everything in retry logic, and then still dealing with occasional type mismatches or missing fields in production. That stack of workarounds was the norm in 2023 and early 2024.
It worked — until it didn't, usually at 3 AM when a slightly different input format caused a cascade of parsing failures.
Instructor collapses that entire stack into a single function call. Under the hood, it converts your Pydantic model into a JSON schema, sends that schema to the LLM as structured output instructions, validates the response against your model, and retries with the validation error message if the response is invalid. The LLM sees exactly what went wrong and corrects itself.
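You can see the first half of that pipeline with plain Pydantic, no LLM required — model_json_schema() produces the JSON schema that Instructor derives from your model (the exact structured-output wiring on top of it varies by provider; the model here is just illustrative):

```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(ge=0, description="Age in years")

# The schema Instructor generates from your model and sends to the LLM.
# Field descriptions and constraints like ge=0 are carried through.
schema = Person.model_json_schema()
print(schema["properties"]["age"])
# e.g. {'description': 'Age in years', 'minimum': 0, 'title': 'Age', 'type': 'integer'}
```

Everything the LLM knows about your expected output comes from this schema, which is why field descriptions matter so much for accuracy.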
Here's why I'd pick Instructor over other approaches:
- Single responsibility — Unlike LangChain or LlamaIndex, Instructor does one thing: structured extraction. That means a smaller dependency footprint, fewer abstractions to learn, and simpler debugging when something goes wrong.
- Provider agnostic — The same code works with OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Ollama, DeepSeek, and over 15 other providers through a unified from_provider API.
- Pydantic native — If you already use Pydantic in your FastAPI or Django application, your existing models work directly with Instructor. No translation layer needed.
- Production tested — Organizations like the London Stock Exchange Group run Instructor in production for mission-critical data extraction in financial applications.
Installation and Setup
Install Instructor with your preferred LLM provider:
# Core package
pip install instructor
# With specific provider extras
pip install "instructor[anthropic]"
pip install "instructor[google-generativeai]"
pip install "instructor[cohere]"
# Or install everything
pip install "instructor[all]"
Set your API key as an environment variable. Instructor reads standard provider environment variables by convention:
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export GOOGLE_API_KEY="..."
Pretty straightforward. No config files, no initialization ceremony — just install and go.
Basic Extraction: Defining Your First Schema
The core pattern is always the same: define a Pydantic model, create an Instructor client, and call create. Here's a practical example that extracts product information from unstructured text:
import instructor
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum
class ProductCategory(str, Enum):
ELECTRONICS = "electronics"
CLOTHING = "clothing"
FOOD = "food"
SOFTWARE = "software"
OTHER = "other"
class ProductInfo(BaseModel):
name: str = Field(description="Product name as mentioned in the text")
price: float = Field(ge=0, description="Price in USD")
category: ProductCategory
in_stock: bool = Field(description="Whether the product is available")
description: Optional[str] = Field(
None, max_length=200, description="Brief product description"
)
client = instructor.from_provider("openai/gpt-4o-mini")
product = client.chat.completions.create(
response_model=ProductInfo,
messages=[
{
"role": "user",
"content": (
"We just got the new Sony WH-1000XM6 headphones in stock. "
"Premium noise cancelling, 40-hour battery life. "
"Retail price is $349.99."
),
}
],
)
print(product.name) # Sony WH-1000XM6
print(product.price) # 349.99
print(product.category) # ProductCategory.ELECTRONICS
print(product.in_stock) # True
Notice what you didn't write: no JSON parsing, no output format instructions in the prompt, no type casting, no error handling for malformed responses. Instructor handled all of it. The Field descriptions become part of the schema that the LLM sees, guiding extraction accuracy.
Honestly, the first time I saw this work, it felt almost too simple. But that's kind of the point.
Schema Design Tips That Improve Accuracy
The quality of your extraction is directly tied to how well you design your Pydantic models. Here are a few patterns that consistently improve results:
- Use enums for categorical fields — Instead of category: str, define an enum. This constrains the LLM to valid values and eliminates inconsistencies like "Electronics" vs "electronic" vs "tech".
- Add Field descriptions — Pydantic Field(description=...) annotations become part of the JSON schema the LLM receives. Clear descriptions dramatically improve extraction accuracy for ambiguous fields.
- Use validators for business rules — Pydantic validators let you enforce domain-specific constraints that the LLM can't know about, such as price ranges, date formats, or cross-field dependencies.
- Keep schemas focused — Extract one document type per schema. A single "extract everything" schema leads to lower accuracy than multiple focused schemas chained together.
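To make the validator tip concrete, here's a minimal sketch (the model and the 40% cap are illustrative, not from any real policy) of a business rule the LLM can't infer from the text, enforced at validation time:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class Discount(BaseModel):
    code: str
    percent: float = Field(ge=0, le=100)

    @field_validator("percent")
    @classmethod
    def enforce_policy_cap(cls, v: float) -> float:
        # Business rule: discounts above 40% require manual approval,
        # so the extraction schema rejects them outright.
        if v > 40:
            raise ValueError("discounts above 40% must go through manual approval")
        return v

Discount(code="SPRING10", percent=10)    # valid
try:
    Discount(code="MEGA90", percent=90)  # fails validation
except ValidationError as e:
    print(e.errors()[0]["msg"])
```

When a model like this is used as an Instructor response_model, that error message is exactly what gets fed back to the LLM on retry — so write validator messages the way you'd write a correction for a human.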
Nested Schemas and Complex Structures
Real extraction tasks rarely produce flat data. Instructor handles arbitrarily nested Pydantic models, so you can model complex document structures naturally:
from pydantic import BaseModel, Field
from typing import List, Optional
import instructor
class LineItem(BaseModel):
description: str
quantity: int = Field(ge=1)
unit_price: float = Field(ge=0)
total: float = Field(ge=0)
class Address(BaseModel):
street: str
city: str
state: Optional[str] = None
postal_code: str
country: str = Field(default="US")
class Invoice(BaseModel):
invoice_number: str
date: str = Field(description="Invoice date in YYYY-MM-DD format")
vendor: str
billing_address: Address
line_items: List[LineItem]
subtotal: float
tax: float = Field(ge=0)
total: float
@property
def computed_total(self) -> float:
return self.subtotal + self.tax
client = instructor.from_provider("openai/gpt-4o")
invoice = client.chat.completions.create(
response_model=Invoice,
messages=[
{
"role": "system",
"content": "Extract invoice details from the provided text.",
},
{
"role": "user",
"content": """
Invoice #INV-2026-0847
Date: March 15, 2026
From: Acme Cloud Services
Bill To:
742 Innovation Drive, Suite 300
Austin, TX 78701
Items:
- GPU Compute (24hrs): 3 units x $42.00 = $126.00
- Storage (1TB/mo): 1 unit x $23.50 = $23.50
- API Calls (1M): 2 units x $8.00 = $16.00
Subtotal: $165.50
Tax (8.25%): $13.65
Total: $179.15
""",
},
],
)
print(f"Invoice {invoice.invoice_number}: ${invoice.total}")
for item in invoice.line_items:
print(f" {item.description}: {item.quantity} x ${item.unit_price}")
The nested Address and List[LineItem] fields are extracted correctly because the JSON schema Instructor generates preserves the full type hierarchy. The LLM understands what goes where — even when the source document uses a totally different layout from what you might expect.
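Since the Invoice model already exposes computed_total, you can go one step further and reject arithmetically inconsistent extractions outright. A sketch with a cross-field validator (the small tolerance is an assumption, there to absorb rounding in the source document):

```python
from pydantic import BaseModel, model_validator

class InvoiceTotals(BaseModel):
    subtotal: float
    tax: float
    total: float

    @model_validator(mode="after")
    def check_arithmetic(self) -> "InvoiceTotals":
        # Reject extractions where the parts don't add up to the whole.
        if abs((self.subtotal + self.tax) - self.total) > 0.01:
            raise ValueError(
                f"subtotal ({self.subtotal}) + tax ({self.tax}) "
                f"does not equal total ({self.total})"
            )
        return self

# The figures from the invoice above are internally consistent:
InvoiceTotals(subtotal=165.50, tax=13.65, total=179.15)
```

If the LLM misreads one of the three numbers, this validator fails, and Instructor's retry loop sends the mismatch back for correction.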
Automatic Validation and Retry Logic
This is where Instructor really earns its keep in production. When the LLM returns data that fails Pydantic validation — a negative price, a string where an integer was expected, a value outside your enum — Instructor sends the validation error back to the LLM as a correction prompt and retries. This self-correction loop is configurable:
import instructor
from pydantic import BaseModel, Field, field_validator
class TransactionRecord(BaseModel):
merchant: str
amount: float = Field(gt=0, description="Transaction amount, must be positive")
currency: str = Field(pattern=r"^[A-Z]{3}$", description="ISO 4217 currency code")
date: str = Field(description="Date in YYYY-MM-DD format")
@field_validator("date")
@classmethod
def validate_date_format(cls, v: str) -> str:
from datetime import datetime
try:
datetime.strptime(v, "%Y-%m-%d")
except ValueError:
raise ValueError(f"Date must be YYYY-MM-DD format, got: {v}")
return v
client = instructor.from_provider("openai/gpt-4o-mini")
# max_retries controls how many correction attempts Instructor makes
record = client.chat.completions.create(
response_model=TransactionRecord,
max_retries=3,
messages=[
{
"role": "user",
"content": "Coffee purchase at Blue Bottle, twelve dollars and fifty cents, March 8th 2026, paid in US dollars",
}
],
)
print(record)
# TransactionRecord(merchant='Blue Bottle', amount=12.5, currency='USD', date='2026-03-08')
If the first attempt returns "date": "March 8th, 2026" (wrong format), the validator raises an error, Instructor feeds the error back to the LLM, and the second attempt returns "date": "2026-03-08". You get the correct output without writing any retry logic yourself.
For more granular retry control, Instructor integrates with Tenacity (the standard Python retry library). You can configure exponential backoff, custom wait strategies, and retry conditions using the same patterns you'd use anywhere else in your codebase.
Streaming Partial Responses
When extracting data from long documents, waiting for the complete response before showing anything to the user creates a pretty terrible experience. Instructor supports streaming partial responses — you get incremental updates as the LLM generates each field:
import instructor
from pydantic import BaseModel, Field
from typing import List, Optional
class ArticleSummary(BaseModel):
title: str
author: Optional[str] = None
key_findings: List[str] = Field(description="Main findings or conclusions")
methodology: Optional[str] = None
relevance_score: float = Field(
ge=0, le=1, description="Relevance to our research area"
)
client = instructor.from_provider("openai/gpt-4o-mini")
# create_partial returns a generator that yields partial objects
for partial_summary in client.chat.completions.create_partial(
response_model=ArticleSummary,
messages=[
{
"role": "user",
"content": "Summarize this research paper: [long paper text here]",
}
],
):
# Each yield has whatever fields are available so far
if partial_summary.title:
print(f"Title: {partial_summary.title}")
if partial_summary.key_findings:
print(f"Findings so far: {len(partial_summary.key_findings)}")
# Update your UI progressively
This is particularly valuable for real-time applications where you need to render extraction results in a UI as they arrive. Each yielded object is a valid ArticleSummary instance with whatever fields have been populated so far — no partial JSON to parse or incomplete objects to deal with.
Multi-Provider Extraction
One of Instructor's strongest production features is its unified API across LLM providers. The same Pydantic schema and extraction logic works identically regardless of which model you're calling. This enables provider fallback chains and model-specific routing without duplicating your extraction code:
import instructor
from pydantic import BaseModel, Field
from typing import List
class SupportTicket(BaseModel):
category: str = Field(description="One of: billing, technical, account, other")
priority: str = Field(description="One of: low, medium, high, critical")
summary: str = Field(max_length=150)
action_items: List[str]
# Same schema, different providers - just change the provider string
def extract_ticket(text: str, provider: str = "openai/gpt-4o-mini") -> SupportTicket:
client = instructor.from_provider(provider)
return client.chat.completions.create(
response_model=SupportTicket,
messages=[
{"role": "system", "content": "Extract support ticket details."},
{"role": "user", "content": text},
],
)
# Production: use the cheaper model by default
ticket = extract_ticket(customer_email, provider="openai/gpt-4o-mini")
# Fallback to a stronger model if validation fails
# or for complex cases that need more reasoning
ticket = extract_ticket(customer_email, provider="anthropic/claude-sonnet-4-20250514")
# Use a local model for sensitive data that cannot leave your network
ticket = extract_ticket(customer_email, provider="ollama/llama3")
This provider flexibility is critical for production systems. You can route simple extractions to cheaper, faster models and reserve the expensive ones for complex or ambiguous inputs. You can also keep sensitive data on-premises by routing to local models through Ollama — without changing a single line of extraction logic.
That last point is a bigger deal than it sounds. Vendor lock-in with LLM providers is a real concern, and having a clean abstraction layer means you can switch providers (or mix them) without rewriting your pipeline.
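A minimal fallback chain built on the extract_ticket helper above. The chain logic itself is generic — it takes any callable with a (text, provider) signature, so you can exercise it without touching an LLM:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def extract_with_fallback(
    extract: Callable[[str, str], T],
    text: str,
    providers: List[str],
) -> T:
    """Try each provider in order; return the first successful extraction."""
    last_error = None
    for provider in providers:
        try:
            return extract(text, provider)
        except Exception as e:  # validation exhausted retries, rate limit, outage, ...
            last_error = e
    raise RuntimeError(f"all providers failed, last error: {last_error}")

# Usage with the guide's helper:
# ticket = extract_with_fallback(
#     extract_ticket,
#     customer_email,
#     ["openai/gpt-4o-mini", "anthropic/claude-sonnet-4-20250514"],
# )
```

Order the list cheapest-first: the expensive model only gets called when the cheap one actually fails.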
Extracting Data from Documents and Images
Modern LLMs handle multimodal input natively, and Instructor supports this for extraction workflows. You can extract structured data directly from PDFs, images, and scanned documents by passing them as multimodal messages:
import instructor
import base64
from pydantic import BaseModel, Field
from typing import List, Optional
class ReceiptItem(BaseModel):
name: str
quantity: int = Field(ge=1, default=1)
price: float = Field(ge=0)
class Receipt(BaseModel):
store_name: str
date: Optional[str] = Field(None, description="Date in YYYY-MM-DD format")
items: List[ReceiptItem]
total: float = Field(ge=0)
payment_method: Optional[str] = None
def extract_from_image(image_path: str) -> Receipt:
with open(image_path, "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
client = instructor.from_provider("openai/gpt-4o")
return client.chat.completions.create(
response_model=Receipt,
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract all receipt details from this image.",
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_data}"
},
},
],
}
],
)
receipt = extract_from_image("receipt_photo.jpg")
print(f"Store: {receipt.store_name}")
print(f"Total: ${receipt.total}")
for item in receipt.items:
print(f" {item.name}: ${item.price}")
For PDF extraction, you've got two approaches. If the PDF is text-based, extract the text first with a library like PyMuPDF and pass it as a text message. If the PDF contains scanned images or complex layouts, use a multimodal model that supports PDF input directly — Google Gemini handles this particularly well through Instructor's provider integration.
Batch Processing at Scale
Production extraction pipelines rarely process one document at a time. When you need to extract from hundreds or thousands of documents, combine Instructor with Python's async capabilities and batching patterns:
import instructor
import asyncio
from pydantic import BaseModel, Field
from typing import List
class JobPosting(BaseModel):
title: str
company: str
location: str
salary_min: float | None = None
salary_max: float | None = None
required_skills: List[str]
experience_years: int = Field(ge=0)
async def extract_job(client, text: str) -> JobPosting:
return await client.chat.completions.create(
response_model=JobPosting,
messages=[
{"role": "system", "content": "Extract job posting details."},
{"role": "user", "content": text},
],
)
async def process_batch(job_texts: List[str], concurrency: int = 10):
    client = instructor.from_provider("openai/gpt-4o-mini", async_client=True)
semaphore = asyncio.Semaphore(concurrency)
async def bounded_extract(text: str) -> JobPosting | None:
async with semaphore:
try:
return await extract_job(client, text)
except Exception as e:
print(f"Extraction failed: {e}")
return None
tasks = [bounded_extract(text) for text in job_texts]
results = await asyncio.gather(*tasks)
return [r for r in results if r is not None]
# Process 500 job postings with max 10 concurrent LLM calls
job_texts = [...] # your list of raw job posting texts
results = asyncio.run(process_batch(job_texts, concurrency=10))
print(f"Extracted {len(results)} job postings")
The semaphore limits concurrent API calls to avoid rate limiting. In production, you'd also want to add exponential backoff on rate limit errors, write results to a database incrementally rather than holding everything in memory, and log extraction failures for manual review.
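The backoff piece can be sketched as a small async wrapper (names are illustrative; in real code, narrow the bare Exception to your provider's rate-limit error class):

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_backoff(
    fn: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> T:
    """Retry an async call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception:  # narrow to e.g. openai.RateLimitError in production
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; the random term de-synchronizes
            # workers that all got rate limited at the same moment.
            await asyncio.sleep(base_delay * (2 ** attempt + random.random()))
    raise RuntimeError("unreachable")

# Inside bounded_extract you'd wrap the call:
#     return await with_backoff(lambda: extract_job(client, text))
```

Layered under the semaphore, this turns a burst of 429s into a brief slowdown instead of a pile of failed documents.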
Instructor vs PydanticAI vs Native APIs
This is the question that comes up most often, so let's just be direct about it.
Use Instructor when your primary task is extraction — turning unstructured text into structured data. It's the lightest option, has the smallest learning curve, and is purpose-built for this use case. If your pipeline is "text in, structured data out" without tool calls or multi-step agent logic, Instructor is the right choice.
Use PydanticAI when you need agents that call tools, maintain conversational state, or orchestrate multi-step workflows. PydanticAI adds dependency injection, tool registration, and observability through Logfire on top of Pydantic's type safety. It's heavier than Instructor but essential when your application goes beyond extraction into autonomous action.
Use native provider APIs when you're locked to a single provider and want zero abstraction overhead. OpenAI's structured output mode with strict: True and Anthropic's structured outputs both guarantee schema compliance through constrained decoding. The tradeoff is that you lose provider portability and the automatic retry-on-validation-failure that Instructor provides.
And here's the thing — these tools aren't mutually exclusive. A common production pattern is using Instructor for the extraction stages and PydanticAI for the agentic stages that act on the extracted data. Your Pydantic models work identically in both.
Production Best Practices
Design Extraction as a Pipeline, Not a Single Call
Resist the urge to extract everything in one LLM call. Complex documents perform better when you break extraction into stages: first classify the document type (use an enum output), then run a type-specific extraction schema. This two-stage approach consistently outperforms single-call extraction because the LLM can focus on one task at a time.
from enum import Enum
from pydantic import BaseModel
import instructor
class DocumentType(str, Enum):
INVOICE = "invoice"
CONTRACT = "contract"
RESUME = "resume"
EMAIL = "email"
class Classification(BaseModel):
doc_type: DocumentType
confidence: float
client = instructor.from_provider("openai/gpt-4o-mini")
# Stage 1: Classify
classification = client.chat.completions.create(
response_model=Classification,
messages=[{"role": "user", "content": document_text}],
)
# Stage 2: Extract with type-specific schema
# (InvoiceSchema, ContractSchema, etc. are your own Pydantic models, defined elsewhere)
schema_map = {
DocumentType.INVOICE: InvoiceSchema,
DocumentType.CONTRACT: ContractSchema,
DocumentType.RESUME: ResumeSchema,
DocumentType.EMAIL: EmailSchema,
}
result = client.chat.completions.create(
response_model=schema_map[classification.doc_type],
messages=[
{"role": "system", "content": f"Extract {classification.doc_type.value} details."},
{"role": "user", "content": document_text},
],
)
Monitor Validation Error Rates
Track how often Instructor needs to retry. A rising validation error rate signals that your schema doesn't match the data your pipeline is receiving. Log every retry with the original input, the failed response, and the validation error — this data is invaluable for debugging accuracy regressions.
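One lightweight way to get those numbers without adopting a full observability stack — a hypothetical wrapper (names are mine, not Instructor API) that counts outcomes and keeps failed inputs alongside their errors:

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

logger = logging.getLogger("extraction")

@dataclass
class ExtractionMetrics:
    success: int = 0
    failure: int = 0
    errors: List[Dict[str, str]] = field(default_factory=list)

    @property
    def failure_rate(self) -> float:
        total = self.success + self.failure
        return self.failure / total if total else 0.0

def monitored(extract: Callable[[str], Any], metrics: ExtractionMetrics):
    """Wrap an extraction callable so every outcome is counted and logged."""
    def wrapper(text: str):
        try:
            result = extract(text)
            metrics.success += 1
            return result
        except Exception as e:
            metrics.failure += 1
            # Keep the input and the error together for later debugging
            metrics.errors.append({"input": text[:200], "error": str(e)})
            logger.warning("extraction failed: %s", e)
            raise
    return wrapper
```

Alert when failure_rate climbs above your baseline — that's usually the first sign the incoming data has drifted away from your schema.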
Use Cheaper Models for Simple Extractions
Not every extraction needs GPT-4o or Claude Opus. Simpler schemas — flat structures with a few fields and clear enums — extract reliably with smaller models like GPT-4o-mini or Claude Haiku. Reserve the expensive models for complex nested schemas, ambiguous inputs, or multimodal document extraction. Your wallet will thank you.
Cache Extraction Results
If you're extracting from documents that don't change, cache the results keyed by a content hash. There's no reason to call the LLM twice for the same invoice. This alone can reduce LLM costs by 30-60% in pipelines that reprocess documents.
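Content-hash caching is a few lines with stdlib tools — a sketch that memoizes any extraction callable on the SHA-256 of its input (swap the dict for Redis or a database table so the cache survives restarts):

```python
import hashlib
from typing import Any, Callable, Dict

def cached_extraction(extract: Callable[[str], Any]) -> Callable[[str], Any]:
    """Memoize extraction results keyed by a content hash of the input."""
    cache: Dict[str, Any] = {}

    def wrapper(text: str) -> Any:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = extract(text)  # only pay for the LLM call once
        return cache[key]

    return wrapper

# Usage: wrapped = cached_extraction(my_llm_extract_fn)
```

Hashing the content rather than a filename means the cache survives renames and deduplicates identical documents submitted through different paths.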
Frequently Asked Questions
What is the difference between Instructor and OpenAI structured outputs?
OpenAI structured outputs use constrained decoding to guarantee schema compliance for a single provider. Instructor wraps multiple providers with the same interface and adds automatic validation retries — if the LLM returns data that fails Pydantic validation, Instructor sends the error back and retries. OpenAI's native feature doesn't do this. If you only use OpenAI and never need custom validators, native structured outputs work fine. For multi-provider pipelines or complex validation logic, Instructor is the better choice.
Can Instructor extract data from PDFs and images?
Yes. Any multimodal LLM that Instructor supports — GPT-4o, Claude, Gemini — can process images and documents. You pass the image or PDF as a multimodal message and define a Pydantic schema for the output, exactly as you would for text extraction. For text-based PDFs, extracting the text first with PyMuPDF and passing it as a text message is faster and cheaper than the multimodal path.
How does Instructor handle LLM hallucination in extraction?
It doesn't prevent hallucination outright, but it limits its impact. Pydantic validators catch structurally invalid hallucinations — wrong types, values outside allowed ranges, enum violations. For semantic hallucinations (where the LLM invents a plausible but incorrect value), you need domain-specific post-processing: cross-referencing extracted values against a database, flagging low-confidence extractions for human review, or running a second LLM call to verify critical fields.
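A sketch of that post-processing layer, assuming you maintain a reference table of known-good values (the merchant set here is illustrative): exact matches pass, and everything else is flagged for human review rather than silently accepted.

```python
from typing import NamedTuple

# Illustrative reference data; in production this would be a database lookup
KNOWN_MERCHANTS = {"Blue Bottle", "Acme Cloud Services", "Sony"}

class ReviewDecision(NamedTuple):
    accepted: bool
    reason: str

def check_merchant(extracted_merchant: str) -> ReviewDecision:
    """Cross-reference an extracted value against known-good reference data."""
    if extracted_merchant in KNOWN_MERCHANTS:
        return ReviewDecision(True, "exact match against merchant table")
    # Plausible-looking but unknown values are exactly where semantic
    # hallucinations hide, so route them to a human review queue.
    return ReviewDecision(False, f"unknown merchant {extracted_merchant!r}; queued for review")

print(check_merchant("Blue Bottle"))
print(check_merchant("Blue Bottel"))  # typo or hallucination gets flagged
```

The same pattern extends naturally to fuzzy matching or a second verification LLM call for the values this filter flags.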
Is Instructor suitable for high-volume production workloads?
Absolutely. Instructor adds minimal overhead on top of the LLM API call — the validation and retry logic is local Python execution, measured in milliseconds. The bottleneck is always the LLM API latency and rate limits, not Instructor itself. For high-volume workloads, use async clients with concurrency limits, batch processing, and result caching as described in this guide. Organizations process millions of extractions per month with Instructor in production.
Should I use Instructor or PydanticAI for my project?
If your task is pure extraction — unstructured text in, structured data out — use Instructor. If you need an agent that calls external tools, maintains conversation state, or makes multi-step decisions, use PydanticAI. Many production systems use both: Instructor for the extraction stages and PydanticAI for the agentic stages. Your Pydantic models work identically in both libraries, so there's no migration cost if your needs evolve.