Few-Shot Prompting: Everything You Need to Know in 2026

Posted In

Miscellaneous

Posted On

February 26, 2026


Few-shot prompting is a prompt engineering technique where you provide a language model with a small number of input-output examples directly in your prompt to guide its behavior on a new task.

Instead of retraining the model or writing elaborate instructions, you show it what you want through 2 to 5 demonstrations, and it infers the pattern and applies it to your input.

The technique was first formalized in the GPT-3 paper by Brown et al. (2020), titled "Language Models are Few-Shot Learners," which showed that scaling model parameters unlocked the ability to learn new tasks from just a handful of examples at inference time.

This guide covers everything practitioners need to know about few-shot prompting: how it compares to zero-shot and one-shot approaches, concrete examples across four domains you can copy and adapt, research-backed guidance on how many examples to use, whether order matters, and when to skip few-shot entirely.

TL;DR:

  • Few-shot prompting provides 2 to 5 examples in a prompt to teach a model a task through pattern recognition, with no training or fine-tuning required

  • Research shows strong accuracy gains from 1 to 2 examples, with diminishing returns beyond 4 to 5 (Brown et al., 2020)

  • Token costs increase linearly with each example while accuracy gains flatten, so example quality matters more than quantity

  • Combining few-shot examples with chain-of-thought reasoning (showing intermediate steps) dramatically improves performance on complex tasks (Wei et al., 2022)

  • For production systems, adaptive example selection based on input similarity outperforms static example sets by reducing irrelevant context


|                    | Zero-Shot               | One-Shot            | Few-Shot                           |
|--------------------|-------------------------|---------------------|------------------------------------|
| Examples in prompt | 0                       | 1                   | 2 to 5+                            |
| Token usage        | Lowest                  | Low                 | Moderate                           |
| Best for           | Simple, common tasks    | Clarifying format   | Complex or domain-specific tasks   |
| Output consistency | Variable                | Better              | Most consistent                    |
| When to use        | Task is straightforward | Need light guidance | Need precision or specific format  |

What Is Few-Shot Prompting?

Few-shot prompting is a form of in-context learning. You insert a small set of labeled examples (the "shots") into your prompt so the model can infer the task pattern and apply it to new inputs, all without any weight updates or model retraining.

The term "shot" simply means one example. A 3-shot prompt contains three demonstrations. The model reads these demonstrations, recognizes the input-output pattern, and generates a response that follows the same structure for your new input.

This approach sits between two extremes. On one end, zero-shot prompting relies entirely on the model's pre-trained knowledge with no examples. On the other, fine-tuning permanently updates model weights using a large labeled dataset. Few-shot prompting gives you more control than zero-shot without the cost and complexity of fine-tuning. You can switch between entirely different tasks just by swapping the examples in your prompt.

The underlying mechanism is pattern completion. Large language models are next-token prediction engines. When you provide examples in a consistent format, you shift the probability distribution of the model's output toward completions that match your demonstrated pattern. The model is not "learning" in the training sense. It is recognizing the structure you have laid out and completing the sequence accordingly.

Zero-Shot vs One-Shot vs Few-Shot Prompting

These three techniques form a spectrum of how much guidance you give the model before asking it to perform.

Zero-shot prompting provides no examples. You describe the task and the model relies on its pre-trained knowledge to respond.

Classify the sentiment of the following text as positive, negative, or neutral.
Text: I think the vacation was okay.
Sentiment:

Output: Neutral

This works for simple, well-understood tasks. But for anything ambiguous, specialized, or format-sensitive, the model is guessing at what you want.

One-shot prompting provides a single example before the actual task. One demonstration is often enough to clarify the expected format.

Text: The product is terrible.
Sentiment: Negative
Text: I think the vacation was okay.
Sentiment:

Output: Neutral

One example clarifies format and task type, but a single demonstration may not cover the range of possible outputs. The model has only one data point to infer your expectations from.

Few-shot prompting provides two or more examples. Multiple demonstrations establish a clear pattern the model can generalize from.

Text: The product is terrible.
Sentiment: Negative
Text: Super helpful, worth it!
Sentiment: Positive
Text: I think the vacation was okay.
Sentiment:

Output: Neutral

Three things happened in that prompt beyond just "adding examples." 

  • First, the model learned the task (sentiment classification). 

  • Second, it learned the output format (single word, capitalized). 

  • Third, it saw the boundaries of the label space (Positive, Negative, and implicitly Neutral). 

This multi-signal conditioning is what makes few-shot prompting more reliable than zero-shot for structured tasks.

How Few-Shot Prompting Works

The mechanism behind few-shot prompting is in-context learning (ICL). Brown et al. (2020) demonstrated in the GPT-3 paper that sufficiently large language models can learn tasks from examples placed in the context window without any gradient updates to the model's parameters.

The process works in four steps:

  1. You compose a prompt containing your examples, each as an input-output pair

  2. You append your new input at the end, following the same format

  3. The model processes the entire sequence and identifies the pattern across your demonstrations

  4. It generates an output that follows the inferred pattern for your new input

Format consistency is what makes this work reliably. If your examples use the structure Text: [input] / Sentiment: [output], the model expects that same structure for your new input and will complete it accordingly. Break the pattern and the model has less signal to work with.
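
To make the four steps concrete, here is a minimal sketch in Python, assuming the OpenAI SDK (the client setup and model name are illustrative; any chat-completion API follows the same pattern):

# Minimal sketch: assemble a few-shot prompt and complete it.
# Assumes the OpenAI Python SDK v1+ and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Step 1: examples as input-output pairs in one consistent format
examples = [
    ("The product is terrible.", "Negative"),
    ("Super helpful, worth it!", "Positive"),
]

# Step 2: append the new input in the same format, leaving the output blank
new_input = "I think the vacation was okay."
prompt = "".join(f"Text: {text}\nSentiment: {label}\n" for text, label in examples)
prompt += f"Text: {new_input}\nSentiment:"

# Steps 3 and 4: the model reads the whole sequence and completes the pattern
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    max_tokens=5,
)
print(response.choices[0].message.content.strip())  # expected: Neutral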

An important finding from Min et al. (2022) in "Rethinking the Role of Demonstrations" is that the label space and input distribution shown in your examples matter more than whether individual labels are correct. Even with randomly assigned labels, providing examples that show the right format and the right distribution of inputs still outperforms zero-shot prompting. This suggests that few-shot examples act more as structural conditioning than as a miniature training set.

Few-Shot Prompting Examples

Concrete examples are where few-shot prompting clicks. Below are four practical demonstrations across different domains, each designed so you can adapt the pattern to your own use case.

Sentiment Classification

This is the standard starting point. Three examples establish the task, the label set, and the output format in one prompt.

Great product, 10/10 // Positive
Didn't work very well // Negative
It's fine, nothing special // Neutral
I LOVED this movie // 

Output: Positive

Notice the format: [text] // [label]. The model picks up that the output should be a single capitalized word on the same line after //. You did not need to write instructions explaining this. The examples demonstrated it.

Content Generation with Tone Matching

Few-shot prompting is particularly effective for controlling writing style, something that zero-shot instructions struggle with.

Product: Wireless earbuds
Description: Slip these in and the world fades to just you and the music. 12 hours of battery. Zero tangled cords. Your commute just got an upgrade.
Product: Standing desk
Description: Your back called. It wants a standing desk. Adjustable height, solid build, and enough space for two monitors and your coffee habit.
Product: Travel backpack
Description:

Output: Pack light, travel far. This 40L backpack fits carry-on requirements, has a padded laptop sleeve, and enough pockets to keep your passport exactly where you left it.

The model picked up the brand voice (casual, second-person, benefit-focused) from two examples. Trying to describe that voice in zero-shot instructions would take a paragraph and still produce less consistent results.

Code Generation

Here is where the difference between zero-shot and few-shot becomes measurable. Compare the two approaches for writing a Python function.

Zero-shot prompt:

Write a Python function to calculate the factorial of a number.

Typical output: a basic recursive function with no input validation.

Few-shot prompt:

Example 1:
def add(a, b):
    """Add two numbers and return the result."""
    if not (isinstance(a, (int, float)) and isinstance(b, (int, float))):
        raise ValueError("Both inputs must be numbers.")
    return a + b


Example 2:
def subtract(a, b):
    """Subtract second number from first and return result."""
    if not (isinstance(a, (int, float)) and isinstance(b, (int, float))):
        raise ValueError("Both inputs must be numbers.")
    return a - b

Now write a function to calculate the factorial of a number.

The few-shot output includes input validation (isinstance checks), a docstring, and an iterative approach matching the demonstrated coding style. The model learned the team's conventions from two examples: always validate inputs, always include docstrings, always use explicit error types.
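
For reference, a completion in the demonstrated style might look like this (a plausible output, not a guaranteed one):

def factorial(n):
    """Calculate the factorial of a non-negative integer and return the result."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("Input must be a non-negative integer.")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result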

Structured Data Extraction

This is one of the most practical applications of few-shot prompting: enforcing a specific output structure that would be difficult to describe with instructions alone.

INPUT: Sarah Chen joined Anthropic as a research scientist in March 2024.
OUTPUT: {"name": "Sarah Chen", "company": "Anthropic", "role": "Research Scientist", "date": "March 2024"}
INPUT: Mark left his position at Google DeepMind last quarter.
OUTPUT: {"name": "Mark", "company": "Google DeepMind", "role": null, "date": "last quarter"}
INPUT: Dr. James Wright was recently appointed CTO of Stripe.
OUTPUT:

Output: {"name": "Dr. James Wright", "company": "Stripe", "role": "CTO", "date": "recently"}

The second example is doing important work here. By showing "role": null when the role is not mentioned (Mark "left" rather than "joined as"), you taught the model how to handle missing fields. That kind of edge case handling is nearly impossible to get right with instructions alone.
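
In a pipeline, you would typically parse and validate that output before using it downstream. A minimal sketch, assuming the model's reply is stored in a raw_output string:

import json

# Hypothetical raw reply from the extraction prompt above
raw_output = '{"name": "Dr. James Wright", "company": "Stripe", "role": "CTO", "date": "recently"}'

REQUIRED_KEYS = {"name", "company", "role", "date"}

record = json.loads(raw_output)  # raises an error if the model broke the JSON format
missing = REQUIRED_KEYS - record.keys()
if missing:
    raise ValueError(f"Model output missing keys: {missing}")

# JSON null becomes Python None, matching the "role": null convention
# demonstrated in the second example
role = record["role"] if record["role"] is not None else "unknown"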

Few-Shot Prompting with Chain-of-Thought

Standard few-shot prompting shows input-output pairs. Chain-of-thought (CoT) few-shot prompting adds the reasoning process between input and output. Wei et al. (2022) introduced this in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" and showed that it dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks.

The core idea: instead of just showing the answer, show how you arrived at it.

Q: "apple, banana"
A: The last letter of "apple" is "e". The last letter of "banana" is "a". 
   Concatenating them gives "ea". Output: ea
Q: "table, chair, lamp"
A: The last letter of "table" is "e". The last letter of "chair" is "r". 
   The last letter of "lamp" is "p". Concatenating them gives "erp". Output: erp
Q: "cloud, river, stone"
A:

Without the intermediate reasoning steps, models frequently get this wrong. With CoT examples, accuracy jumps because the model generates the intermediate steps first, effectively giving itself a scratchpad to work through the problem before committing to a final answer.

This is directly relevant if you are working on tasks that require multi-step reasoning: math problems, logical deductions, code debugging, or any task where showing your work improves the answer.
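
In code, you typically let the model emit its full reasoning but parse out only the final answer. A minimal sketch, assuming completions end with an "Output:" marker as in the examples above:

import re

# Hypothetical CoT completion for the "cloud, river, stone" query above
completion = (
    'The last letter of "cloud" is "d". The last letter of "river" is "r". '
    'The last letter of "stone" is "e". Concatenating them gives "dre". Output: dre'
)

# Keep the reasoning for logging, but extract only the final answer
match = re.search(r"Output:\s*(\S+)", completion)
answer = match.group(1) if match else None
print(answer)  # -> dre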

How Many Examples Should You Include?

Research consistently shows a pattern: large gains from zero to two examples, then diminishing returns.

Brown et al. (2020) demonstrated this in the GPT-3 paper, where performance on most benchmarks improved sharply with the first few demonstrations and then plateaued. Two to five examples is the practical sweet spot for most tasks.

The token cost tradeoff explains why more is not always better. Each example adds tokens to your prompt. Token costs scale linearly with the number of examples, but accuracy gains flatten after the first few. At production scale, this creates a real cost problem. A prompt with 10 examples costs roughly 5x more than a prompt with 2 examples, but it will not be 5x more accurate.
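
You can measure this tradeoff directly. A quick sketch using tiktoken, OpenAI's tokenizer library (exact counts vary by model and tokenizer):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

example = "Text: The product is terrible.\nSentiment: Negative\n"
per_example = len(encoding.encode(example))

for n in (2, 5, 10):
    print(f"{n} examples add roughly {n * per_example} prompt tokens")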

There is an exception. For highly specialized tasks where you need the model to shift its writing style, adopt domain-specific patterns, or handle unusual edge cases, more examples can still help. Evan Armstrong documented using prompts with 20,000+ tokens of handwritten few-shot examples for production pipelines, achieving consistent quality on tasks that other approaches failed at. The key was that every example in that prompt was carefully crafted and covered a different edge case.

The practical framework: start with 2 to 3 examples. Evaluate output quality. Add more only if you see specific failure modes that an additional example would address. If you find yourself going past 8 examples, you are likely better served by fine-tuning or by implementing dynamic example selection, where a retrieval system picks the most relevant examples for each input rather than including a fixed set every time.

Does the Order of Examples Matter?

Yes. Zhao et al. (2021) showed in "Calibrate Before Use: Improving Few-Shot Performance of Language Models" that reordering the same set of examples could swing GPT-3's accuracy from near state-of-the-art to near random chance, depending on the permutation.

The practical implication: place your most representative or most important example last. Language models tend to weight recent tokens more heavily, so the final example in your prompt has outsized influence on the output. If you are seeing inconsistent results, try reordering your examples before adding more.

Newer and larger models (GPT-4, Claude, Llama 3) are less sensitive to ordering effects than earlier models, but the effect has not disappeared entirely. Testing different orderings remains a worthwhile debugging step when few-shot performance is inconsistent.
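
With 2 to 5 examples, exhaustive ordering tests are cheap. A sketch of that debugging step, assuming the OpenAI SDK (the model name and dev set are illustrative):

from itertools import permutations
from openai import OpenAI

client = OpenAI()

examples = [
    ("The product is terrible.", "Negative"),
    ("Super helpful, worth it!", "Positive"),
    ("It's fine, nothing special", "Neutral"),
]
dev_set = [("Absolutely wonderful", "Positive"), ("Waste of money", "Negative")]

def classify(prompt):
    """One model call; any chat-completion API works the same way."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3,
    )
    return response.choices[0].message.content.strip()

def accuracy(ordering):
    """Score one ordering of the examples against the labeled dev set."""
    prefix = "".join(f"Text: {t}\nSentiment: {l}\n" for t, l in ordering)
    hits = sum(
        classify(prefix + f"Text: {text}\nSentiment:") == gold
        for text, gold in dev_set
    )
    return hits / len(dev_set)

# 3 examples -> 6 orderings, so testing every permutation is tractable
scores = {order: accuracy(order) for order in permutations(examples)}
best_order = max(scores, key=scores.get)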

When to Use Few-Shot Prompting (and When Not To)

Few-shot prompting works best when zero-shot instructions leave ambiguity about format, style, or task boundaries. Use it when:

  • You need consistent output formatting or specific structure (JSON, tables, labeled fields)

  • The task involves domain-specific terminology or conventions the model may not default to

  • Zero-shot produces inconsistent or partially correct outputs

  • You need tone or style matching for content generation

  • Classification tasks have ambiguous or overlapping categories

Skip few-shot prompting when:

  • Zero-shot already produces reliable outputs (simple factual questions, basic classification of clear-cut inputs)

  • Your token budget is extremely tight and every token counts

  • The task requires multi-step reasoning that examples alone cannot teach (use CoT prompting instead, or combine few-shot with CoT)

  • You have enough labeled data and query volume to justify fine-tuning, which will outperform prompting at scale

A common mistake is defaulting to few-shot for every prompt. If the model already handles your task well with clear instructions and no examples, adding examples increases cost and latency without improving output quality.

Limitations and Common Pitfalls

Context window constraints. Examples consume tokens, which reduces the space available for your actual task input. With long documents or complex inputs, you may not have room for enough examples to establish a pattern. This is especially relevant for models with smaller context windows.

Example quality outweighs quantity. Poorly chosen examples teach the wrong patterns. If your demonstrations contain inconsistencies in format, ambiguous labels, or edge cases that are not representative of real inputs, the model will reproduce those problems.

Majority label bias. If three of your four examples share the same label, the model may default to that label for ambiguous inputs. Balance your example set across the expected output distribution.

Recency bias. Models tend to weight the last example more heavily than earlier ones. If your final example has an unusual characteristic, the model may over-index on that pattern.

Surface-level pattern matching. The model may latch onto formatting cues (length, punctuation, casing) rather than understanding the underlying task logic. This is why examples with intentionally varied surface features but consistent underlying patterns tend to work better.

Few-Shot Prompting Best Practices

Use diverse, representative examples. Cover the range of expected inputs, not just the easy cases. If your task involves positive, negative, and neutral classifications, include at least one of each.

Keep format consistent across all examples. If one example uses Input: X / Output: Y, every example should follow that exact structure. Inconsistent formatting gives the model mixed signals about the expected output format.

Start with zero-shot, add examples only when needed. Prompt engineering is empirical work. Establish a baseline with no examples, then add demonstrations to fix specific failure modes you observe. Every example should earn its place in the prompt.

Test and iterate systematically. Change one variable at a time: number of examples, order, format, or content. Measure output quality against a consistent evaluation set. What works for one model or task may not generalize.

Consider dynamic example selection for production systems. Static example sets work for prototyping, but production systems benefit from retrieving the most relevant examples per input. This approach uses semantic similarity to match each new input against an example pool, pulling only the demonstrations most likely to help. The result is fewer wasted tokens and higher relevance per example. Systems that maintain a memory layer of successful prompt-response patterns can automate this selection, continuously improving example relevance based on what actually works in practice.
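
As a sketch of the retrieval step, assuming the OpenAI embeddings API (any embedding model works the same way; in production you would embed the pool once and cache the vectors):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Hypothetical example pool covering different cases
pool = [
    ("Great product, 10/10", "Positive"),
    ("Didn't work very well", "Negative"),
    ("It's fine, nothing special", "Neutral"),
    ("Arrived late but works perfectly", "Positive"),
]

def select_examples(query, k=2):
    """Pick the k pool examples most semantically similar to the query."""
    vectors = embed([query] + [text for text, _ in pool])
    query_vec, pool_vecs = vectors[0], vectors[1:]
    # Cosine similarity between the query and each pool input
    sims = pool_vecs @ query_vec / (
        np.linalg.norm(pool_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    return [pool[i] for i in top]

selected = select_examples("I LOVED this movie")
prompt = "".join(f"{text} // {label}\n" for text, label in selected)
prompt += "I LOVED this movie // "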

Compress your examples. Remove unnecessary context, filler words, or details that do not contribute to the pattern you are teaching. Shorter, cleaner examples are cheaper and often more effective because the signal-to-noise ratio is higher.

Frequently Asked Questions

What is the difference between few-shot prompting and fine-tuning?

Few-shot prompting provides examples at inference time within the prompt itself. No model weights change. Fine-tuning trains the model on a dataset, permanently updating its parameters. Few-shot is faster and cheaper to set up, making it ideal for prototyping and low-volume use cases. Fine-tuning offers better performance at scale but requires labeled data, compute resources, and ongoing maintenance when the model or requirements change.

Can few-shot prompting work with any LLM?

In principle, yes. In practice, effectiveness scales with model size. Brown et al. (2020) showed that few-shot capabilities emerged only when models reached sufficient scale. Smaller models (under 7B parameters) often fail to generalize from examples. Larger models like GPT-4, Claude, and Llama 3 70B+ tend to be strong few-shot learners and can pick up complex patterns from just 2 to 3 demonstrations.

Is few-shot prompting the same as in-context learning?

Few-shot prompting is a specific application of in-context learning (ICL). ICL is the broader ability of large language models to learn from information provided in the context window. Few-shot prompting is the deliberate practice of structuring that context with labeled input-output examples to guide behavior on a specific task.

© 2026 Mem0. All rights reserved.