Claude introduced prompt caching for their API.
Think of prompt caching as the ability to save useful pieces of information that can be easily recalled later, so you don't have to keep repeating yourself.
You could then have a conversation about those pieces of information without having to repeat the same information over and over.
In AI models like Claude, charges are typically based on the number of tokens processed. A token can be as short as a character or as long as a word, and the more tokens your input contains, the higher the cost.
Let's start with a prompt example with no caching:
System:
You are a legal expert tasked with analyzing legal agreements. Here is the full text of the legal agreement: [50 Page Legal Agreement]
User:
What are key terms of this agreement?
Executing this prompt incurs token charges for the entire input, including the system message containing the legal document, the user's question, and the resulting output from Claude.
Now lets say you want to ask it another question.
User:
Summarize this agreement for me.
Claude will now execute this prompt, along with the document again and charge you for everything again.
With caching, it works like this:
System:
You are a legal expert tasked with analyzing legal agreements. Here is the full text of the legal agreement: [50 Page Legal Agreement] (CACHE)
User:
What are key terms of this agreement?
When you execute this prompt, it will charge you tokens for both the 'system' message (and long document) caching, the 'user' message, and the actual answer.
But when you ask it a new question like this:
User:
Summarize this agreement for me.
Since the system message has been cached, Claude doesn't need to reprocess it, which significantly reduces both the token cost and response time for subsequent queries.
A couple of things to keep in mind:
The cache lifetime, or Time-To-Live (TTL), is 5 minutes, meaning the cached prompt remains available for reuse during this period. After 5 minutes, the cache expires, and the data must be re-sent if needed.
Currently, there’s no manual way to clear the cache before it expires, so any cached data remains in the system until the 5-minute TTL elapses.
Only prompts that exceed a minimum length of 1024 to 2048 tokens, depending on the Claude model, are eligible for caching. This ensures that caching is reserved for more substantial and complex inputs.
Caching only works via the API (as far as I know).
Performance metrics for caching, such as cache_creation_input_tokens and cache_read_input_tokens, can be monitored to assess the effectiveness of caching in reducing token usage and speeding up responses.
Prompt caching in Claude's API offers significant advantages by allowing developers to store and reuse frequently used prompts, which leads to substantial cost savings and performance improvements. By caching lengthy or complex prompts, users can avoid redundant token charges and reduce latency by up to 85%. This is particularly beneficial in scenarios like multi-turn conversations, large document analysis, and coding assistance, where maintaining consistent context across interactions is crucial.
Prompt caching can potentially be a game-changer in reducing the overhead for long, complex interactions, leading some to say that it might rival or even replace traditional Retrieval-Augmented Generation (RAG) methods in certain applications.
You can learn more about the prompt-caching API here.
ความคิดเห็น