Gemini by Example
Token counting & context windows

Context caching

This example demonstrates how to use the Gemini API's context caching feature to query a large document multiple times without resending it with each request. Because cached input tokens are billed at a reduced rate, this lowers costs when the same content is referenced repeatedly.
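The snippets below use the google-genai SDK; they assume the following imports (standard library plus requests, which is used to fetch the document):

import os
import time

import requests
from google import genai
from google.genai.types import CreateCachedContentConfig, GenerateContentConfig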

Initialize the Gemini client with your API key
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

Specify a versioned model that supports context caching. Note: an explicit version suffix (-001) is required for caching.
model_id = "gemini-1.5-flash-001"

Load a large document (e.g., technical documentation). For this example, we assume the document is in markdown format.
response = requests.get("https://zenml.io/llms.txt")
response.raise_for_status()  # Raise an exception for HTTP errors
api_docs = response.text
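Optionally, check how large the document is before caching it. A small sketch using the SDK's count_tokens call (the doc_tokens variable name is just illustrative):

doc_tokens = client.models.count_tokens(model=model_id, contents=api_docs)
print(f"Document token count: {doc_tokens.total_tokens}")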

Create a cache with the document and system instructions
cache = client.caches.create(
    model=model_id,
    config=CreateCachedContentConfig(
        display_name="ZenML LLMs.txt Documentation Cache",  # Used to identify the cache
        system_instruction=(
            "You are a technical documentation expert. "
            "Answer questions about the ZenML documentation provided. "
            "Keep your answers concise and to the point."
        ),
        contents=[api_docs],
        ttl="900s",  # Cache for 15 minutes
    ),
)

Display cache information
print(f"Cache created with name: {cache.name}")
print(f"Cached token count: {cache.usage_metadata.total_token_count}")
print(f"Cache expires at: {cache.expire_time}")

Define multiple queries to demonstrate reuse of cached content
queries = [
    "What are the recommended use cases for ZenML's pipeline orchestration?",
    "How does ZenML integrate with cloud providers?",
]

Run multiple queries using the same cached content
for query in queries:
    print(f"\nQuery: {query}")

Generate response using the cached content
    response = client.models.generate_content(
        model=model_id,
        contents=query,
        config=GenerateContentConfig(cached_content=cache.name),
    )

Print token usage statistics to demonstrate savings
print(f"Total tokens: {response.usage_metadata.total_token_count}")
    print(f"Cached tokens: {response.usage_metadata.cached_content_token_count}")
    print(f"Output tokens: {response.usage_metadata.candidates_token_count}")

Print the response (truncated for brevity)
print(f"Response: {response.text}...")

    time.sleep(1)  # Short delay between requests

When done with the cache, you can delete it to free up resources
client.caches.delete(name=cache.name)
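Alternatively, if the cache should stay alive longer than its original TTL, it can be extended instead of deleted. A sketch assuming the SDK's caches.update method and UpdateCachedContentConfig type:

from google.genai.types import UpdateCachedContentConfig

client.caches.update(
    name=cache.name,
    config=UpdateCachedContentConfig(ttl="3600s"),  # extend to one hour
)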

Running the Example

Install the Google Gen AI SDK for Python
$ pip install google-genai
Run the Python script
$ python context-caching.py
Cache created with name: cachedContents/n8upgecthnz7
Cached token count: 107203
Cache expires at: 2025-04-05 20:21:48.818511+00:00
Query: What are the recommended use cases for ZenML's pipeline orchestration?
Total tokens: 107387
Cached tokens: 107203
Output tokens: 168
Response: ZenML's pipeline orchestration is well-suited for a wide range of machine learning workflows, including:
* **Data preprocessing:**  Ingesting, cleaning, transforming, and preparing data for model training.
* **Model training:**  Training various types of machine learning models, including deep learning models.
* **Model evaluation:**  Assessing model performance using different metrics and techniques.
* **Model deployment:**  Deploying trained models to different environments for inference.
* **Model monitoring:**  Monitoring the performance and health of deployed models in real-time.
* **A/B testing:**  Experimenting with different model variations and comparing their performance.
* **Hyperparameter tuning:**  Finding optimal hyperparameters for models.
* **Feature engineering:**  Developing and evaluating new features for improving model performance. 
...
Query: How does ZenML integrate with cloud providers?
Total tokens: 107326
Cached tokens: 107203
Output tokens: 113
Response: ZenML integrates with cloud providers by offering stack components that are specific to each provider, such as:
* **Artifact Stores:** S3 (AWS), GCS (GCP), Azure Blob Storage (Azure)
* **Orchestrators:** Skypilot (AWS, GCP, Azure), Kubernetes (AWS, GCP, Azure)
* **Container Registries:** ECR (AWS), GCR (GCP), ACR (Azure)
These components allow you to run pipelines on cloud infrastructure, enabling you to scale and leverage the benefits of cloud computing. 
...

Further Information