Context caching
This example demonstrates how to use the Gemini API's context caching feature to efficiently query a large document multiple times without resending it with each request. This can reduce costs when repeatedly referencing the same content.
Import the required libraries and initialize the Gemini client with your API key
import os
import time

import requests
from google import genai
from google.genai.types import CreateCachedContentConfig, GenerateContentConfig

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
Specify a versioned model that supports context caching
Note: Must use explicit version suffix (-001) for caching
model_id = "gemini-1.5-flash-001"
Load a large document (e.g., technical documentation).
For this example, we assume the document is in markdown format.
response = requests.get("https://zenml.io/llms.txt")
response.raise_for_status() # Raise an exception for HTTP errors
api_docs = response.text
Create a cache with the document and system instructions
cache = client.caches.create(
model=model_id,
config=CreateCachedContentConfig(
display_name="ZenML LLMs.txt Documentation Cache", # Used to identify the cache
system_instruction=(
"You are a technical documentation expert. "
"Answer questions about the ZenML documentation provided. "
"Keep your answers concise and to the point."
),
contents=[api_docs],
ttl="900s", # Cache for 15 minutes
),
)
Display cache information
print(f"Cache created with name: {cache.name}")
print(f"Cached token count: {cache.usage_metadata.total_token_count}")
print(f"Cache expires at: {cache.expire_time}")
Define multiple queries to demonstrate reuse of cached content
queries = [
"What are the recommended use cases for ZenML's pipeline orchestration?",
"How does ZenML integrate with cloud providers?",
]
Run multiple queries using the same cached content
for query in queries:
    print(f"\nQuery: {query}")

    # Generate a response using the cached content
    response = client.models.generate_content(
        model=model_id,
        contents=query,
        config=GenerateContentConfig(cached_content=cache.name),
    )

    # Print token usage statistics to demonstrate the savings
    print(f"Total tokens: {response.usage_metadata.total_token_count}")
    print(f"Cached tokens: {response.usage_metadata.cached_content_token_count}")
    print(f"Output tokens: {response.usage_metadata.candidates_token_count}")

    # Print the response (truncated for brevity)
    print(f"Response: {response.text[:500]}...")

    time.sleep(1)  # Short delay between requests
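To make the saving concrete, the usage metadata can be reduced to a single number: the share of each request's tokens that were served from the cache rather than resent (cached tokens are billed at a reduced rate). A small helper sketched from the same usage_metadata fields used above:
def cached_share(usage) -> float:
    # Fraction of this request's tokens that came from the cached document
    cached = usage.cached_content_token_count or 0
    total = usage.total_token_count or 0
    return cached / total if total else 0.0

print(f"Served from cache: {cached_share(response.usage_metadata):.1%}")

For the first query in the sample output below this works out to 107203 / 107387, i.e. roughly 99.8% of the request's tokens never had to be resent.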
When done with the cache, you can delete it to free up resources
client.caches.delete(name=cache.name)
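If you lose track of which caches are still alive (for example after an interrupted run), they can be enumerated and cleaned up explicitly. A rough sketch, assuming the SDK's cache listing call and matching on the display name used above:
# Delete any leftover caches from earlier runs of this example
for cached in client.caches.list():
    if cached.display_name == "ZenML LLMs.txt Documentation Cache":
        client.caches.delete(name=cached.name)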
Running the Example
Install the Google Gen AI SDK (the google-genai package)
$ pip install google-genai
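Make sure your API key is available in the environment variable the script reads (the key value below is a placeholder)
$ export GEMINI_API_KEY="your-api-key"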
Run the Python script
$ python context-caching.py
Cache created with name: cachedContents/n8upgecthnz7
Cached token count: 107203
Cache expires at: 2025-04-05 20:21:48.818511+00:00
Query: What are the recommended use cases for ZenML's pipeline orchestration?
Total tokens: 107387
Cached tokens: 107203
Output tokens: 168
Response: ZenML's pipeline orchestration is well-suited for a wide range of machine learning workflows, including:
* **Data preprocessing:** Ingesting, cleaning, transforming, and preparing data for model training.
* **Model training:** Training various types of machine learning models, including deep learning models.
* **Model evaluation:** Assessing model performance using different metrics and techniques.
* **Model deployment:** Deploying trained models to different environments for inference.
* **Model monitoring:** Monitoring the performance and health of deployed models in real-time.
* **A/B testing:** Experimenting with different model variations and comparing their performance.
* **Hyperparameter tuning:** Finding optimal hyperparameters for models.
* **Feature engineering:** Developing and evaluating new features for improving model performance.
...
Query: How does ZenML integrate with cloud providers?
Total tokens: 107326
Cached tokens: 107203
Output tokens: 113
Response: ZenML integrates with cloud providers by offering stack components that are specific to each provider, such as:
* **Artifact Stores:** S3 (AWS), GCS (GCP), Azure Blob Storage (Azure)
* **Orchestrators:** Skypilot (AWS, GCP, Azure), Kubernetes (AWS, GCP, Azure)
* **Container Registries:** ECR (AWS), GCR (GCP), ACR (Azure)
These components allow you to run pipelines on cloud infrastructure, enabling you to scale and leverage the benefits of cloud computing.
...