PDF and CSV data analysis and summarization
This example demonstrates how to use the Gemini API to analyze data from PDF and CSV files.
Import necessary libraries
from google import genai
from google.genai import types
import httpx
import os
Initialize the Gemini client with your API key
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
We start with the PDF analysis.
Download the PDF file
pdf_url = "https://www.princexml.com/samples/invoice/invoicesample.pdf"
pdf_data = httpx.get(pdf_url).content
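A download can succeed at the HTTP level and still return something other than a PDF, for example an HTML error page. Below is a minimal sketch of a sanity check to run before sending the bytes to the model. It relies only on the fact that PDF files begin with the `%PDF-` signature. The helper name `looks_like_pdf` is hypothetical and is not part of the Gemini SDK or httpx.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF file signature.

    This is a conservative check: it does not validate the rest of the file.
    """
    return data.startswith(PDF_SIGNATURE)


# Illustrative payloads (not real files)
print(looks_like_pdf(b"%PDF-1.7\n"))               # True
print(looks_like_pdf(b"<html>Not Found</html>"))   # False
```

In the script, calling `looks_like_pdf(pdf_data)` right after the download and raising an error on `False` would catch a bad URL before any API call is made. httpx's `response.raise_for_status()` is a complementary check on the HTTP status code.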
Prompt to identify the main entities in the PDF
pdf_prompt = (
    "Identify the main companies or entities mentioned in this invoice. "
    "Summarize the data."
)
Generate content with the PDF and the prompt
pdf_response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=pdf_data, mime_type="application/pdf"),
        pdf_prompt,
    ],
)
Print the PDF analysis result
print("PDF Analysis Result:\n", pdf_response.text)
The CSV analysis follows the same pattern: download the file, wrap the bytes
in a Part with the matching MIME type, and send it together with a prompt.
The same approach works for code files, XML, RTF, Markdown, and more.
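Since each file type needs its own `mime_type` value, a small lookup helper can keep that choice in one place. The sketch below is illustrative: `guess_mime_type` and `MIME_OVERRIDES` are hypothetical names. Only the PDF and CSV values come from this example. The other entries are commonly registered MIME types, and whether the API accepts each of them is an assumption to verify against the Gemini documentation.

```python
import mimetypes
from pathlib import Path

# Explicit overrides keep results stable across platforms.
# Only the .pdf and .csv values are taken from this example; the
# rest are assumptions to check against the API documentation.
MIME_OVERRIDES = {
    ".pdf": "application/pdf",
    ".csv": "text/csv",
    ".md": "text/markdown",
    ".xml": "text/xml",
    ".rtf": "text/rtf",
}


def guess_mime_type(filename: str, default: str = "text/plain") -> str:
    """Pick a MIME type from the file extension, falling back to a default."""
    suffix = Path(filename).suffix.lower()
    if suffix in MIME_OVERRIDES:
        return MIME_OVERRIDES[suffix]
    # Fall back to the standard library's extension table
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(guess_mime_type("invoicesample.pdf"))  # application/pdf
print(guess_mime_type("test.csv"))           # text/csv
```

The returned string could then be passed as the `mime_type` argument of `types.Part.from_bytes`.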
Download the CSV file
csv_url = "https://gist.githubusercontent.com/suellenstringer-hye/f2231b3383538bcb1a5b051c7908f5b7/raw/0f4e0733a434733cda8e749bbbf33a93c2b5bbde/test.csv"
csv_data = httpx.get(csv_url).content
Prompt to analyze the CSV data
csv_prompt = "Analyze this data and tell me about the contents. Summarize the data."
Generate content with the CSV data and the prompt
csv_response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(
            data=csv_data,
            mime_type="text/csv",
        ),
        csv_prompt,
    ],
)
Print the CSV analysis result
print("\nCSV Analysis Result:\n", csv_response.text)
Running the Example
First, install the Google Gen AI SDK and httpx (for downloading files)
$ pip install google-genai httpx
Then run the program with Python
$ python pdf_csv_analysis.py
PDF Analysis Result:
The main company mentioned in the invoice is Sunny Farm.
CSV Analysis Result:
Okay, I've analyzed the provided data. Here's a summary of its contents:
**Data Format:**
* The data appears to be in CSV (Comma Separated Values) format.
* The first line is a header row defining the fields.
* Each subsequent line represents a record containing information about a person.
**Fields Present:**
The data includes the following fields for each person:
1. **first\_name:** The person's first name.
2. **last\_name:** The person's last name.
3. **company\_name:** The name of the company they are associated with.
4. **address:** The street address.
5. **city:** The city.
6. **county:** The county.
7. **state:** The state.
8. **zip:** The zip code.
9. **phone1:** The primary phone number.
10. **phone2:** A secondary phone number.
11. **email:** The email address.
12. **web:** The website address (presumably for the associated company).
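A model-generated summary like the one above can be cross-checked against the file itself. The following is a minimal sketch using only the standard library. It assumes the CSV is UTF-8 encoded and has a header row. The helper name `csv_overview` is hypothetical, and the sample bytes are made up for illustration; they are not the downloaded file.

```python
import csv
import io


def csv_overview(data: bytes, encoding: str = "utf-8") -> dict:
    """Return the header fields and the number of non-blank data rows."""
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    header = next(reader, [])  # empty list if the payload has no rows
    row_count = sum(1 for row in reader if row)  # skip blank lines
    return {"fields": header, "rows": row_count}


# Made-up sample data for illustration only
sample = (
    b"first_name,last_name,email\n"
    b"Ada,Lovelace,ada@example.com\n"
    b"Alan,Turing,alan@example.com\n"
)
print(csv_overview(sample))
# {'fields': ['first_name', 'last_name', 'email'], 'rows': 2}
```

Calling `csv_overview(csv_data)` in the script would give the actual field names and row count to compare with the fields the model reports.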