Generate text from multimodal prompts using the Gemini API


When calling the Gemini API from your app using a Vertex AI in Firebase SDK, you can prompt the Gemini model to generate text based on a multimodal input. Multimodal prompts can include multiple modalities (or types of input), like text along with images, PDFs, plain-text files, video, and audio.

In each multimodal request, you must always provide the following:

  • The file's mimeType. Learn about each input file's supported MIME types.

  • The file. You can provide the file either as inline data (as shown on this page) or by using its URL or URI.
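As a sketch of what an inline-data part looks like with the JavaScript SDK's request shape, pairing the base64-encoded bytes with their MIME type (the helper name here is hypothetical, not part of the SDK):

```javascript
// Hypothetical helper (not an SDK function): build the inline-data part that a
// multimodal request expects, pairing base64-encoded file bytes with a MIME type.
function makeInlineDataPart(base64Data, mimeType) {
  return { inlineData: { data: base64Data, mimeType: mimeType } };
}

// Example: a JPEG image provided as inline data.
const imagePart = makeInlineDataPart("<base64-encoded bytes>", "image/jpeg");
console.log(imagePart.inlineData.mimeType); // "image/jpeg"
```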

For testing and iterating on multimodal prompts, we recommend using Vertex AI Studio.

Before you begin

If you haven't already, complete the getting started guide for the Vertex AI in Firebase SDKs. Make sure that you've done all of the following:

  1. Set up a new or existing Firebase project, including using the Blaze pricing plan and enabling the required APIs.

  2. Connect your app to Firebase, including registering your app and adding your Firebase config to your app.

  3. Add the SDK and initialize the Vertex AI service and the generative model in your app.

After you've connected your app to Firebase, added the SDK, and initialized the Vertex AI service and the generative model, you're ready to call the Gemini API.
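For reference, initialization with the Firebase JS SDK looks roughly like the following. This is a sketch: all config values are placeholders for your own project, and the `firebase/vertexai` module path has varied across SDK releases (earlier preview releases used `firebase/vertexai-preview`), so check the version you have installed.

```javascript
// Sketch of initializing the Vertex AI service and a generative model with the
// Firebase JS SDK. All config values below are placeholders.
import { initializeApp } from "firebase/app";
import { getVertexAI, getGenerativeModel } from "firebase/vertexai";

const firebaseApp = initializeApp({
  // Your Firebase project's config object, copied from the Firebase console.
  apiKey: "YOUR_API_KEY",
  projectId: "your-project-id",
  appId: "YOUR_APP_ID",
});

// Initialize the Vertex AI service.
const vertexAI = getVertexAI(firebaseApp);

// Create a model instance; use a model that supports media in prompts.
const model = getGenerativeModel(vertexAI, { model: "gemini-1.5-flash" });
```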

This page shows how to do the following:

  • Generate text from text and a single image
  • Generate text from text and multiple images
  • Generate text from text and a video

Sample media files

If you don't already have media files, then you can use publicly available sample files.

Generate text from text and a single image

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a single file (like an image, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether to stream the response (generateContentStream) or wait for the entire result to be generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results as they arrive, instead of waiting for the entire result of the model's generation.

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
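The two options can be sketched as follows, assuming the SDK's `generateContentStream`/`generateContent` methods and a `model` created as in the initialization step; the prompt text, function names, and image part are illustrative, not part of the SDK.

```javascript
// Sketch: text + a single image. `imagePart` follows the SDK's inline-data
// shape, e.g. { inlineData: { data: "<base64>", mimeType: "image/jpeg" } }.

// Streaming: handle partial results as they arrive.
async function describeImageStreaming(model, imagePart) {
  const result = await model.generateContentStream(["What's in this picture?", imagePart]);
  let text = "";
  for await (const chunk of result.stream) {
    text += chunk.text(); // append each partial result as it streams in
  }
  return text;
}

// Without streaming: wait for the complete response.
async function describeImage(model, imagePart) {
  const result = await model.generateContent(["What's in this picture?", imagePart]);
  return result.response.text();
}
```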

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and multiple images

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and multiple files (like images, as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether to stream the response (generateContentStream) or wait for the entire result to be generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results as they arrive, instead of waiting for the entire result of the model's generation.

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
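The multiple-image case differs only in that every image part is passed in the same request array. A sketch, assuming a `model` created as in the initialization step and image parts in the SDK's inline-data shape (prompt text and function names are illustrative):

```javascript
// Sketch: text + multiple images in a single request.

// Without streaming: send the text followed by every image part.
async function compareImages(model, imageParts) {
  const result = await model.generateContent([
    "What's different between these pictures?",
    ...imageParts, // each: { inlineData: { data: "<base64>", mimeType: "..." } }
  ]);
  return result.response.text();
}

// Streaming: accumulate partial results as they arrive.
async function compareImagesStreaming(model, imageParts) {
  const result = await model.generateContentStream([
    "What's different between these pictures?",
    ...imageParts,
  ]);
  let text = "";
  for await (const chunk of result.stream) {
    text += chunk.text();
  }
  return text;
}
```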

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Generate text from text and a video

Make sure that you've completed the Before you begin section of this guide before trying this sample.

You can call the Gemini API with multimodal prompts that include both text and a video file (as shown in this example). For these calls, you need to use a model that supports media in prompts (like Gemini 1.5 Flash).

Make sure to review the requirements and recommendations for input files.

Choose whether to stream the response (generateContentStream) or wait for the entire result to be generated (generateContent).

Streaming

You can achieve faster interactions by using streaming to handle partial results as they arrive, instead of waiting for the entire result of the model's generation.

Without streaming

Alternatively, you can wait for the entire result instead of streaming; the result is only returned after the model completes the entire generation process.
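Video follows the same pattern as images, with a video MIME type on the inline-data part. A sketch, assuming a `model` created as in the initialization step; the MIME type, prompt, and function names are illustrative:

```javascript
// Sketch: text + a video provided as inline data.

// Build an inline-data part from base64-encoded video bytes (hypothetical helper).
function videoPartFromBase64(base64Data, mimeType = "video/mp4") {
  return { inlineData: { data: base64Data, mimeType } };
}

// Without streaming: wait for the complete response.
async function summarizeVideo(model, videoPart) {
  const result = await model.generateContent(["What is in the video?", videoPart]);
  return result.response.text();
}

// Streaming: accumulate partial results as they arrive.
async function summarizeVideoStreaming(model, videoPart) {
  const result = await model.generateContentStream(["What is in the video?", videoPart]);
  let text = "";
  for await (const chunk of result.stream) {
    text += chunk.text();
  }
  return text;
}
```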

Learn how to choose a Gemini model and optionally a location appropriate for your use case and app.

Requirements and recommendations for input files

See Supported input files and requirements for the Vertex AI Gemini API to learn about the following:

  • Different options for providing a file in a request
  • Supported file types
  • Supported MIME types and how to specify them
  • Requirements and best practices for files and multimodal requests

What else can you do?

  • Learn how to count tokens before sending long prompts to the model.
  • Set up Cloud Storage for Firebase so that you can include large files in your multimodal requests and have a more managed solution for providing files in prompts. Files can include images, PDFs, video, and audio.
  • Start thinking about preparing for production, including setting up Firebase App Check to protect the Gemini API from abuse by unauthorized clients.

Try out other capabilities of the Gemini API

Learn how to control content generation

You can also experiment with prompts and model configurations using Vertex AI Studio.

Learn more about the Gemini models

Learn about the models available for various use cases and their quotas and pricing.


Give feedback about your experience with Vertex AI in Firebase