AI automation shouldn’t be limited to text. The work your team does every day involves screenshots, PDFs, spreadsheets, voice memos, and images — not just words in a text box.
JieGou recipes and workflows now support multimodal inputs and outputs. Upload an image and ask Claude to analyze it. Attach a PDF and extract structured data. Record audio and let Whisper transcribe it before the LLM processes it. Generate images as part of your output. And chain all of it across workflow steps.
What you can upload
Recipes now accept three types of media alongside text inputs:
Images — JPEG, PNG, WebP, and GIF. Upload a screenshot, a product photo, or a chart, and the LLM sees it natively. Image inputs work with Claude (Anthropic), GPT-4o (OpenAI), and Gemini (Google) — all three providers support vision out of the box.
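To make the "sees it natively" part concrete, here is a minimal sketch of the shape an image payload might take before it is handed to a vision model: base64-encoded bytes plus a MIME type. The `ChatImage` interface and `toChatImage` helper are illustrative, not JieGou's actual type definitions.

```typescript
// Illustrative shape for an uploaded image headed to a vision model:
// base64 data plus a MIME type. Not JieGou's real internal type.
interface ChatImage {
  mimeType: "image/jpeg" | "image/png" | "image/webp" | "image/gif";
  data: string; // base64-encoded image bytes
}

// Package raw bytes into that shape (hypothetical helper).
function toChatImage(
  bytes: Uint8Array,
  mimeType: ChatImage["mimeType"]
): ChatImage {
  return {
    mimeType,
    data: Buffer.from(bytes).toString("base64"),
  };
}
```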
Documents — PDF, DOCX, CSV, XLSX, TXT, Markdown, and HTML. Upload a contract, a spreadsheet, or a report. JieGou parses the document server-side and delivers the content to the LLM in the most effective format for each provider. Anthropic and Google receive documents natively as file attachments. For providers without native file support, JieGou extracts the text and injects it into the prompt.
Audio — WebM, MP3, MP4, WAV, FLAC, and other common formats. Audio handling depends on the model. Google Gemini and OpenAI’s audio-preview models process audio natively — the raw audio goes straight to the LLM. For all other models (including Claude), JieGou transcribes the audio via OpenAI’s Whisper API and passes the transcript as text. This fallback happens automatically. You don’t need to configure anything.
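The routing decision described here is simple enough to sketch: models with native audio support get the raw bytes, everything else gets a Whisper transcript. The model-matching patterns below are illustrative assumptions, not JieGou's actual routing table.

```typescript
// Hypothetical sketch of the audio routing decision: native passthrough
// for audio-capable models, Whisper transcription for everything else.
type AudioRoute = "native" | "whisper-transcription";

function routeAudio(modelId: string): AudioRoute {
  // Illustrative patterns: Gemini models and OpenAI's audio-preview
  // variants process audio natively (per the support described above).
  const nativeAudio = [/^gemini-/, /audio-preview/];
  return nativeAudio.some((p) => p.test(modelId))
    ? "native"
    : "whisper-transcription";
}
```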
How it works under the hood
When you add an image, file, or audio field to a recipe’s input schema, JieGou marks it with a widget annotation (image-upload, file-upload, or audio-upload). At execution time, three things happen:
1. Extraction. JieGou scans the input for media fields and separates them from text inputs. Image fields become `ChatImage` objects (base64 data + MIME type). Files are parsed into structured content. Audio is identified for native or fallback handling.
2. Provider routing. JieGou checks what the target model supports natively. If the provider handles the media type directly, it builds a multipart message — interleaving images, files, and text in a single request. If not, it falls back gracefully: documents become extracted text in `<attached_file>` tags, audio becomes a Whisper transcript in `<transcribed_audio>` tags.
3. Message assembly. The final message sent to the LLM combines all media and text into the format each provider expects. The Vercel AI SDK handles the last mile of provider-specific formatting.
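The graceful-fallback step can be sketched as a small formatting function. The tag names come from the article; the function signature and the `name` attribute are illustrative assumptions about how the extracted content might be wrapped before injection into the prompt.

```typescript
// Hypothetical sketch of the fallback path: wrap extracted document text
// or a Whisper transcript in tags before injecting it into the prompt.
// Tag names match the article; everything else is illustrative.
function fallbackBlock(
  kind: "file" | "audio",
  name: string,
  content: string
): string {
  const tag = kind === "file" ? "attached_file" : "transcribed_audio";
  return `<${tag} name="${name}">\n${content}\n</${tag}>`;
}
```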
The result: you write one recipe, and it works across Claude, GPT, and Gemini without any provider-specific configuration.
Document parsing
File uploads aren’t just passed through as raw bytes. JieGou parses each format server-side to extract clean, structured content:
- PDF — Full text extraction with page count metadata
- DOCX — Raw text extraction without formatting artifacts
- CSV / TXT / Markdown — UTF-8 text passed through directly
- XLSX — First worksheet converted to CSV rows, plus metadata (sheet count, row count)
- HTML — Script and style tags stripped, entities decoded, clean text extracted
File size is capped at 10 MB per upload, and extracted content is limited to 1 MB of text — enough for most business documents while keeping LLM context usage reasonable.
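The two limits translate directly into code. A minimal sketch, assuming the 1 MB text cap is enforced as a character count (the article does not specify bytes vs. characters) and that oversized uploads are rejected outright:

```typescript
// Illustrative enforcement of the limits described above:
// uploads capped at 10 MB, extracted text capped at 1 MB.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;
const MAX_EXTRACTED_CHARS = 1024 * 1024;

// Reject files over the upload cap (hypothetical helper).
function checkUploadSize(bytes: number): void {
  if (bytes > MAX_UPLOAD_BYTES) {
    throw new Error("File exceeds the 10 MB upload limit");
  }
}

// Truncate extracted text to keep LLM context usage reasonable.
function capExtractedText(text: string): string {
  return text.length > MAX_EXTRACTED_CHARS
    ? text.slice(0, MAX_EXTRACTED_CHARS)
    : text;
}
```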
Image generation
Some models can generate images as part of their output. When GPT-4o or Gemini produces an image, JieGou captures it automatically. Generated images appear in the recipe output alongside text, with download buttons for saving them locally.
This means you can build recipes that take a text description and produce a visual — product mockups, social media graphics, chart visualizations — without leaving JieGou.
Chaining multimodal content across workflow steps
The real power shows up in workflows. When one step produces images — whether generated by an LLM or captured via a browser screenshot — those images are stored in the workflow context and made available to downstream steps.
Here’s a concrete example:
- Step 1 (Browser action) — Navigate to a dashboard and take a screenshot
- Step 2 (LLM step) — Analyze the screenshot, identify anomalies, write a summary
- Step 3 (Image generation) — Generate a cleaned-up chart based on the analysis
- Step 4 (LLM step) — Compose a report combining the analysis text and generated chart
Each step automatically receives the images produced by earlier steps. No manual wiring. The workflow engine handles the plumbing through a hidden `_images` field that propagates through the step context.
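This propagation pattern can be sketched in a few lines: after each step runs, its output images are appended to the accumulated list, so every downstream step sees everything produced so far. The `_images` field name comes from the article; the context shape and merge function are illustrative.

```typescript
// Hypothetical step context with the hidden image accumulator.
interface StepContext {
  [key: string]: unknown;
  _images?: string[]; // e.g. base64 strings or storage references
}

// Merge a step's output images into the context passed downstream.
function mergeStepOutput(
  ctx: StepContext,
  output: { images?: string[] }
): StepContext {
  return {
    ...ctx,
    _images: [...(ctx._images ?? []), ...(output.images ?? [])],
  };
}
```

With this shape, the screenshot from Step 1 and the generated chart from Step 3 both arrive in Step 4's context without any explicit wiring between steps.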
Provider support matrix
| Capability | Anthropic (Claude) | OpenAI (GPT-4o) | Google (Gemini) |
|---|---|---|---|
| Image input | Native | Native | Native |
| Document input | Native file attachment | Text extraction fallback | Native file attachment |
| Audio input | Whisper transcription | Native (audio-preview models) | Native (Gemini 2.5+) |
| Image generation | — | Native | Native |
Availability
Multimodal inputs — images, files, and audio — are available on Pro plans and above. Image generation output works with any model that supports it. Learn more about recipes or start your free trial.