AI automation shouldn’t be limited to text. The work your team does every day involves screenshots, PDFs, spreadsheets, voice memos, and images — not just words in a text box.
JieGou recipes and workflows now support multimodal inputs and outputs. Upload an image and ask Claude to analyze it. Attach a PDF and extract structured data. Record audio and let Whisper transcribe it before the LLM processes it. Generate images as part of your output. And chain all of it across workflow steps.
What you can upload
Recipes now accept three types of media alongside text inputs:
Images — JPEG, PNG, WebP, and GIF. Upload a screenshot, a product photo, or a chart, and the LLM sees it natively. Image inputs work with Claude (Anthropic), GPT-4o (OpenAI), and Gemini (Google) — all three providers support vision out of the box.
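To make the "sees it natively" part concrete, here is a minimal sketch of the shape an image payload might take before it is handed to a vision model: base64-encoded bytes plus a MIME type. The `ChatImage` interface and `toChatImage` helper are illustrative, not JieGou's actual type definitions.

```typescript
// Illustrative shape for an uploaded image headed to a vision model:
// base64 data plus a MIME type. Not JieGou's real internal type.
interface ChatImage {
  mimeType: "image/jpeg" | "image/png" | "image/webp" | "image/gif";
  data: string; // base64-encoded image bytes
}

// Package raw bytes into that shape (hypothetical helper).
function toChatImage(
  bytes: Uint8Array,
  mimeType: ChatImage["mimeType"]
): ChatImage {
  return {
    mimeType,
    data: Buffer.from(bytes).toString("base64"),
  };
}
```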
Documents — PDF, DOCX, CSV, XLSX, TXT, Markdown, and HTML. Upload a contract, a spreadsheet, or a report. JieGou parses the document server-side and delivers the content to the LLM in the most effective format for each provider. Anthropic and Google receive documents natively as file attachments. For providers without native file support, JieGou extracts the text and injects it into the prompt.
Audio — WebM, MP3, MP4, WAV, FLAC, and other common formats. Audio handling depends on the model. Google Gemini and OpenAI’s audio-preview models process audio natively — the raw audio goes straight to the LLM. For all other models (including Claude), JieGou transcribes the audio via OpenAI’s Whisper API and passes the transcript as text. This fallback happens automatically. You don’t need to configure anything.
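The routing decision described here is simple enough to sketch: models with native audio support get the raw bytes, everything else gets a Whisper transcript. The model-matching patterns below are illustrative assumptions, not JieGou's actual routing table.

```typescript
// Hypothetical sketch of the audio routing decision: native passthrough
// for audio-capable models, Whisper transcription for everything else.
type AudioRoute = "native" | "whisper-transcription";

function routeAudio(modelId: string): AudioRoute {
  // Illustrative patterns: Gemini models and OpenAI's audio-preview
  // variants process audio natively (per the support described above).
  const nativeAudio = [/^gemini-/, /audio-preview/];
  return nativeAudio.some((p) => p.test(modelId))
    ? "native"
    : "whisper-transcription";
}
```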
How it works under the hood
When you add an image, file, or audio field to a recipe’s input schema, JieGou marks it with a widget annotation (image-upload, file-upload, or audio-upload). At execution time, three things happen:
1. Extraction. JieGou scans the input for media fields and separates them from text inputs. Image fields become `ChatImage` objects (base64 data + MIME type). Files are parsed into structured content. Audio is identified for native or fallback handling.
2. Provider routing. JieGou checks what the target model supports natively. If the provider handles the media type directly, it builds a multipart message — interleaving images, files, and text in a single request. If not, it falls back gracefully: documents become extracted text in `<attached_file>` tags, audio becomes a Whisper transcript in `<transcribed_audio>` tags.
3. Message assembly. The final message sent to the LLM combines all media and text into the format each provider expects. The Vercel AI SDK handles the last mile of provider-specific formatting.
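The graceful-fallback step can be sketched as a small formatting function. The tag names come from the article; the function signature and the `name` attribute are illustrative assumptions about how the extracted content might be wrapped before injection into the prompt.

```typescript
// Hypothetical sketch of the fallback path: wrap extracted document text
// or a Whisper transcript in tags before injecting it into the prompt.
// Tag names match the article; everything else is illustrative.
function fallbackBlock(
  kind: "file" | "audio",
  name: string,
  content: string
): string {
  const tag = kind === "file" ? "attached_file" : "transcribed_audio";
  return `<${tag} name="${name}">\n${content}\n</${tag}>`;
}
```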
The result: you write one recipe, and it works across Claude, GPT, and Gemini without any provider-specific configuration.
Document parsing
File uploads aren’t just passed through as raw bytes. JieGou parses each format server-side to extract clean, structured content:
- PDF — Full text extraction with page count metadata
- DOCX — Raw text extraction without formatting artifacts
- CSV / TXT / Markdown — UTF-8 text passed through directly
- XLSX — First worksheet converted to CSV rows, plus metadata (sheet count, row count)
- HTML — Script and style tags stripped, entities decoded, clean text extracted
File size is capped at 10 MB per upload, and extracted content is limited to 1 MB of text — enough for most business documents while keeping LLM context usage reasonable.
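The two limits translate directly into code. A minimal sketch, assuming the 1 MB text cap is enforced as a character count (the article does not specify bytes vs. characters) and that oversized uploads are rejected outright:

```typescript
// Illustrative enforcement of the limits described above:
// uploads capped at 10 MB, extracted text capped at 1 MB.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;
const MAX_EXTRACTED_CHARS = 1024 * 1024;

// Reject files over the upload cap (hypothetical helper).
function checkUploadSize(bytes: number): void {
  if (bytes > MAX_UPLOAD_BYTES) {
    throw new Error("File exceeds the 10 MB upload limit");
  }
}

// Truncate extracted text to keep LLM context usage reasonable.
function capExtractedText(text: string): string {
  return text.length > MAX_EXTRACTED_CHARS
    ? text.slice(0, MAX_EXTRACTED_CHARS)
    : text;
}
```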
Image generation
Some models can generate images as part of their output. When GPT-4o or Gemini produces an image, JieGou captures it automatically. Generated images appear in the recipe output alongside text, with download buttons for saving them locally.
This means you can build recipes that take a text description and produce a visual — product mockups, social media graphics, chart visualizations — without leaving JieGou.
Chaining multimodal content across workflow steps
The real power shows up in workflows. When one step produces images — whether generated by an LLM or captured via a browser screenshot — those images are stored in the workflow context and made available to downstream steps.
Here’s a concrete example:
- Step 1 (Browser action) — Navigate to a dashboard and take a screenshot
- Step 2 (LLM step) — Analyze the screenshot, identify anomalies, write a summary
- Step 3 (Image generation) — Generate a cleaned-up chart based on the analysis
- Step 4 (LLM step) — Compose a report combining the analysis text and generated chart
Each step automatically receives the images produced by earlier steps. No manual wiring. The workflow engine handles the plumbing through a hidden `_images` field that propagates through the step context.
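This propagation pattern can be sketched in a few lines: after each step runs, its output images are appended to the accumulated list, so every downstream step sees everything produced so far. The `_images` field name comes from the article; the context shape and merge function are illustrative.

```typescript
// Hypothetical step context with the hidden image accumulator.
interface StepContext {
  [key: string]: unknown;
  _images?: string[]; // e.g. base64 strings or storage references
}

// Merge a step's output images into the context passed downstream.
function mergeStepOutput(
  ctx: StepContext,
  output: { images?: string[] }
): StepContext {
  return {
    ...ctx,
    _images: [...(ctx._images ?? []), ...(output.images ?? [])],
  };
}
```

With this shape, the screenshot from Step 1 and the generated chart from Step 3 both arrive in Step 4's context without any explicit wiring between steps.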
Provider support matrix
| Capability | Anthropic (Claude) | OpenAI (GPT-4o) | Google (Gemini) |
|---|---|---|---|
| Image input | Native | Native | Native |
| Document input | Native file attachment | Text extraction fallback | Native file attachment |
| Audio input | Whisper transcription | Native (audio-preview models) | Native (Gemini 2.5+) |
| Image generation | — | Native | Native |
Availability
Multimodal inputs — images, files, and audio — are available on Pro plans and above. Image generation output works with any model that supports it. Learn more about recipes or start your free trial.