ssam18 commented Dec 1, 2025

Summary

This PR adds native multimodal support for crewAI Agents and Tasks, addressing feature request #3860.

Changes

New Modules

multimodal/image.py

  • Image class for handling images from multiple sources:
    • HTTP(S) URLs via from_url()
    • Local file paths via from_file() with automatic MIME type detection
    • Base64 encoded strings via from_base64()
    • Binary data via from_binary()
    • Runtime placeholders via from_placeholder()
  • Automatic source type inference
  • Conversion methods: to_data_url(), to_message_content() (illustrated in the sketch after this list)
  • Compatible with OpenAI, Anthropic, and other vision-capable LLM APIs
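
For illustration, a minimal sketch of the constructors and conversion methods; the commented output shapes are assumptions based on the OpenAI-style image_url format, not taken from the PR:

from crewai.multimodal import Image

# Load a local file; the MIME type is detected automatically.
img = Image.from_file("diagram1.png")

# Inline data URL, e.g. "data:image/png;base64,...".
data_url = img.to_data_url()

# LLM message-content part; the exact dict shape is an assumption,
# e.g. {"type": "image_url", "image_url": {"url": data_url}}.
part = img.to_message_content()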

multimodal/multipart_content.py

  • MultipartContent class for managing compound text+image context
  • Methods: add_text(), add_image(), to_message_content()
  • Utility methods: get_text_only(), has_images() (see the sketch after this list)
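
A minimal sketch of how these methods fit together; the commented return values are assumptions, not taken from the PR:

from crewai.multimodal import Image, MultipartContent

content = MultipartContent()
content.add_text("Summarize this chart:")
content.add_image(Image.from_url("https://example.com/chart.png"))

print(content.has_images())     # True once an image has been added
print(content.get_text_only())  # text parts only, images omitted

# Mixed text+image message parts in the format expected by
# vision-capable LLM APIs; the exact part structure is an assumption.
parts = content.to_message_content()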

Integration

agent/core.py and task.py

  • Added multipart_context field to both Agent and Task classes
  • Field type: list[str | Image] | MultipartContent | None
  • Enables multimodal system prompts for agents and multimodal inputs for tasks (a sketch of the field declaration follows)
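
A minimal sketch of how the new field might be declared, assuming the Pydantic field style crewAI already uses; the defaults, validators, and surrounding class in the actual PR may differ:

from pydantic import BaseModel, Field

from crewai.multimodal import Image, MultipartContent


class Task(BaseModel):
    # Other Task fields omitted for brevity.
    description: str
    # Hypothetical declaration matching the field type listed above.
    multipart_context: list[str | Image] | MultipartContent | None = Field(
        default=None,
        description="Optional multimodal (text + image) context for this task.",
    )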

Usage Example

from crewai import Agent, Task
from crewai.multimodal import Image, MultipartContent

# Agent with multimodal context
agent = Agent(
    role="Vision Analyst",
    goal="Analyze images and diagrams",
    multipart_context=[
        "You are an expert at analyzing visual content.",
        Image.from_url("https://example.com/reference.png")
    ]
)

# Task with multipart content
content = MultipartContent()
content.add_text("Compare these two architectural diagrams:")
content.add_image(Image.from_file("diagram1.png"))
content.add_image(Image.from_file("diagram2.png"))

task = Task(
    description="Identify differences between the diagrams",
    multipart_context=content,
    agent=agent
)

Testing

Comprehensive test suite included in tests/multimodal/:

  • test_image.py: 19 test cases covering all Image class functionality
  • test_multipart_content.py: 13 test cases for MultipartContent
  • run_tests.py: Custom test runner for environments without pytest
  • validate_implementation.py: Syntax and structure validation

All validation checks pass successfully.
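
As an illustration of the style such tests might take (a hypothetical case; the mime_type keyword and the assertions are assumptions, not taken from the PR's test suite):

import base64

from crewai.multimodal import Image


def test_from_base64_round_trips_to_data_url():
    # Minimal fake payload used as a fixture; not from the PR's tests.
    raw = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode()

    # The mime_type keyword is an assumption about the signature.
    img = Image.from_base64(raw, mime_type="image/png")

    data_url = img.to_data_url()
    assert data_url.startswith("data:image/png;base64,")
    assert raw in data_url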

Closes

#3860


Note

Introduces multimodal support via new Image and MultipartContent classes and integrates them as multipart_context on Agent and Task, with comprehensive tests.

  • Multimodal module (crewai/multimodal/):
    • Image: constructors for URL/file/base64/binary/placeholder; conversion to data URLs and LLM message content.
    • MultipartContent: manage mixed text+images; convert to LLM message parts; text-only extraction and helpers.
  • Core integration:
    • crewai/agent/core.py: add multipart_context: list[str | Image] | MultipartContent | None to include multimodal context in the system prompt.
    • crewai/task.py: add multipart_context: list[str | Image] | MultipartContent | None to enrich task context.
  • Tests (tests/multimodal/):
    • Unit tests for Image and MultipartContent; integration checks; standalone runner and validator.

Written by Cursor Bugbot for commit 448c815.

Implements crewAIInc#3860

This PR introduces comprehensive multimodality support for crewAI Agents
and Tasks, enabling them to work with images and mixed text+image content.

## Key Features

### Image Class (multimodal/image.py)
- Support for multiple image sources:
  - HTTP(S) URLs via from_url()
  - Local file paths via from_file() with auto MIME detection
  - Base64 encoded strings via from_base64()
  - Binary data via from_binary()
  - Runtime placeholders via from_placeholder()
- Automatic source type inference
- Conversion to data URLs and LLM message format
- Compatible with OpenAI, Anthropic, and other vision-capable LLMs

### MultipartContent Class (multimodal/multipart_content.py)
- Manage compound context with text and images
- add_text() and add_image() methods for building content
- to_message_content() for LLM API format conversion
- Utility methods: get_text_only(), has_images()

### Agent & Task Integration
- Added multipart_context field to both Agent and Task classes
- Accepts: list[str | Image], MultipartContent, or None
- Enables multimodal system prompts and task inputs

## Usage Examples

## Testing
- Comprehensive unit tests in tests/multimodal/
- test_image.py: 19 test cases for Image class
- test_multipart_content.py: 13 test cases for MultipartContent
- Custom test runner (run_tests.py) for environments without pytest

Signed-off-by: Samaresh Kumar Singh <[email protected]>