OpenAI is rolling out multimodal ChatGPT to all Plus users, a feature first announced at the GPT-4 unveiling last year.
This lets Plus users upload all kinds of data to give ChatGPT more context in a chat. For example, I can upload a photo of a leaky faucet, draw a circle around the part I want ChatGPT to focus on, and ask, “how should I fix this?”
Kagi Assistant is already sort of multimodal in that I can attach a YouTube video, web article, podcast recording, text file, PDF, and more, but it can’t interpret the visual content of images the way ChatGPT does.
Assistant could implement this using open-source vision-language models, such as CLIP for matching images against text descriptions, or an image-captioning model like BLIP that generates a textual description of an image. (Stable Diffusion goes the other direction, generating images from text, so it wouldn’t help here.)
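As a rough sketch of what that could look like, here’s how an open-source captioning model could turn an uploaded photo into text the assistant can reason over. This is just my own illustration, not anything Kagi has announced; the model choice (BLIP via Hugging Face transformers) and the file name are assumptions:

```python
# A minimal sketch: caption an uploaded image with an open-source model,
# then feed the caption into the chat as extra text context.
# Assumes `pip install transformers torch pillow`; the file name is made up.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("leaky_faucet.jpg").convert("RGB")  # hypothetical upload
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# The caption becomes ordinary text context, the same way an attached
# article or PDF gets injected into the conversation.
print(f"Image context: {caption}")
```

A caption is obviously a lossy substitute for the pixel-level understanding GPT-4’s vision model has, but it would get Assistant partway there without needing a proprietary model.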