I wish "multimodal" implied text, image, and audio (and potentially video). If a model supports only image generation or image analysis, "vision model" seems the more appropriate term.
We should aim to distinguish multimodal models such as Qwen2.5-Omni from vision models such as Qwen2.5-VL.
In this sense, Ollama's new engine adds vision support.
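For reference, here is a minimal sketch of what that vision support looks like from Ollama's official Python client. The `qwen2.5vl` model tag is my assumption; substitute whatever vision model you have pulled locally.

```python
# Minimal sketch of Ollama's vision (image) support via the official
# Python client (pip install ollama). The model tag "qwen2.5vl" is an
# assumption; swap in any vision model you have pulled locally.
import ollama

response = ollama.chat(
    model="qwen2.5vl",  # assumed tag, e.g. pulled via `ollama pull qwen2.5vl`
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            # The chat API accepts local image paths (or raw bytes) in the
            # per-message "images" list. As far as I can tell, there is no
            # equivalent field for audio or video inputs.
            "images": ["example.jpg"],
        }
    ],
)
print(response["message"]["content"])
```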
I'm very interested in working with video inputs. Is it possible to do that with Qwen2.5-Omni and Ollama?
I have only tested Qwen2.5-Omni for audio, and it was hit-and-miss for my use case of tagging audio.