lukebuehler 5 days ago

I just started to look into multi-modal embedding models recently, and I was surprised how few options there are.

For example, Google's model only supports 30 text tokens [1]!!

This is definitely a welcome addition.

Any pointers to similarly powerful embedding models? I'm looking specifically for text and images. I wish there were also one that could handle audio and video, but I don't think that exists.

[1] https://cloud.google.com/vertex-ai/generative-ai/docs/embedd...

mahjongmen 5 days ago

Hey Luke, our model does exceptionally well on text and images, particularly when the two are mixed together. One example is e-commerce, where you may have a product title, a description, and multiple images of the product. When you combine those into a single payload using our inputs parameter, we find that the model responds really well to additional images (i.e. retrieval quality improves as you add 1, 2, 3, ..., N images). As you pointed out with Google's multimodal model, most jointly trained multimodal embedding models suffer in the text modality. Amazon used to have a multimodal embedding model, which also accepted only a very small text payload. We're thinking about audio and video as well, but nothing for Q2 at least.
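
For readers wondering what a single mixed payload like the one described above might look like, here is a minimal sketch. Only the `inputs` parameter name comes from the comment. The `content` list, the `type` fields, the data-URI image encoding, and the `build_mixed_input` helper are all assumptions made for illustration, not the vendor's documented schema, so check the actual API reference before relying on any of it.

```python
import base64


def build_mixed_input(title, description, images, mime="image/jpeg"):
    """Bundle product text and images into one item (hypothetical schema)."""
    # Text parts first: title, then description.
    content = [
        {"type": "text", "text": title},
        {"type": "text", "text": description},
    ]
    # Each image is appended as a base64 data URI (assumed encoding).
    for raw in images:
        encoded = base64.b64encode(raw).decode("ascii")
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            }
        )
    return {"content": content}


# Placeholder bytes stand in for real image files.
product_images = [b"fake-image-bytes-1", b"fake-image-bytes-2"]

payload = {
    # "inputs" is the parameter named in the comment; everything nested
    # under it is an illustrative assumption.
    "inputs": [
        build_mixed_input(
            "Trail running shoe",
            "Lightweight shoe with a waterproof upper.",
            product_images,
        )
    ]
}

print(len(payload["inputs"][0]["content"]), "parts in one input")
```

The point of the sketch is that the title, description, and every image travel together as one input and so yield one embedding for the product, rather than one embedding per field.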