mkagenius 3 days ago

VLMs are great - I have been able to use them for a similar project too [1]. And they're only going to get better. Congratulations on the product launch! What's your VLM for this?

[1] A framework to use/control mobile phones via any LLM - https://github.com/BandarLabs/clickclickclick

marcon680 3 days ago

We fine-tune our own VLMs for this -- unfortunately, we'd prefer not to share which ones specifically! ClickClickClick looks awesome. Have you heard of FerretUI (https://arxiv.org/pdf/2404.05719)? Pretty similar idea.

mkagenius 3 days ago

Yes, I tried a similar one called "omniparser" - the issue was that it sometimes failed to annotate some UI elements. Moreover, Gemini and Molmo worked right out of the box without needing any fine-tuning.

xnx 2 days ago

I'm surprised you named your framework clickclickclick instead of taptaptap.