Holo1.5-7B: Localization VLM Demo
This demo showcases Holo1.5-7B, a new version of the Action Vision-Language Model developed by HCompany, fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct. It is designed to perform complex navigation tasks across Web, Android, and Desktop interfaces.
How to use:
- Upload an image (e.g., a screenshot of a UI, see example below).
- Provide a target UI element (e.g., "Docs tab").
- The model will predict the coordinates of the element on the screenshot. Note that the model's processor resizes your input image, so the predicted coordinates are relative to the resized image, not the original (see the sketch after this list for mapping them back).
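Below is a minimal, self-contained inference sketch of this workflow. Assumptions not confirmed by this README: the Hub repo id `Hcompany/Holo1.5-7B`, the localization prompt wording (a placeholder, not the demo's exact prompt), and the resize limits in the helper, which mirror Qwen2.5-VL's "smart resize" rounding with commonly cited defaults that may differ in your transformers version.

```python
import math

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Hcompany/Holo1.5-7B"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("screenshot.png")  # your UI screenshot
target = "Docs tab"                   # the element to localize

# Placeholder prompt; the demo's exact prompt wording is not shown in this README.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",
         "text": f"Localize the following UI element and answer with its coordinates: {target}"},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # coordinates, expressed in the resized image's frame


# The predictions are relative to the processor-resized image. This helper
# reimplements Qwen2.5-VL-style smart-resize rounding (dimensions snapped to
# multiples of 28, total pixels clamped between min_pixels and max_pixels);
# the default limits below are assumptions, not values taken from this README.
def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=28 * 28 * 1280):
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def to_original_coords(x, y, image):
    """Rescale a prediction from the resized frame back to the original image."""
    new_h, new_w = smart_resize(image.height, image.width)
    return x * image.width / new_w, y * image.height / new_h
```

If the model answers with, say, an `(x, y)` point in the resized frame, `to_original_coords` (a hypothetical helper defined above) converts it back so you can draw it on the original screenshot.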
Examples
| Input UI Image | Component |
| --- | --- |