Holo1.5-7B: Localization VLM Demo

This demo showcases Holo1.5-7B, a new version of the Action Vision-Language Model developed by HCompany, fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct. It is designed to perform complex navigation tasks across Web, Android, and Desktop interfaces.

How to use:

  1. Upload an image (e.g., a screenshot of a UI).
  2. Provide a target UI element (e.g., "Docs tab").
  3. The model predicts the coordinates of the target element on the screenshot. Note that the model's processor resizes your input image, so the returned coordinates are relative to the resized image, not the original (see the sketch after this list for mapping them back).
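
For running the model outside this demo, here is a minimal inference sketch. The repo id `Hcompany/Holo1.5-7B`, the prompt wording, and the output parsing are assumptions (check the model card for the exact prompt the demo uses); the Qwen2.5-VL loading classes follow from the base model named above.

```python
# Minimal localization sketch. Assumptions: the model is published as
# "Hcompany/Holo1.5-7B" and loads with the standard Qwen2.5-VL classes,
# since it is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct. The prompt
# format below is illustrative, not the demo's exact prompt.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Hcompany/Holo1.5-7B"  # assumed repo id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("screenshot.png")
target = "Docs tab"  # the UI element to localize

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text",
             "text": f"Click on the element: {target}. Answer with coordinates."},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # coordinates relative to the processor-resized image

# Map predicted coordinates back to the original screenshot. The resized
# resolution can be recovered from image_grid_thw: the Qwen2.5-VL vision
# encoder uses 14x14-pixel patches, so resized size = grid size * 14.
_, grid_h, grid_w = inputs["image_grid_thw"][0].tolist()
resized_w, resized_h = grid_w * 14, grid_h * 14
scale_x = image.width / resized_w
scale_y = image.height / resized_h
# For a predicted point (x, y), the original-image point is
# (x * scale_x, y * scale_y).
```

The rescaling step matters because the processor's smart resize snaps images to multiples of the patch grid; coordinates the model emits refer to that resized canvas, so clicking at them on the original screenshot requires the scale correction above.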