Abstract: This paper proposes a novel framework utilizing multimodal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLMbased methods commonly struggle with ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results