MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

Tongyi Lab, Alibaba
Qualitative comparison
Qualitative comparison on video object segmentation. (a) and (c) show the results from SAM2, while (b) and (d) are from our MoSAM, which performs better in hard cases such as object disappearance and occlusion. Red boxes mark incorrect segmentation or missed objects, and green boxes indicate accurate segmentation.

Abstract

The recent Segment Anything Model 2 (SAM2) has demonstrated exceptional capabilities in interactive object segmentation for both images and videos. However, as a foundation model for interactive segmentation, SAM2 performs segmentation directly from the mask memory of the past six frames, which leads to two significant challenges. First, during video inference, the model may lose objects because SAM2 relies solely on memory and does not account for object motion, limiting its long-range tracking capability. Second, its memory is built from a fixed set of past frames, so potentially inaccurate segmentation results stored in memory make it vulnerable to object disappearance and occlusion.

To address these problems, we present MoSAM, which incorporates two key strategies to integrate object motion cues into the model and to build a more reliable feature memory. First, we propose Motion-Guided Prompting (MGP), which represents object motion in both sparse and dense manners and injects it into SAM2 through a set of motion-guided prompts. MGP enables the model to shift its focus toward the direction of motion, thereby enhancing object tracking. Furthermore, acknowledging that past segmentation results may be inaccurate, we devise a Spatial-Temporal Memory Selection (ST-MS) mechanism that dynamically identifies frames likely to contain accurate segmentation at both the pixel and frame levels. By removing potentially inaccurate mask predictions from memory, we can leverage more reliable memory features to exploit similar regions and improve segmentation results. Extensive experiments on various video object segmentation and video instance segmentation benchmarks demonstrate that MoSAM achieves state-of-the-art results compared with other competitors.
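To make the sparse motion cue concrete, below is a minimal sketch of how a motion-guided point prompt could be forecast from recent mask predictions. The constant-velocity model, the use of mask centroids, and all names here (mask_centroid, forecast_point_prompt) are illustrative assumptions, not the MoSAM implementation.

# A toy sketch of the sparse motion cue behind Motion-Guided Prompting (MGP).
# The constant-velocity model, the use of mask centroids, and all names below
# are illustrative assumptions, not the authors' implementation.
import numpy as np

def mask_centroid(mask: np.ndarray) -> np.ndarray:
    """Return the (x, y) centroid of a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def forecast_point_prompt(past_masks):
    """Extrapolate the object centroid to the next frame (constant velocity).

    The forecast location can be fed back to the model as a positive point
    prompt, biasing attention toward the direction of motion.
    """
    centroids = [mask_centroid(m) for m in past_masks if m.any()]
    if len(centroids) < 2:
        return None                              # not enough history to estimate motion
    velocity = centroids[-1] - centroids[-2]     # per-frame displacement
    return centroids[-1] + velocity              # predicted centroid at frame t+1

A dense counterpart could, for example, shift the last predicted mask or box by the same displacement before encoding it as an additional prompt.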

Method

Method overview

We present MoSAM, a unified framework that synergistically integrates Motion-Guided Prompting (MGP) and Spatial-Temporal Memory Selection (ST-MS) to enhance motion-aware segmentation with reliable memory management.

To provide motion cues that facilitate better object tracking and segmentation, MGP captures the motion representation in both sparse and dense manners and then forecasts the subsequent object location as future prompts. Considering that the SAM2 memory bank may contain unreliable frame features in which the object is missing, ST-MS adaptively selects more reliable frame features to update the memory bank, using confidence at both the spatial and temporal levels; a rough sketch of this selection step is given below.
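The following is a minimal sketch of the memory selection idea under simple assumptions: frame-level reliability is approximated by the model's predicted IoU score and pixel-level reliability by the mean foreground confidence. The scoring rule, thresholds, and function names are illustrative, not the authors' implementation.

# A toy sketch of Spatial-Temporal Memory Selection (ST-MS). The scoring rule,
# the 0.5 thresholds, and all names below are illustrative assumptions only.
import torch

def pixel_level_score(mask_logits: torch.Tensor) -> torch.Tensor:
    """Spatial reliability: mean confidence of pixels predicted as foreground."""
    probs = torch.sigmoid(mask_logits)
    fg = probs > 0.5
    if fg.sum() == 0:
        return torch.tensor(0.0)   # empty mask: likely disappearance or occlusion
    return probs[fg].mean()

def select_memory_frames(frame_feats, mask_logits_per_frame, pred_ious,
                         num_keep=6, min_iou=0.5):
    """Keep the num_keep most reliable past frames for the memory bank.

    Each candidate frame is scored by combining a frame-level cue (predicted IoU)
    with a pixel-level cue (foreground confidence); low-confidence frames are
    dropped instead of entering memory.
    """
    scores = []
    for logits, iou in zip(mask_logits_per_frame, pred_ious):
        score = float(iou) * float(pixel_level_score(logits))
        scores.append(score if iou >= min_iou else 0.0)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = [i for i in order[:num_keep] if scores[i] > 0.0]
    return [frame_feats[i] for i in keep]

In this sketch, frames whose predicted masks look unreliable never enter the memory bank, so later frames are conditioned on cleaner reference features.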

Videos

Video segmentation results of SAM2

Video segmentation results of our MoSAM

Video segmentation results of SAM2

Video segmentation results of our MoSAM

Video segmentation results of SAM2

Video segmentation results of our MoSAM

Results

Results comparison

Performance comparison between the baseline SAM2 and MoSAM across model sizes, including Tiny (-T), Small (-S), Base Plus (-B+), and Large (-L). Gains over the baseline achieved by our method are shown in red.

BibTeX

@article{arxiv_coming_soon,
    author  = {Qiushi Yang and Yuan Yao and Miaomiao Cui and Liefeng Bo},
    title   = {MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection},
    journal = {arXiv},
    year    = {2025},
}