This paper presents **SEEM**, a model that can Segment Everything Everywhere all at once. SEEM lets users segment an image with prompts of different types, including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio). It can also handle any combination of prompts and generalize to custom prompts.
We emphasize four important features of **SEEM** here.
1. Versatility: works with various types of prompts;
2. Compositionality: handles any composition of prompts;
3. Interactivity: supports multi-round interactions with a human, because **SEEM** has a memory prompt that stores the session history;
4. Semantic awareness: assigns a semantic label to every predicted mask.
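
To make the first three features concrete, here is a minimal Python sketch of how typed prompts could be mixed in one request and accumulated across rounds. All names (`PointPrompt`, `BoxPrompt`, `TextPrompt`, `Session`, `interact`) are hypothetical and are not part of any SEEM API. The sketch stores raw prompts in a plain list, which is only an assumption for illustration: it does not model how the memory prompt is actually represented, and it runs no segmentation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


# Hypothetical prompt types, for illustration only.
@dataclass
class PointPrompt:
    xy: Tuple[int, int]  # pixel coordinates of a click


@dataclass
class BoxPrompt:
    xyxy: Tuple[int, int, int, int]  # box corners (x0, y0, x1, y1)


@dataclass
class TextPrompt:
    text: str  # free-form language prompt


Prompt = Union[PointPrompt, BoxPrompt, TextPrompt]


@dataclass
class Session:
    # Assumption: history is kept as one list of prompts per round.
    memory: List[List[Prompt]] = field(default_factory=list)

    def interact(self, prompts: List[Prompt]) -> List[Prompt]:
        """Record one round and return every prompt seen so far."""
        self.memory.append(list(prompts))
        return [p for round_prompts in self.memory for p in round_prompts]


session = Session()
# Round 1: a visual prompt and a language prompt composed together.
round1 = session.interact([PointPrompt((10, 20)), TextPrompt("dog")])
# Round 2: a new box prompt, combined with the stored history.
round2 = session.interact([BoxPrompt((0, 0, 5, 5))])
```

In this sketch, each call to `interact` sees the prompts from all earlier rounds, which is the role the feature list gives to the memory prompt.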