# 👀 *SEEM:* Segment Everything Everywhere All at Once

:apple: \[[Demo Route 1](https://ab79f1361bb060f6.gradio.app)\] :orange: \[[Demo Route 3](https://28d88f3bc59955d5.gradio.app)\] :kiwi_fruit: \[[Demo Route 4](https://ddbd9f45c9f9af07.gradio.app)\] :grapes: \[[ArXiv](https://arxiv.org/pdf/2212.11270.pdf)\]

We introduce **SEEM**, which can **S**egment **E**verything **E**verywhere with **M**ulti-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types, including visual prompts (points, marks, boxes, scribbles, and image segments) and language prompts (text and audio). It can also work with any combination of prompts, or generalize to custom prompts!

## :bulb: Highlights

We emphasize **4** important features of **SEEM**:

1. **Versatility**: works with various types of prompts, for example, clicks, boxes, polygons, scribbles, text, and referring images;
2. **Compositionality**: handles any composition of prompts;
3. **Interactivity**: interacts with the user over multiple rounds, because **SEEM** has a memory prompt to store the session history;
4. **Semantic awareness**: gives a semantic label to any predicted mask.

A brief introduction to all the generic and interactive segmentation tasks we can do. Try them in the demos linked above.

## 🔥 Click, scribble to mask

With a simple click or stroke from the user, we can generate the masks and the corresponding category labels for them.

## 🔥 Text to mask

SEEM can generate the mask from the user's text input, providing multi-modal interaction with humans.

## 🔥 Referring image to mask

With a simple click or stroke on the referring image, the model is able to segment objects with similar semantics in the target images.

SEEM seems to understand spatial relationships very well. Look at the three zebras!

SEEM also seems to understand the oil pastel paintings painted by :chipmunk:

## 🔥 Audio to mask

We use Whisper to turn audio into a text prompt to segment the object. Try it in our demo!
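The audio-to-mask front end above can be sketched with the open-source `openai-whisper` package: speech is transcribed to text, and the transcript becomes SEEM's text prompt. This is a minimal illustration, not SEEM's actual pipeline code; the file name, model size, and function names below are assumptions for the example.

```python
# Sketch of the audio-to-mask front end: Whisper turns speech into text,
# and the cleaned transcript is then used as SEEM's text prompt.
# The audio path, model size, and helper names are illustrative assumptions.

def normalize_prompt(transcript: str) -> str:
    """Clean a raw transcript into a text prompt (trim whitespace, lowercase)."""
    return transcript.strip().lower()

def audio_to_text_prompt(audio_path: str, model_size: str = "base") -> str:
    """Transcribe an audio file with openai-whisper and return a text prompt."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_size)   # load a pretrained Whisper model
    result = model.transcribe(audio_path)    # run speech-to-text
    return normalize_prompt(result["text"])  # e.g. "the zebra on the left"

# Usage (requires openai-whisper and a recorded audio file):
#   text_prompt = audio_to_text_prompt("prompt.wav")
#   ...then pass text_prompt to SEEM as a text prompt.
```

The transcript is treated exactly like a typed text prompt, which is what lets the audio modality reuse SEEM's existing text-prompt path.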