From d3a0b67d8dc30face6f10641bf011447288abeb0 Mon Sep 17 00:00:00 2001
From: MaureenZOU
Date: Wed, 4 Oct 2023 15:22:34 -0500
Subject: [PATCH] change image path

---
 README.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index ba1cf8b..1b60774 100644
--- a/README.md
+++ b/README.md
@@ -67,7 +67,7 @@ Inspired by the appealing universal interface in LLMs, we are advocating a unive
3. **Interactivity**: interact with the user in multiple rounds, thanks to the memory prompt of **SEEM** that stores the session history;
4. **Semantic awareness**: give a semantic label to any predicted mask;
-![SEEM design](assets/imagesteaser_new.png?raw=true)
+![SEEM design](assets/images/teaser_new.png?raw=true)
A brief introduction to all the generic and interactive segmentation tasks we can do.
## :unicorn: How to use the demo
@@ -82,7 +82,7 @@ A brief introduction of all the generic and interactive segmentation tasks we ca
An example of Transformers. The referred image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images no matter which form it is in. Thanks to Hongyang Li for this fun example.
-assets/imagestransformers_gh.png
+assets/images/transformers_gh.png
## :tulip: NERF Examples
@@ -95,32 +95,32 @@ An example of Transformers. The referred image is the truck form of Optimus Prim
## :camping: Click, scribble to mask
With a simple click or stroke from the user, we can generate the masks and the corresponding category labels for it.
-![SEEM design](assets/imagesclick.png?raw=true)
+![SEEM design](assets/images/click.png?raw=true)
## :mountain_snow: Text to mask
SEEM can generate the mask with text input from the user, providing multi-modal interaction with humans.
-![example](assets/imagestext.png?raw=true)
+![example](assets/images/text.png?raw=true)
## :mosque: Referring image to mask
With a simple click or stroke on the referring image, the model is able to segment the objects with similar semantics in the target images.
-![example](assets/imagesref_seg.png?raw=true)
+![example](assets/images/ref_seg.png?raw=true)
SEEM understands spatial relationships very well. Look at the three zebras! The segmented zebras have positions similar to those of the referred zebras. For example, when the leftmost zebra is referred to on the upper row, the leftmost zebra on the bottom row is segmented.
-![example](assets/imagesspatial_relation.png?raw=true)
+![example](assets/images/spatial_relation.png?raw=true)
## :blossom: Referring image to video mask
No training on video data is needed: SEEM works perfectly for you to segment videos with whatever queries you specify!
-![example](assets/imagesreferring_video_visualize.png?raw=true)
+![example](assets/images/referring_video_visualize.png?raw=true)
## :sunflower: Audio to mask
We use Whisper to turn audio into a text prompt to segment the object. Try it in our demo!
-assets/imagesaudio.png
+assets/images/audio.png
@@ -128,30 +128,30 @@ We use Whisper to turn audio into text prompt to segment the o
## :deciduous_tree: Examples of different styles
An example of segmenting a meme.
-assets/imagesemoj.png
+assets/images/emoj.png
An example of segmenting trees in cartoon style.
-assets/imagestrees_text.png
+assets/images/trees_text.png
An example of segmenting a Minecraft image.
-assets/imagesminecraft.png
+assets/images/minecraft.png
An example of using a referring image on a popular teddy bear.
-![example](assets/imagesfox_v2.png?raw=true)
+![example](assets/images/fox_v2.png?raw=true)
## Model
-![SEEM design](assets/imagesmodel.png?raw=true)
+![SEEM design](assets/images/model.png?raw=true)
## Comparison with SAM
In the following figure, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set segmentation, and interactive segmentation). Open-set segmentation usually requires a high level of semantics and does not require interaction. Compared with [SAM](https://arxiv.org/abs/2304.02643), SEEM covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types like points and boxes, while it misses high-semantic tasks since it does not output semantic labels itself. The reasons are twofold: First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space. As a consequence, SEEM can support more general usage and has the potential to extend to custom prompts. Second, SEEM works very well on text to mask (grounding segmentation) and outputs semantic-aware predictions.
-assets/imagescompare.jpg
+assets/images/compare.jpg
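The "Audio to mask" section above says that Whisper turns the user's audio into a text prompt for segmentation. As a minimal sketch of that transcription step only, assuming the `openai-whisper` package and a placeholder audio file (this is not the demo's actual pipeline code):

```python
# Hypothetical sketch: transcribe an audio clip into a text prompt with Whisper.
# "prompt.wav" and the "base" model size are placeholders, not values from the repo.
import whisper

model = whisper.load_model("base")
result = model.transcribe("prompt.wav")
text_prompt = result["text"].strip()
print(text_prompt)  # the transcript would then be used as a text prompt for segmentation
```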
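The "Comparison with SAM" paragraph attributes SEEM's generality to a unified prompt encoder that maps all visual and language prompts into a joint representation space. The toy PyTorch sketch below illustrates that idea only; the class name, prompt types, and dimensions are hypothetical and are not SEEM's implementation:

```python
import torch
import torch.nn as nn

class ToyUnifiedPromptEncoder(nn.Module):
    """Toy illustration (not SEEM's code): every prompt type is projected into
    the same d-dimensional joint space, so a single mask decoder can attend to
    clicks, boxes, and text uniformly."""
    def __init__(self, d_model: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.point_proj = nn.Linear(2, d_model)                 # (x, y) click
        self.box_proj = nn.Linear(4, d_model)                   # (x1, y1, x2, y2) box
        self.text_embed = nn.EmbeddingBag(vocab_size, d_model)  # bag of token ids

    def forward(self, points=None, boxes=None, text_ids=None):
        tokens = []
        if points is not None:
            tokens.append(self.point_proj(points))
        if boxes is not None:
            tokens.append(self.box_proj(boxes))
        if text_ids is not None:
            tokens.append(self.text_embed(text_ids))
        # All prompts now live in one joint space and can be stacked into a
        # single sequence of prompt tokens.
        return torch.stack(tokens, dim=1)

encoder = ToyUnifiedPromptEncoder()
out = encoder(points=torch.rand(1, 2), boxes=torch.rand(1, 4),
              text_ids=torch.randint(0, 1000, (1, 5)))
print(out.shape)  # torch.Size([1, 3, 256])
```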
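Finally, the patch itself inserts the missing "/" between the `assets/images` directory and each file name by hand. A one-off script such as the following could make the same edit mechanically; it is a hypothetical helper, not part of the original commit, and the regex only covers the simple relative paths seen in this README:

```python
import re
from pathlib import Path

readme = Path("README.md")
text = readme.read_text(encoding="utf-8")

# Insert the missing "/" after the images directory, e.g.
# assets/imagesteaser_new.png -> assets/images/teaser_new.png,
# while leaving already-correct assets/images/... paths untouched.
fixed = re.sub(r"assets/images(?!/)([\w.\-]+)", r"assets/images/\1", text)

readme.write_text(fixed, encoding="utf-8")
```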