We introduce **SEEM** that can **S**egment **E**verything **E**verywhere with **M**ulti-modal prompts all at once.

:grapes: \[[Read our arXiv Paper](https://arxiv.org/pdf/2304.06718.pdf)\] :apple: \[[Try our Demo](http://semantic-sam.xyzou.net:6090/)\]

:fire: **One-Line Demo with Linux:**
```sh
# Clone the repository and launch the demo (the script lives inside the cloned directory).
git clone git@github.com:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && cd Segment-Everything-Everywhere-All-At-Once && sh assets/scripts/run_demo.sh
```
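If you have not set up SSH keys for GitHub, the same one-liner should work with the HTTPS form of the repository URL:

```sh
# HTTPS clone of the same repository; runs the identical demo script.
git clone https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && cd Segment-Everything-Everywhere-All-At-Once && sh assets/scripts/run_demo.sh
```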

:fire: **Getting Started:**

[INSTALL.md](assets/readmes/INSTALL.md) <br>
[DATASET.md](assets/readmes/DATASET.md) <br>
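For reference, a minimal environment sketch is shown below. It assumes a standard conda + pip workflow and a requirements file under `assets/`; the authoritative file names, package versions, and any extra build steps are listed in [INSTALL.md](assets/readmes/INSTALL.md).

```sh
# Hypothetical setup sketch -- follow INSTALL.md for the authoritative steps.
conda create -n seem python=3.9 -y
conda activate seem
# The requirements path below is an assumption; use the file referenced in INSTALL.md.
pip install -r assets/requirements/requirements.txt
```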

<p>
<img src="https://user-images.githubusercontent.com/11957155/233526415-a0a44963-19a3-4e56-965a-afaa598e6127.gif" width="400" />
</p>

## :bookmark_tabs: Catalog
- [x] SEEM Demo
- [x] Inference and Installation Code
- [x] Evaluation Code
- [x] Training Code

:fire: **Related projects:**

* [FocalNet](https://github.com/microsoft/FocalNet) and [DaViT](https://github.com/dingmyu/davit) : We used FocalNet and DaViT as the vision backbones.
* [UniCL](https://github.com/microsoft/UniCL) : We used the unified contrastive learning technique to learn image-text representations.
* [X-Decoder](https://github.com/microsoft/X-Decoder) : We built SEEM on top of X-Decoder, a generalist decoder that can perform multiple tasks with a single model.

:fire: **Other projects you may find interesting:**
* [Semantic-SAM](https://github.com/UX-Decoder/Semantic-SAM) : A universal image segmentation model that can segment and recognize anything at any desired granularity.
* [OpenSeed](https://github.com/IDEA-Research/OpenSeeD) : Strong open-set segmentation methods.
* [Grounding SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) : Combining Grounding DINO and Segment Anything; [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO): A strong open-set detection model.
* [X-GPT](https://github.com/microsoft/X-Decoder/tree/xgpt) : Conversational Visual Agent supported by X-Decoder.
* [LLaVA](https://github.com/haotian-liu/LLaVA) : Large Language and Vision Assistant.
## :bulb: Highlights
Inspired by the appealing universal interface of LLMs, we advocate a universal, interactive multi-modal interface for any type of segmentation with **ONE SINGLE MODEL**. We emphasize **4** important features of **SEEM** below.
1. **Versatility**: works with various types of prompts, for example, clicks, boxes, polygons, scribbles, texts, and referring images;

In the following figure, we compare the levels of interaction and semantics of SEEM and the concurrent work SAM.
<!-- This figure shows a comparison of our model with concurrent work SAM on the level of interactions and semantics. The x-axis and y-axis denote the level of interaction and semantics, respectively. Three segmentation tasks are shown, including Open-set Segmentation, Edge detection, and Interactive Segmentation. These tasks have different levels of interactions and semantics. For example, Open-set Segmentation usually requires a high level of semantics and does not require interaction. Compared with SAM, our model covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types like points and boxes, while misses high-semantic tasks since it does not output semantic labels itself. Note that although we do not report edge detection results, our model can support it by simply converting masks to edges. -->
## :cupid: Acknowledgements
- We appreciate Hugging Face for the GPU support on the demo!
<!-- ## Citation (update when paper is available on arxiv)
If you find this project helpful for your research, please consider citing the following BibTeX entry.
```BibTex
``` -->