Compare commits

...

62 Commits

Author SHA1 Message Date
Ren Tianhe 856dde20ae
Grounded SAM 2 Release 2024-08-12 16:52:02 +08:00
Piotr Skalski 5a890bd867
Merge pull request #342 from ethanlee928/main
fix Supervision depreciation of BoxAnnotator
2024-07-24 08:59:41 +02:00
Piotr Skalski e49e881edd
Merge branch 'main' into main 2024-07-24 08:58:19 +02:00
ethanlee928 8b6a55f612 replaced BoundingBoxAnnotator with BoxAnnotator, updated Supervision version 2024-07-23 23:19:52 +08:00
Piotr Skalski e27a646ca0
Update requirements.txt
`supervision==0.22.0` is deprecating `BoxAnnotator`. I'm freezing the `supervision` version to prevent any problems.
2024-07-12 12:27:24 +02:00
ethanlee928 d75c95daf6 fix Supervision depreciation of BoxAnnotator 2024-06-29 01:10:48 +08:00
Ren Tianhe df5b48a3ef
Update README.md 2024-05-23 20:10:37 +08:00
Ren Tianhe 4330960fa7
Grounding DINO 1.5 Release 2024-05-18 13:36:18 +08:00
JunX 16e0ccdb7d
Update gradio_app.py (#318) 2024-04-22 16:01:29 +08:00
Ikko Eltociear Ashimine 3a2b344737
Update README.md (#322)
Performancce -> Performance
2024-04-14 13:57:39 +08:00
Rohan Manzoor c023468faf
Added Dockerfile along with a file to test Docker (#307) 2024-03-11 16:41:39 +08:00
ASHWIN UNNIKRISHNAN d13643262e
Update inference.py (#298) 2024-02-23 15:10:00 +08:00
Mehmet Deniz Birlikci 2b62f419c2
Update setup.py (#269) 2023-12-31 09:22:45 +08:00
Hardik Dava 27024e42da
Update requirements.txt (#265) 2023-12-19 23:33:25 +08:00
Songming Liu 16e6b4bfcf
Fix an incorrect link in README (#254) 2023-11-25 22:07:07 +08:00
sdy623 03198a2a79
Add the environment.yaml for Anaconda3 (#229) 2023-11-13 13:53:24 +08:00
Kazuto Murase fbb2532bb0
eval empty token_spans properly (#191) 2023-11-13 13:53:12 +08:00
Shiyu eeba084341
[fix] replace ema_model with model in demo/test_ap_on_coco (#242) 2023-11-13 13:52:37 +08:00
jishnujp-vp 60d796825e
decoupled image processing from the main flow (#160) 2023-07-22 22:08:59 -07:00
Tony Wang 5bb6543346
Readme: Add more Installation details (#177)
* test functionality

* add more steps for installation, so the CUDA_HOME can be set correctly
2023-07-22 22:08:41 -07:00
Shilong Liu b520c15790
Update README.md
add semantic sam
2023-07-18 14:20:51 -07:00
Ren Tianhe 6c27bc76b9
Update README.md 2023-06-29 17:03:15 +08:00
SlongLiu c4c2d69fb4 fix readme for phrase grounding mode 2023-06-29 14:14:28 +08:00
SlongLiu 2452fa38d5 Merge branch 'main' of https://github.com/IDEA-Research/GroundingDINO into main 2023-06-29 14:11:54 +08:00
SlongLiu a0cc07e12f support phrase grounding mode 2023-06-29 14:11:35 +08:00
Ren Tianhe 4605649b77
Update README.md 2023-06-28 00:18:24 +08:00
Ren Tianhe beeb4c29cb
Update README.md 2023-06-20 11:56:10 +08:00
Mohamad Al Mdfaa 9389fa492b
fix: improve phrases2classes implementation (#143)
This commit improves the phrases2classes implementation by using a regular expression to match sub-phrases in the phrases list. This makes the implementation more accurate and efficient.
2023-06-17 02:36:16 -07:00
Shilong Liu 16292e162d
support coco evaluation (#149) 2023-06-17 02:31:07 -07:00
Ren Tianhe 4e6f23d35c
Add logo for Grounding-DINO (#144) 2023-06-13 22:00:19 -07:00
Ren Tianhe 6225f464da
Add logo file 2023-06-14 12:27:55 +08:00
HaoRan-hash 9a96ef055c
Solve combined categories (#125)
* Update inference.py
2023-06-07 11:48:08 -07:00
Piotr Skalski 31aa788a3c
🛠️ Fixing typos in README.md 2023-05-23 20:25:50 +02:00
Liu, Hao 427aebd59a
<Feat>: use local transformer model (#110)
<Detail>:

<Footer>:
2023-05-22 15:10:04 +08:00
Karim Umar 39b1472457
minor typo in README (#99)
Co-authored-by: root <root@vmi1286032.contaboserver.net>
2023-05-12 21:43:47 +08:00
Ren Tianhe 654f5e8bf9
Highlight DetGPT 2023-05-10 11:03:10 +08:00
Ren Tianhe 67bb0b634a
Refine README (#89)
* refine readme

* refine
2023-05-06 16:40:39 +08:00
Ren Tianhe 88a8cd6258
Update Citation 2023-05-06 15:56:00 +08:00
Ren Tianhe db4e6d9680
Merge pull request #87 from darshats/main
create "." separated caption
2023-05-04 01:56:40 +08:00
Darshat Shah 168d65d5c4 create "." separated caption 2023-05-03 23:23:32 +05:30
rentainhe a4dcf5d411 fix bug 2023-05-02 19:41:34 +08:00
Ren Tianhe 0dc5ece5a2
Merge pull request #40 from eltociear/patch-1
Update README.md
2023-05-02 17:40:10 +08:00
Ren Tianhe 55d5f31b70
Merge pull request #77 from darshats/main
use model.device when calling legacy predict
2023-05-02 17:36:30 +08:00
Ren Tianhe 562643e178
Merge pull request #79 from pooya-mohammadi/main
Move GroundingDINO_SwinB.cfg.py to GroundingDINO_SwinB_cfg.py
2023-05-02 17:33:06 +08:00
pooya-mohammadi 92766784b0 Move GroundingDINO_SwinB.cfg.py to GroundingDINO_SwinB_cfg.py 2023-04-27 23:11:42 +04:30
Darshat Shah ff94310921 use model.device when calling legacy predict 2023-04-27 12:15:11 +05:30
Ren Tianhe 498048b1b2
Merge pull request #76 from ahmedosman2001/main
Updated README.md
2023-04-26 22:36:10 +08:00
ahmedosman2001 d851b00ed0
Merge pull request #1 from ahmedosman2001/ahmedosman2001-patch-1
Updated README.md
2023-04-26 13:58:37 +01:00
ahmedosman2001 b091a5bb20
Updated README.md
Improved installation and usage instructions.
2023-04-26 13:53:00 +01:00
Piotr Skalski da9f1c0751
Bump `supervision` version to `0.6.0`. 2023-04-21 18:37:17 +02:00
Piotr Skalski 95e0123a14
Add link to Accelerate Image Annotation with SAM and Grounding DINO | Python Tutorial 2023-04-20 21:10:18 +02:00
Dowon 57535c5a79
fix: setup.py TORCH_CUDA_ARCH_LIST (#62) 2023-04-19 11:36:28 +08:00
SlongLiu c43cdb3a95 update cvinw readings 2023-04-15 22:31:21 +08:00
SlongLiu bd61f50091 update tips 2023-04-12 18:40:11 +08:00
SlongLiu dbe0ad8f21 add readme for explainations 2023-04-12 18:11:40 +08:00
Zekun Zhang 049566bdc9
Fix argument parsing bug (#43)
text_threshold was wrongly set by args.box_threshold
2023-04-12 17:18:47 +08:00
Luca Medeiros 19e699c635
add init to datasets (#42) 2023-04-12 13:05:27 +08:00
Ikko Eltociear Ashimine 428ef7fab4
Update README.md
Github -> GitHub
2023-04-12 01:09:25 +09:00
Shilong Liu 9dac4c605b
fix windows bugs (#30) 2023-04-09 22:08:36 +08:00
SlongLiu 3bb2c86c9a update readme with gd-swinb hf links 2023-04-08 16:52:18 +08:00
SlongLiu d3bc35fdea update gligen 2023-04-08 16:38:19 +08:00
SlongLiu 15ade007a8 add grounding dino - B 2023-04-07 17:37:00 +08:00
26 changed files with 2058 additions and 101 deletions

BIN
.asset/cat_dog.jpeg 100644 (binary image added; 120 KiB)

Three further binary image files added (354 KiB, 472 KiB, and 456 KiB); their names are not shown in this view.

35
Dockerfile 100644

@ -0,0 +1,35 @@
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
ARG DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda \
TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX" \
SETUPTOOLS_USE_DISTUTILS=stdlib
RUN conda update conda -y
# Install libraries in the brand new image.
RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
build-essential \
git \
python3-opencv \
ca-certificates && \
rm -rf /var/lib/apt/lists/*
# Set the working directory for all the subsequent Dockerfile instructions.
WORKDIR /opt/program
RUN git clone https://github.com/IDEA-Research/GroundingDINO.git
RUN mkdir weights ; cd weights ; wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth ; cd ..
RUN conda install -c "nvidia/label/cuda-12.1.1" cuda -y
ENV CUDA_HOME=$CONDA_PREFIX
ENV PATH=/usr/local/cuda/bin:$PATH
RUN cd GroundingDINO/ && python -m pip install .
COPY docker_test.py docker_test.py
CMD [ "python", "docker_test.py" ]

LICENSE

@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2020 - present, Facebook, Inc
Copyright 2023 - present, IDEA Research.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

287
README.md

@ -1,41 +1,85 @@
# Grounding DINO
<div align="center">
<img src="./.asset/grounding_dino_logo.png" width="30%">
</div>
---
# :sauropod: Grounding DINO
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-mscoco)](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-odinw)](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=grounding-dino-marrying-dino-with-grounded)
Grounding DINO Methods | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO)
**[IDEA-CVR, IDEA-Research](https://github.com/IDEA-Research)**
[Shilong Liu](http://www.lsl.zone/), [Zhaoyang Zeng](https://scholar.google.com/citations?user=U_cvvUwAAAAJ&hl=zh-CN&oi=ao), [Tianhe Ren](https://rentainhe.github.io/), [Feng Li](https://scholar.google.com/citations?user=ybRe9GcAAAAJ&hl=zh-CN), [Hao Zhang](https://scholar.google.com/citations?user=B8hPxMQAAAAJ&hl=zh-CN), [Jie Yang](https://github.com/yangjie-cv), [Chunyuan Li](https://scholar.google.com/citations?user=Zd7WmXUAAAAJ&hl=zh-CN&oi=ao), [Jianwei Yang](https://jwyang.github.io/), [Hang Su](https://scholar.google.com/citations?hl=en&user=dxN1_X0AAAAJ&view_op=list_works&sortby=pubdate), [Jun Zhu](https://scholar.google.com/citations?hl=en&user=axsP38wAAAAJ), [Lei Zhang](https://www.leizhang.org/)<sup>:email:</sup>.
[[`Paper`](https://arxiv.org/abs/2303.05499)] [[`Demo`](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)] [[`BibTex`](#black_nib-citation)]
PyTorch implementation and pretrained models for Grounding DINO. For details, see the paper **[Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)**.
- 🔥 **[Grounded SAM 2](https://github.com/IDEA-Research/Grounded-SAM-2)** is released now, which combines Grounding DINO with [SAM 2](https://github.com/facebookresearch/segment-anything-2) for tracking any object in open-world scenarios.
- 🔥 **[Grounding DINO 1.5](https://github.com/IDEA-Research/Grounding-DINO-1.5-API)** is released now, which is IDEA Research's **Most Capable** Open-World Object Detection Model!
- 🔥 **[Grounding DINO](https://arxiv.org/abs/2303.05499)** and **[Grounded SAM](https://arxiv.org/abs/2401.14159)** are now supported in Hugging Face Transformers. For more convenient use, you can refer to [this documentation](https://huggingface.co/docs/transformers/model_doc/grounding-dino).
## :sun_with_face: Helpful Tutorial
- :grapes: [[Read our arXiv Paper](https://arxiv.org/abs/2303.05499)]
- :apple: [[Watch our simple introduction video on YouTube](https://youtu.be/wxWDt5UiwY8)]
- :blossom: &nbsp;[[Try the Colab Demo](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)]
- :sunflower: [[Try our Official Huggingface Demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)]
- :maple_leaf: [[Watch the Step by Step Tutorial about GroundingDINO by Roboflow AI](https://youtu.be/cMa77r3YrDk)]
- :mushroom: [[GroundingDINO: Automated Dataset Annotation and Evaluation by Roboflow AI](https://youtu.be/C4NqaRBz_Kw)]
- :hibiscus: [[Accelerate Image Annotation with SAM and GroundingDINO by Roboflow AI](https://youtu.be/oEQYStnF2l8)]
- :white_flower: [[Autodistill: Train YOLOv8 with ZERO Annotations based on Grounding-DINO and Grounded-SAM by Roboflow AI](https://github.com/autodistill/autodistill)]
<!-- Grounding DINO Methods |
[![arXiv](https://img.shields.io/badge/arXiv-2303.05499-b31b1b.svg)](https://arxiv.org/abs/2303.05499)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/wxWDt5UiwY8)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/wxWDt5UiwY8) -->
Grounding DINO Demos |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk)
<!-- Grounding DINO Demos |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) -->
<!-- [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk)
[![HuggingFace space](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/C4NqaRBz_Kw)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/oEQYStnF2l8)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/C4NqaRBz_Kw) -->
Extensions | [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb);
[Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
## :sparkles: Highlight Projects
- [Semantic-SAM: a universal image segmentation model to enable segment and recognize anything at any desired granularity.](https://github.com/UX-Decoder/Semantic-SAM),
- [DetGPT: Detect What You Need via Reasoning](https://github.com/OptimalScale/DetGPT)
- [Grounded-SAM: Marrying Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
- [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb)
- [Grounding DINO with GLIGEN for Controllable Image Editing](demo/image_editing_with_groundingdino_gligen.ipynb)
- [OpenSeeD: A Simple and Strong Openset Segmentation Model](https://github.com/IDEA-Research/OpenSeeD)
- [SEEM: Segment Everything Everywhere All at Once](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
- [X-GPT: Conversational Visual Agent supported by X-Decoder](https://github.com/microsoft/X-Decoder/tree/xgpt)
- [GLIGEN: Open-Set Grounded Text-to-Image Generation](https://github.com/gligen/GLIGEN)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
<!-- Extensions | [Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything); [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb); [Grounding DINO with GLIGEN](demo/image_editing_with_groundingdino_gligen.ipynb) -->
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-mscoco)](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-odinw)](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=grounding-dino-marrying-dino-with-grounded)
<!-- Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now! -->
Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now!
## Highlight
## :bulb: Highlight
- **Open-Set Detection.** Detect **everything** with language!
- **High Performancce.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
- **High Performance.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
- **Flexible.** Collaboration with Stable Diffusion for Image Editing.
## News
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything) named [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) aims to support segmentation in GroundingDINO.
## :fire: News
- **`2023/07/18`**: We release [Semantic-SAM](https://github.com/UX-Decoder/Semantic-SAM), a universal image segmentation model to enable segment and recognize anything at any desired granularity. **Code** and **checkpoint** are available!
- **`2023/06/17`**: We provide an example to evaluate Grounding DINO on COCO zero-shot performance.
- **`2023/04/15`**: Refer to [CV in the Wild Readings](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings) for those who are interested in open-set recognition!
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN) for more controllable image editing.
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editing.
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything), named **[Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)**, which aims to support segmentation in GroundingDINO.
- **`2023/03/28`**: A YouTube [video](https://youtu.be/cMa77r3YrDk) about Grounding DINO and basic object detection prompt engineering. [[SkalskiP](https://github.com/SkalskiP)]
- **`2023/03/28`**: Add a [demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo) on Hugging Face Space!
- **`2023/03/27`**: Support CPU-only mode. Now the model can run on machines without GPUs.
@ -46,44 +90,184 @@ Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.0
<summary><font size="4">
Description
</font></summary>
<a href="https://arxiv.org/abs/2303.05499">Paper</a> introduction.
<img src=".asset/hero_figure.png" alt="ODinW" width="100%">
Marrying <a href="https://github.com/IDEA-Research/GroundingDINO">Grounding DINO</a> and <a href="https://github.com/gligen/GLIGEN">GLIGEN</a>
<img src="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/GD_GLIGEN.png" alt="gd_gligen" width="100%">
</details>
## :star: Explanations/Tips for Grounding DINO Inputs and Outputs
- Grounding DINO accepts an `(image, text)` pair as inputs.
- It outputs `900` (by default) object boxes. Each box has similarity scores across all input words (as shown in the figures below).
- By default, we keep the boxes whose highest similarity exceeds the `box_threshold`.
- We then extract the words whose similarities are higher than the `text_threshold` as the predicted labels (see the code sketch below the figures).
- If you want to obtain objects for specific phrases, like the `dogs` in the sentence `two dogs with a stick.`, you can select the boxes with the highest text similarity to `dogs` as the final outputs.
- Note that each word can be split into **more than one** token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
- We suggest separating different category names with `.` for Grounding DINO.
![model_explain1](.asset/model_explan1.PNG)
![model_explain2](.asset/model_explan2.PNG)
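A minimal sketch of this box/word selection, using random placeholder tensors in place of real model outputs (the variable names here are illustrative, not the library's):

```python
import torch

# 900 candidate boxes, each scored against (up to) 256 text tokens
logits = torch.rand(900, 256)   # placeholder word-similarity scores in [0, 1]
boxes = torch.rand(900, 4)      # placeholder normalized cxcywh boxes

box_threshold, text_threshold = 0.35, 0.25

# keep boxes whose best word similarity exceeds box_threshold
keep = logits.max(dim=1).values > box_threshold
kept_logits, kept_boxes = logits[keep], boxes[keep]

# for each kept box, the predicted label is built from tokens above text_threshold
for logit, box in zip(kept_logits, kept_boxes):
    token_mask = logit > text_threshold
    print(box.tolist(), int(token_mask.sum()), "tokens above text_threshold")
```

In the real code the kept token positions are mapped back to words with the tokenizer (see `get_phrases_from_posmap` in `demo/inference_on_a_image.py` below), which is why word and token counts can differ.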
## TODO
## :label: TODO
- [x] Release inference code and demo.
- [x] Release checkpoints.
- [ ] Grounding DINO with Stable Diffusion and GLIGEN demos.
- [x] Grounding DINO with Stable Diffusion and GLIGEN demos.
- [ ] Release training codes.
## Install
## :hammer_and_wrench: Install
If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. It will be compiled in CPU-only mode if CUDA is not available.
**Note:**
0. If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. It will be compiled in CPU-only mode if CUDA is not available.
Please make sure to follow the installation steps strictly; otherwise the program may produce:
```bash
NameError: name '_C' is not defined
```
If this happens, please reinstall GroundingDINO by re-cloning the repository and running all the installation steps again.
#### How to check CUDA:
```bash
echo $CUDA_HOME
```
If it prints nothing, the path has not been set up yet.
Run the following so the environment variable is set in the current shell:
```bash
export CUDA_HOME=/path/to/cuda-11.3
```
Note that the CUDA version should match your CUDA runtime, since multiple CUDA versions may be installed on the same machine.
If you want to set `CUDA_HOME` permanently, store it using:
```bash
echo 'export CUDA_HOME=/path/to/cuda' >> ~/.bashrc
```
After that, source the bashrc file and check `CUDA_HOME`:
```bash
source ~/.bashrc
echo $CUDA_HOME
```
In this example, `/path/to/cuda-11.3` should be replaced with the path where your CUDA toolkit is installed. You can find it by typing `which nvcc` in your terminal.
For instance, if the output is `/usr/local/cuda/bin/nvcc`, then:
```bash
export CUDA_HOME=/usr/local/cuda
```
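As an additional sanity check (assuming PyTorch is already installed), you can confirm that PyTorch sees CUDA and which toolkit version it was built against before compiling the extension:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
nvcc --version   # should report a CUDA version compatible with the one printed above
```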
**Installation:**
1. Clone the GroundingDINO repository from GitHub.
```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
```
2. Change the current directory to the GroundingDINO folder.
```bash
cd GroundingDINO/
```
3. Install the required dependencies in the current directory.
```bash
pip install -e .
```
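After installation, a quick importability check for the compiled CUDA extension; the module path `groundingdino._C` is inferred from the `_C` error message above, so treat this as a sketch rather than an official verification step:

```bash
python -c "from groundingdino import _C; print('CUDA extension loaded')"
```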
## Demo
4. Download pre-trained model weights.
```bash
CUDA_VISIBLE_DEVICES=6 python demo/inference_on_a_image.py \
-c /path/to/config \
-p /path/to/checkpoint \
-i .asset/cats.png \
-o "outputs/0" \
-t "cat ear." \
[--cpu-only] # open it for cpu mode
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
```
## :arrow_forward: Demo
Check your GPU ID (only if you're using a GPU)
```bash
nvidia-smi
```
Replace `{GPU ID}`, `image_you_want_to_detect.jpg`, and `"dir you want to save the output"` with appropriate values in the following command
```bash
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p weights/groundingdino_swint_ogc.pth \
-i image_you_want_to_detect.jpg \
-o "dir you want to save the output" \
-t "chair"
[--cpu-only] # add this flag for CPU-only mode
```
If you would like to specify the phrases to detect, here is a demo:
```bash
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p ./groundingdino_swint_ogc.pth \
-i .asset/cat_dog.jpeg \
-o logs/1111 \
-t "There is a cat and a dog in the image ." \
--token_spans "[[[9, 10], [11, 14]], [[19, 20], [21, 24]]]"
[--cpu-only] # add this flag for CPU-only mode
```
The `token_spans` specify the start and end character positions of phrases. For example, the first phrase is `[[9, 10], [11, 14]]`: `"There is a cat and a dog in the image ."[9:10] = 'a'` and `"There is a cat and a dog in the image ."[11:14] = 'cat'`, so it refers to the phrase `a cat`. Similarly, `[[19, 20], [21, 24]]` refers to the phrase `a dog`.
See the `demo/inference_on_a_image.py` for more details.
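A quick standalone check of how these character spans index into the caption (plain Python string slicing, mirroring the example above):

```python
caption = "There is a cat and a dog in the image ."
token_spans = [[[9, 10], [11, 14]], [[19, 20], [21, 24]]]

# join the sub-strings of each span back into a phrase
phrases = [" ".join(caption[s:e] for s, e in span) for span in token_spans]
print(phrases)  # ['a cat', 'a dog']
```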
**Running with Python:**
```python
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
IMAGE_PATH = "weights/dog-3.jpeg"
TEXT_PROMPT = "chair . person . dog ."
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25
image_source, image = load_image(IMAGE_PATH)
boxes, logits, phrases = predict(
model=model,
image=image,
caption=TEXT_PROMPT,
box_threshold=BOX_TRESHOLD,
text_threshold=TEXT_TRESHOLD
)
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_image.jpg", annotated_frame)
```
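If you need pixel-space coordinates rather than the annotated image, here is a short follow-up sketch; it assumes the variables from the block above and that `predict` returns normalized `cxcywh` boxes, and reuses `box_cxcywh_to_xyxy`, the helper also used in `demo/test_ap_on_coco.py` further down:

```python
import torch
from groundingdino.util.box_ops import box_cxcywh_to_xyxy

h, w, _ = image_source.shape  # image_source comes from load_image above
xyxy = box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])  # absolute xyxy pixel boxes
print(list(zip(phrases, xyxy.tolist())))
```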
**Web UI**
We also provide demo code to integrate Grounding DINO with a Gradio Web UI. See the file `demo/gradio_app.py` for more details.
## Checkpoints
**Notebooks**
- We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN) for more controllable image editing.
- We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editing.
## COCO Zero-shot Evaluations
We provide an example of evaluating Grounding DINO's zero-shot performance on COCO. The result should be **48.5**.
```bash
CUDA_VISIBLE_DEVICES=0 \
python demo/test_ap_on_coco.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p weights/groundingdino_swint_ogc.pth \
 --anno_path /path/to/annotations/instances_val2017.json \
 --image_dir /path/to/images/val2017
```
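For reference, the script prompts the model with all COCO category names joined into one caption (mirroring `" . ".join(cat_list) + ' .'` in `demo/test_ap_on_coco.py` shown later); a tiny illustration:

```python
cat_list = ["person", "bicycle", "car"]   # first few COCO category names
caption = " . ".join(cat_list) + " ."
print(caption)                            # person . bicycle . car .
```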
## :luggage: Checkpoints
<!-- insert a table -->
<table>
@ -105,13 +289,22 @@ We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See
<td>Swin-T</td>
<td>O365,GoldG,Cap4M</td>
<td>48.4 (zero-shot) / 57.2 (fine-tune)</td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">Github link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">GitHub link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinT_OGC.py">link</a></td>
</tr>
<tr>
<th>2</th>
<td>GroundingDINO-B</td>
<td>Swin-B</td>
<td>COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO</td>
<td>56.7 </td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth">GitHub link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swinb_cogcoor.pth">HF link</a>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinB_cfg.py">link</a></td>
</tr>
</tbody>
</table>
## Results
## :medal_military: Results
<details open>
<summary><font size="4">
@ -131,26 +324,27 @@ ODinW Object Detection Results
<summary><font size="4">
Marrying Grounding DINO with <a href="https://github.com/Stability-AI/StableDiffusion">Stable Diffusion</a> for Image Editing
</font></summary>
See our example: demo/image_editing_with_groundingdino_stablediffusion.ipynb .
See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_stablediffusion.ipynb">notebook</a> for more details.
<img src=".asset/GD_SD.png" alt="GD_SD" width="100%">
</details>
<details open>
<summary><font size="4">
Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing
Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing.
</font></summary>
See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_gligen.ipynb">notebook</a> for more details.
<img src=".asset/GD_GLIGEN.png" alt="GD_GLIGEN" width="100%">
</details>
## Model
## :sauropod: Model: Grounding DINO
The model consists of a text backbone, an image backbone, a feature enhancer, a language-guided query selection module, and a cross-modality decoder.
![arch](.asset/arch.png)
## Acknowledgement
## :hearts: Acknowledgement
Our model is related to [DINO](https://github.com/IDEA-Research/DINO) and [GLIP](https://github.com/microsoft/GLIP). Thanks for their great work!
@ -159,14 +353,15 @@ We also thank great previous work including DETR, Deformable DETR, SMCA, Conditi
Thanks [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) and [GLIGEN](https://github.com/gligen/GLIGEN) for their awesome models.
## Citation
## :black_nib: Citation
If you find our work helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@inproceedings{ShilongLiu2023GroundingDM,
title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
@article{liu2023grounding,
title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
journal={arXiv preprint arXiv:2303.05499},
year={2023}
}
```

demo/gradio_app.py

@ -16,7 +16,7 @@ import torch
# prepare the environment
os.system("python setup.py build develop --user")
os.system("pip install packaging==21.3")
os.system("pip install gradio")
os.system("pip install gradio==3.50.2")
warnings.filterwarnings("ignore")

File diff suppressed because one or more lines are too long

demo/inference_on_a_image.py

@ -11,6 +11,7 @@ from groundingdino.models import build_model
from groundingdino.util import box_ops
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
from groundingdino.util.vl_utils import create_positive_map_from_span
def plot_boxes_to_image(image_pil, tgt):
@ -80,7 +81,8 @@ def load_model(model_config_path, model_checkpoint_path, cpu_only=False):
return model
def get_grounding_output(model, image, caption, box_threshold, text_threshold, with_logits=True, cpu_only=False):
def get_grounding_output(model, image, caption, box_threshold, text_threshold=None, with_logits=True, cpu_only=False, token_spans=None):
assert text_threshold is not None or token_spans is not None, "text_threshould and token_spans should not be None at the same time!"
caption = caption.lower()
caption = caption.strip()
if not caption.endswith("."):
@ -90,29 +92,56 @@ def get_grounding_output(model, image, caption, box_threshold, text_threshold, w
image = image.to(device)
with torch.no_grad():
outputs = model(image[None], captions=[caption])
logits = outputs["pred_logits"].cpu().sigmoid()[0] # (nq, 256)
boxes = outputs["pred_boxes"].cpu()[0] # (nq, 4)
logits.shape[0]
logits = outputs["pred_logits"].sigmoid()[0] # (nq, 256)
boxes = outputs["pred_boxes"][0] # (nq, 4)
# filter output
logits_filt = logits.clone()
boxes_filt = boxes.clone()
filt_mask = logits_filt.max(dim=1)[0] > box_threshold
logits_filt = logits_filt[filt_mask] # num_filt, 256
boxes_filt = boxes_filt[filt_mask] # num_filt, 4
logits_filt.shape[0]
if token_spans is None:
logits_filt = logits.cpu().clone()
boxes_filt = boxes.cpu().clone()
filt_mask = logits_filt.max(dim=1)[0] > box_threshold
logits_filt = logits_filt[filt_mask] # num_filt, 256
boxes_filt = boxes_filt[filt_mask] # num_filt, 4
# get phrase
tokenlizer = model.tokenizer
tokenized = tokenlizer(caption)
# build pred
pred_phrases = []
for logit, box in zip(logits_filt, boxes_filt):
pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
if with_logits:
pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
else:
pred_phrases.append(pred_phrase)
else:
# given-phrase mode
positive_maps = create_positive_map_from_span(
model.tokenizer(text_prompt),
token_span=token_spans
).to(image.device) # n_phrase, 256
logits_for_phrases = positive_maps @ logits.T # n_phrase, nq
all_logits = []
all_phrases = []
all_boxes = []
for (token_span, logit_phr) in zip(token_spans, logits_for_phrases):
# get phrase
phrase = ' '.join([caption[_s:_e] for (_s, _e) in token_span])
# get mask
filt_mask = logit_phr > box_threshold
# filt box
all_boxes.append(boxes[filt_mask])
# filt logits
all_logits.append(logit_phr[filt_mask])
if with_logits:
logit_phr_num = logit_phr[filt_mask]
all_phrases.extend([phrase + f"({str(logit.item())[:4]})" for logit in logit_phr_num])
else:
all_phrases.extend([phrase for _ in range(len(filt_mask))])
boxes_filt = torch.cat(all_boxes, dim=0).cpu()
pred_phrases = all_phrases
# get phrase
tokenlizer = model.tokenizer
tokenized = tokenlizer(caption)
# build pred
pred_phrases = []
for logit, box in zip(logits_filt, boxes_filt):
pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
if with_logits:
pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
else:
pred_phrases.append(pred_phrase)
return boxes_filt, pred_phrases
@ -132,6 +161,12 @@ if __name__ == "__main__":
parser.add_argument("--box_threshold", type=float, default=0.3, help="box threshold")
parser.add_argument("--text_threshold", type=float, default=0.25, help="text threshold")
parser.add_argument("--token_spans", type=str, default=None, help=
"The positions of start and end positions of phrases of interest. \
For example, a caption is 'a cat and a dog', \
if you would like to detect 'cat', the token_spans should be '[[[2, 5]], ]', since 'a cat and a dog'[2:5] is 'cat'. \
if you would like to detect 'a cat', the token_spans should be '[[[0, 1], [2, 5]], ]', since 'a cat and a dog'[0:1] is 'a', and 'a cat and a dog'[2:5] is 'cat'. \
")
parser.add_argument("--cpu-only", action="store_true", help="running on cpu only!, default=False")
args = parser.parse_args()
@ -143,7 +178,8 @@ if __name__ == "__main__":
text_prompt = args.text_prompt
output_dir = args.output_dir
box_threshold = args.box_threshold
text_threshold = args.box_threshold
text_threshold = args.text_threshold
token_spans = args.token_spans
# make dir
os.makedirs(output_dir, exist_ok=True)
@ -155,9 +191,15 @@ if __name__ == "__main__":
# visualize raw image
image_pil.save(os.path.join(output_dir, "raw_image.jpg"))
# set the text_threshold to None if token_spans is set.
if token_spans is not None:
text_threshold = None
print("Using token_spans. Set the text_threshold to None.")
# run model
boxes_filt, pred_phrases = get_grounding_output(
model, image, text_prompt, box_threshold, text_threshold, cpu_only=args.cpu_only
model, image, text_prompt, box_threshold, text_threshold, cpu_only=args.cpu_only, token_spans=eval(f"{token_spans}")
)
# visualize pred

demo/test_ap_on_coco.py 100644

@ -0,0 +1,233 @@
import argparse
import os
import sys
import time
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, DistributedSampler
from groundingdino.models import build_model
import groundingdino.datasets.transforms as T
from groundingdino.util import box_ops, get_tokenlizer
from groundingdino.util.misc import clean_state_dict, collate_fn
from groundingdino.util.slconfig import SLConfig
# from torchvision.datasets import CocoDetection
import torchvision
from groundingdino.util.vl_utils import build_captions_and_token_span, create_positive_map_from_span
from groundingdino.datasets.cocogrounding_eval import CocoGroundingEvaluator
def load_model(model_config_path: str, model_checkpoint_path: str, device: str = "cuda"):
args = SLConfig.fromfile(model_config_path)
args.device = device
model = build_model(args)
checkpoint = torch.load(model_checkpoint_path, map_location="cpu")
model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
model.eval()
return model
class CocoDetection(torchvision.datasets.CocoDetection):
def __init__(self, img_folder, ann_file, transforms):
super().__init__(img_folder, ann_file)
self._transforms = transforms
def __getitem__(self, idx):
img, target = super().__getitem__(idx) # target: list
# import ipdb; ipdb.set_trace()
w, h = img.size
boxes = [obj["bbox"] for obj in target]
boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
boxes[:, 2:] += boxes[:, :2] # xywh -> xyxy
boxes[:, 0::2].clamp_(min=0, max=w)
boxes[:, 1::2].clamp_(min=0, max=h)
# filt invalid boxes/masks/keypoints
keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
boxes = boxes[keep]
target_new = {}
image_id = self.ids[idx]
target_new["image_id"] = image_id
target_new["boxes"] = boxes
target_new["orig_size"] = torch.as_tensor([int(h), int(w)])
if self._transforms is not None:
img, target = self._transforms(img, target_new)
return img, target
class PostProcessCocoGrounding(nn.Module):
""" This module converts the model's output into the format expected by the coco api"""
def __init__(self, num_select=300, coco_api=None, tokenlizer=None) -> None:
super().__init__()
self.num_select = num_select
assert coco_api is not None
category_dict = coco_api.dataset['categories']
cat_list = [item['name'] for item in category_dict]
captions, cat2tokenspan = build_captions_and_token_span(cat_list, True)
tokenspanlist = [cat2tokenspan[cat] for cat in cat_list]
positive_map = create_positive_map_from_span(
tokenlizer(captions), tokenspanlist) # 80, 256. normed
id_map = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16, 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31, 27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43, 39: 44, 40: 46,
41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56, 51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72, 63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85, 75: 86, 76: 87, 77: 88, 78: 89, 79: 90}
# build a mapping from label_id to pos_map
new_pos_map = torch.zeros((91, 256))
for k, v in id_map.items():
new_pos_map[v] = positive_map[k]
self.positive_map = new_pos_map
@torch.no_grad()
def forward(self, outputs, target_sizes, not_to_xyxy=False):
""" Perform the computation
Parameters:
outputs: raw outputs of the model
target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
For evaluation, this must be the original image size (before any data augmentation)
For visualization, this should be the image size after data augment, but before padding
"""
num_select = self.num_select
out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']
# pos map to logit
prob_to_token = out_logits.sigmoid() # bs, 100, 256
pos_maps = self.positive_map.to(prob_to_token.device)
# (bs, 100, 256) @ (91, 256).T -> (bs, 100, 91)
prob_to_label = prob_to_token @ pos_maps.T
# if os.environ.get('IPDB_SHILONG_DEBUG', None) == 'INFO':
# import ipdb; ipdb.set_trace()
assert len(out_logits) == len(target_sizes)
assert target_sizes.shape[1] == 2
prob = prob_to_label
topk_values, topk_indexes = torch.topk(
prob.view(out_logits.shape[0], -1), num_select, dim=1)
scores = topk_values
topk_boxes = topk_indexes // prob.shape[2]
labels = topk_indexes % prob.shape[2]
if not_to_xyxy:
boxes = out_bbox
else:
boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
boxes = torch.gather(
boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
# and from relative [0, 1] to absolute [0, height] coordinates
img_h, img_w = target_sizes.unbind(1)
scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
boxes = boxes * scale_fct[:, None, :]
results = [{'scores': s, 'labels': l, 'boxes': b}
for s, l, b in zip(scores, labels, boxes)]
return results
def main(args):
# config
cfg = SLConfig.fromfile(args.config_file)
# build model
model = load_model(args.config_file, args.checkpoint_path)
model = model.to(args.device)
model = model.eval()
# build dataloader
transform = T.Compose(
[
T.RandomResize([800], max_size=1333),
T.ToTensor(),
T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
]
)
dataset = CocoDetection(
args.image_dir, args.anno_path, transforms=transform)
data_loader = DataLoader(
dataset, batch_size=1, shuffle=False, num_workers=args.num_workers, collate_fn=collate_fn)
# build post processor
tokenlizer = get_tokenlizer.get_tokenlizer(cfg.text_encoder_type)
postprocessor = PostProcessCocoGrounding(
coco_api=dataset.coco, tokenlizer=tokenlizer)
# build evaluator
evaluator = CocoGroundingEvaluator(
dataset.coco, iou_types=("bbox",), useCats=True)
# build captions
category_dict = dataset.coco.dataset['categories']
cat_list = [item['name'] for item in category_dict]
caption = " . ".join(cat_list) + ' .'
print("Input text prompt:", caption)
# run inference
start = time.time()
for i, (images, targets) in enumerate(data_loader):
# get images and captions
images = images.tensors.to(args.device)
bs = images.shape[0]
input_captions = [caption] * bs
# feed to the model
outputs = model(images, captions=input_captions)
orig_target_sizes = torch.stack(
[t["orig_size"] for t in targets], dim=0).to(images.device)
results = postprocessor(outputs, orig_target_sizes)
cocogrounding_res = {
target["image_id"]: output for target, output in zip(targets, results)}
evaluator.update(cocogrounding_res)
if (i+1) % 30 == 0:
used_time = time.time() - start
eta = len(data_loader) / (i+1e-5) * used_time - used_time
print(
f"processed {i}/{len(data_loader)} images. time: {used_time:.2f}s, ETA: {eta:.2f}s")
evaluator.synchronize_between_processes()
evaluator.accumulate()
evaluator.summarize()
print("Final results:", evaluator.coco_eval["bbox"].stats.tolist())
if __name__ == "__main__":
parser = argparse.ArgumentParser(
"Grounding DINO eval on COCO", add_help=True)
# load model
parser.add_argument("--config_file", "-c", type=str,
required=True, help="path to config file")
parser.add_argument(
"--checkpoint_path", "-p", type=str, required=True, help="path to checkpoint file"
)
parser.add_argument("--device", type=str, default="cuda",
help="running device (default: cuda)")
# post processing
parser.add_argument("--num_select", type=int, default=300,
help="number of topk to select")
# coco info
parser.add_argument("--anno_path", type=str,
required=True, help="coco root")
parser.add_argument("--image_dir", type=str,
required=True, help="coco image dir")
parser.add_argument("--num_workers", type=int, default=4,
help="number of workers for dataloader")
args = parser.parse_args()
main(args)

8
docker_test.py 100644

@ -0,0 +1,8 @@
from groundingdino.util.inference import load_model, load_image, predict, annotate
import torch
import cv2
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.pyy", "weights/groundingdino_swint_ogc.pth")
model = model.to('cuda:0')
print(torch.cuda.is_available())
print('DONE!')

248
environment.yaml 100644

@ -0,0 +1,248 @@
name: dino
channels:
- pytorch
- nvidia
- conda-forge
- defaults
dependencies:
- addict=2.4.0=pyhd8ed1ab_2
- aiohttp=3.8.5=py39ha55989b_0
- aiosignal=1.3.1=pyhd8ed1ab_0
- asttokens=2.0.5=pyhd3eb1b0_0
- async-timeout=4.0.3=pyhd8ed1ab_0
- attrs=23.1.0=pyh71513ae_1
- aws-c-auth=0.7.0=h6f3c987_2
- aws-c-cal=0.6.0=h6ba3258_0
- aws-c-common=0.8.23=hcfcfb64_0
- aws-c-compression=0.2.17=h420beca_1
- aws-c-event-stream=0.3.1=had47b81_1
- aws-c-http=0.7.11=h72ba615_0
- aws-c-io=0.13.28=ha35c040_0
- aws-c-mqtt=0.8.14=h4941efa_2
- aws-c-s3=0.3.13=he04eaa7_2
- aws-c-sdkutils=0.1.11=h420beca_1
- aws-checksums=0.1.16=h420beca_1
- aws-crt-cpp=0.20.3=h247a981_4
- aws-sdk-cpp=1.10.57=h1a0519f_17
- backcall=0.2.0=pyhd3eb1b0_0
- blas=2.118=mkl
- blas-devel=3.9.0=18_win64_mkl
- brotli=1.0.9=hcfcfb64_9
- brotli-bin=1.0.9=hcfcfb64_9
- brotli-python=1.0.9=py39h99910a6_9
- bzip2=1.0.8=h8ffe710_4
- c-ares=1.19.1=hcfcfb64_0
- ca-certificates=2023.08.22=haa95532_0
- certifi=2023.7.22=py39haa95532_0
- charset-normalizer=3.2.0=pyhd8ed1ab_0
- click=8.1.7=win_pyh7428d3b_0
- colorama=0.4.6=pyhd8ed1ab_0
- comm=0.1.2=py39haa95532_0
- contourpy=1.1.1=py39h1f6ef14_1
- cuda-cccl=12.2.140=0
- cuda-cudart=11.8.89=0
- cuda-cudart-dev=11.8.89=0
- cuda-cupti=11.8.87=0
- cuda-libraries=11.8.0=0
- cuda-libraries-dev=11.8.0=0
- cuda-nvrtc=11.8.89=0
- cuda-nvrtc-dev=11.8.89=0
- cuda-nvtx=11.8.86=0
- cuda-profiler-api=12.2.140=0
- cuda-runtime=11.8.0=0
- cycler=0.11.0=pyhd8ed1ab_0
- cython=3.0.0=py39h2bbff1b_0
- dataclasses=0.8=pyhc8e2a94_3
- datasets=2.14.5=pyhd8ed1ab_0
- debugpy=1.6.7=py39hd77b12b_0
- decorator=5.1.1=pyhd3eb1b0_0
- dill=0.3.7=pyhd8ed1ab_0
- exceptiongroup=1.0.4=py39haa95532_0
- executing=0.8.3=pyhd3eb1b0_0
- filelock=3.12.4=pyhd8ed1ab_0
- fonttools=4.42.1=py39ha55989b_0
- freeglut=3.2.2=h63175ca_2
- freetype=2.12.1=hdaf720e_2
- frozenlist=1.4.0=py39ha55989b_1
- fsspec=2023.6.0=pyh1a96a4e_0
- gettext=0.21.1=h5728263_0
- glib=2.78.0=h12be248_0
- glib-tools=2.78.0=h12be248_0
- gst-plugins-base=1.22.6=h001b923_1
- gstreamer=1.22.6=hb4038d2_1
- huggingface_hub=0.17.3=pyhd8ed1ab_0
- icu=70.1=h0e60522_0
- idna=3.4=pyhd8ed1ab_0
- importlib-metadata=6.8.0=pyha770c72_0
- importlib-resources=6.1.0=pyhd8ed1ab_0
- importlib_metadata=6.8.0=hd8ed1ab_0
- importlib_resources=6.1.0=pyhd8ed1ab_0
- intel-openmp=2023.2.0=h57928b3_49503
- ipykernel=6.25.0=py39h9909e9c_0
- ipython=8.15.0=py39haa95532_0
- jasper=2.0.33=hc2e4405_1
- jedi=0.18.1=py39haa95532_1
- jinja2=3.1.2=pyhd8ed1ab_1
- joblib=1.3.2=pyhd8ed1ab_0
- jpeg=9e=hcfcfb64_3
- jupyter_client=8.1.0=py39haa95532_0
- jupyter_core=5.3.0=py39haa95532_0
- kiwisolver=1.4.5=py39h1f6ef14_1
- krb5=1.20.1=heb0366b_0
- lcms2=2.14=h90d422f_0
- lerc=4.0.0=h63175ca_0
- libabseil=20230125.3=cxx17_h63175ca_0
- libarrow=12.0.1=h12e5d06_5_cpu
- libblas=3.9.0=18_win64_mkl
- libbrotlicommon=1.0.9=hcfcfb64_9
- libbrotlidec=1.0.9=hcfcfb64_9
- libbrotlienc=1.0.9=hcfcfb64_9
- libcblas=3.9.0=18_win64_mkl
- libclang=15.0.7=default_h77d9078_3
- libclang13=15.0.7=default_h77d9078_3
- libcrc32c=1.1.2=h0e60522_0
- libcublas=11.11.3.6=0
- libcublas-dev=11.11.3.6=0
- libcufft=10.9.0.58=0
- libcufft-dev=10.9.0.58=0
- libcurand=10.3.3.141=0
- libcurand-dev=10.3.3.141=0
- libcurl=8.1.2=h68f0423_0
- libcusolver=11.4.1.48=0
- libcusolver-dev=11.4.1.48=0
- libcusparse=11.7.5.86=0
- libcusparse-dev=11.7.5.86=0
- libdeflate=1.14=hcfcfb64_0
- libevent=2.1.12=h3671451_1
- libffi=3.4.2=h8ffe710_5
- libglib=2.78.0=he8f3873_0
- libgoogle-cloud=2.12.0=h00b2bdc_1
- libgrpc=1.54.3=ha177ca7_0
- libhwloc=2.9.3=default_haede6df_1009
- libiconv=1.17=h8ffe710_0
- liblapack=3.9.0=18_win64_mkl
- liblapacke=3.9.0=18_win64_mkl
- libnpp=11.8.0.86=0
- libnpp-dev=11.8.0.86=0
- libnvjpeg=11.9.0.86=0
- libnvjpeg-dev=11.9.0.86=0
- libogg=1.3.4=h8ffe710_1
- libopencv=4.5.3=py39h488c12c_8
- libpng=1.6.39=h19919ed_0
- libprotobuf=3.21.12=h12be248_2
- libsodium=1.0.18=h62dcd97_0
- libsqlite=3.43.0=hcfcfb64_0
- libssh2=1.11.0=h7dfc565_0
- libthrift=0.18.1=h06f6336_2
- libtiff=4.4.0=hc4f729c_5
- libutf8proc=2.8.0=h82a8f57_0
- libuv=1.44.2=hcfcfb64_1
- libvorbis=1.3.7=h0e60522_0
- libwebp-base=1.3.2=hcfcfb64_0
- libxcb=1.13=hcd874cb_1004
- libxml2=2.11.5=hc3477c8_1
- libzlib=1.2.13=hcfcfb64_5
- lz4-c=1.9.4=hcfcfb64_0
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=2.1.3=py39ha55989b_1
- matplotlib-base=3.8.0=py39hf19769e_1
- matplotlib-inline=0.1.6=py39haa95532_0
- mkl=2022.1.0=h6a75c08_874
- mkl-devel=2022.1.0=h57928b3_875
- mkl-include=2022.1.0=h6a75c08_874
- mpmath=1.3.0=pyhd8ed1ab_0
- msys2-conda-epoch=20160418=1
- multidict=6.0.4=py39ha55989b_0
- multiprocess=0.70.15=py39ha55989b_1
- munkres=1.1.4=pyh9f0ad1d_0
- nest-asyncio=1.5.6=py39haa95532_0
- networkx=3.1=pyhd8ed1ab_0
- numpy=1.26.0=py39hddb5d58_0
- opencv=4.5.3=py39hcbf5309_8
- openjpeg=2.5.0=hc9384bd_1
- openssl=3.1.3=hcfcfb64_0
- orc=1.9.0=hada7b9e_1
- packaging=23.1=pyhd8ed1ab_0
- pandas=2.1.1=py39h32e6231_0
- parso=0.8.3=pyhd3eb1b0_0
- pcre2=10.40=h17e33f8_0
- pickleshare=0.7.5=pyhd3eb1b0_1003
- pillow=9.2.0=py39h595c93f_3
- pip=23.2.1=pyhd8ed1ab_0
- platformdirs=3.10.0=pyhd8ed1ab_0
- prompt-toolkit=3.0.36=py39haa95532_0
- psutil=5.9.0=py39h2bbff1b_0
- pthread-stubs=0.4=hcd874cb_1001
- pthreads-win32=2.9.1=hfa6e2cd_3
- pure_eval=0.2.2=pyhd3eb1b0_0
- py-opencv=4.5.3=py39h00e5391_8
- pyarrow=12.0.1=py39hca4e8af_5_cpu
- pycocotools=2.0.6=py39hc266a54_1
- pygments=2.15.1=py39haa95532_1
- pyparsing=3.1.1=pyhd8ed1ab_0
- pysocks=1.7.1=pyh0701188_6
- python=3.9.18=h4de0772_0_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-tzdata=2023.3=pyhd8ed1ab_0
- python-xxhash=3.3.0=py39ha55989b_1
- python_abi=3.9=4_cp39
- pytorch=2.0.1=py3.9_cuda11.8_cudnn8_0
- pytorch-cuda=11.8=h24eeafa_5
- pytorch-mutex=1.0=cuda
- pytz=2023.3.post1=pyhd8ed1ab_0
- pywin32=305=py39h2bbff1b_0
- pyyaml=6.0.1=py39ha55989b_1
- pyzmq=25.1.0=py39hd77b12b_0
- qt-main=5.15.8=h720456b_6
- re2=2023.03.02=hd4eee63_0
- regex=2023.8.8=py39ha55989b_1
- requests=2.31.0=pyhd8ed1ab_0
- sacremoses=0.0.53=pyhd8ed1ab_0
- safetensors=0.3.3=py39hf21820d_1
- setuptools=68.2.2=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- snappy=1.1.10=hfb803bf_0
- stack_data=0.2.0=pyhd3eb1b0_0
- sympy=1.12=pyh04b8f61_3
- tbb=2021.10.0=h91493d7_1
- timm=0.9.7=pyhd8ed1ab_0
- tk=8.6.13=hcfcfb64_0
- tokenizers=0.13.3=py39hca44cb7_0
- tomli=2.0.1=pyhd8ed1ab_0
- tornado=6.3.2=py39h2bbff1b_0
- tqdm=4.66.1=pyhd8ed1ab_0
- traitlets=5.7.1=py39haa95532_0
- transformers=4.33.2=pyhd8ed1ab_0
- typing-extensions=4.8.0=hd8ed1ab_0
- typing_extensions=4.8.0=pyha770c72_0
- tzdata=2023c=h71feb2d_0
- ucrt=10.0.22621.0=h57928b3_0
- unicodedata2=15.0.0=py39ha55989b_1
- urllib3=2.0.5=pyhd8ed1ab_0
- vc=14.3=h64f974e_17
- vc14_runtime=14.36.32532=hdcecf7f_17
- vs2015_runtime=14.36.32532=h05e6639_17
- wcwidth=0.2.5=pyhd3eb1b0_0
- wheel=0.41.2=pyhd8ed1ab_0
- win_inet_pton=1.1.0=pyhd8ed1ab_6
- xorg-libxau=1.0.11=hcd874cb_0
- xorg-libxdmcp=1.1.3=hcd874cb_0
- xxhash=0.8.2=hcfcfb64_0
- xz=5.2.6=h8d14728_0
- yaml=0.2.5=h8ffe710_2
- yapf=0.40.1=pyhd8ed1ab_0
- yarl=1.9.2=py39ha55989b_0
- zeromq=4.3.4=hd77b12b_0
- zipp=3.17.0=pyhd8ed1ab_0
- zlib=1.2.13=hcfcfb64_5
- zstd=1.5.5=h12be248_0
- pip:
- opencv-python==4.8.0.76
- supervision==0.6.0
- torchaudio==2.0.2
- torchvision==0.15.2
prefix: C:\Users\Makoto\miniconda3\envs\dino

groundingdino/config/GroundingDINO_SwinB_cfg.py 100644

@ -0,0 +1,43 @@
batch_size = 1
modelname = "groundingdino"
backbone = "swin_B_384_22k"
position_embedding = "sine"
pe_temperatureH = 20
pe_temperatureW = 20
return_interm_indices = [1, 2, 3]
backbone_freeze_keywords = None
enc_layers = 6
dec_layers = 6
pre_norm = False
dim_feedforward = 2048
hidden_dim = 256
dropout = 0.0
nheads = 8
num_queries = 900
query_dim = 4
num_patterns = 0
num_feature_levels = 4
enc_n_points = 4
dec_n_points = 4
two_stage_type = "standard"
two_stage_bbox_embed_share = False
two_stage_class_embed_share = False
transformer_activation = "relu"
dec_pred_bbox_embed_share = True
dn_box_noise_scale = 1.0
dn_label_noise_ratio = 0.5
dn_label_coef = 1.0
dn_bbox_coef = 1.0
embed_init_tgt = True
dn_labelbook_size = 2000
max_text_len = 256
text_encoder_type = "bert-base-uncased"
use_text_enhancer = True
use_fusion_layer = True
use_checkpoint = True
use_transformer_ckpt = True
use_text_cross_attention = True
text_dropout = 0.0
fusion_dropout = 0.0
fusion_droppath = 0.1
sub_sentence_present = True


groundingdino/datasets/cocogrounding_eval.py 100644

@ -0,0 +1,269 @@
# ------------------------------------------------------------------------
# Grounding DINO. Modified by Shilong Liu.
# url: https://github.com/IDEA-Research/GroundingDINO
# Copyright (c) 2023 IDEA. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
# ------------------------------------------------------------------------
# Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
COCO evaluator that works in distributed mode.
Mostly copy-paste from https://github.com/pytorch/vision/blob/edfd5a7/references/detection/coco_eval.py
The difference is that there is less copy-pasting from pycocotools
in the end of the file, as python3 can suppress prints with contextlib
"""
import contextlib
import copy
import os
import numpy as np
import pycocotools.mask as mask_util
import torch
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
from groundingdino.util.misc import all_gather
class CocoGroundingEvaluator(object):
    def __init__(self, coco_gt, iou_types, useCats=True):
        assert isinstance(iou_types, (list, tuple))
        coco_gt = copy.deepcopy(coco_gt)
        self.coco_gt = coco_gt
        self.iou_types = iou_types
        self.coco_eval = {}
        for iou_type in iou_types:
            self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
            self.coco_eval[iou_type].useCats = useCats
        self.img_ids = []
        self.eval_imgs = {k: [] for k in iou_types}
        self.useCats = useCats

    def update(self, predictions):
        img_ids = list(np.unique(list(predictions.keys())))
        self.img_ids.extend(img_ids)

        for iou_type in self.iou_types:
            results = self.prepare(predictions, iou_type)

            # suppress pycocotools prints
            with open(os.devnull, "w") as devnull:
                with contextlib.redirect_stdout(devnull):
                    coco_dt = COCO.loadRes(self.coco_gt, results) if results else COCO()

            coco_eval = self.coco_eval[iou_type]
            coco_eval.cocoDt = coco_dt
            coco_eval.params.imgIds = list(img_ids)
            coco_eval.params.useCats = self.useCats
            img_ids, eval_imgs = evaluate(coco_eval)

            self.eval_imgs[iou_type].append(eval_imgs)

    def synchronize_between_processes(self):
        for iou_type in self.iou_types:
            self.eval_imgs[iou_type] = np.concatenate(self.eval_imgs[iou_type], 2)
            create_common_coco_eval(self.coco_eval[iou_type], self.img_ids, self.eval_imgs[iou_type])

    def accumulate(self):
        for coco_eval in self.coco_eval.values():
            coco_eval.accumulate()

    def summarize(self):
        for iou_type, coco_eval in self.coco_eval.items():
            print("IoU metric: {}".format(iou_type))
            coco_eval.summarize()

    def prepare(self, predictions, iou_type):
        if iou_type == "bbox":
            return self.prepare_for_coco_detection(predictions)
        elif iou_type == "segm":
            return self.prepare_for_coco_segmentation(predictions)
        elif iou_type == "keypoints":
            return self.prepare_for_coco_keypoint(predictions)
        else:
            raise ValueError("Unknown iou type {}".format(iou_type))

    def prepare_for_coco_detection(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            boxes = prediction["boxes"]
            boxes = convert_to_xywh(boxes).tolist()
            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "bbox": box,
                        "score": scores[k],
                    }
                    for k, box in enumerate(boxes)
                ]
            )
        return coco_results

    def prepare_for_coco_segmentation(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            scores = prediction["scores"]
            labels = prediction["labels"]
            masks = prediction["masks"]

            masks = masks > 0.5

            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()

            rles = [
                mask_util.encode(np.array(mask[0, :, :, np.newaxis], dtype=np.uint8, order="F"))[0]
                for mask in masks
            ]
            for rle in rles:
                rle["counts"] = rle["counts"].decode("utf-8")

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "segmentation": rle,
                        "score": scores[k],
                    }
                    for k, rle in enumerate(rles)
                ]
            )
        return coco_results

    def prepare_for_coco_keypoint(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            boxes = prediction["boxes"]
            boxes = convert_to_xywh(boxes).tolist()
            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()
            keypoints = prediction["keypoints"]
            keypoints = keypoints.flatten(start_dim=1).tolist()

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "keypoints": keypoint,
                        "score": scores[k],
                    }
                    for k, keypoint in enumerate(keypoints)
                ]
            )
        return coco_results


def convert_to_xywh(boxes):
    xmin, ymin, xmax, ymax = boxes.unbind(1)
    return torch.stack((xmin, ymin, xmax - xmin, ymax - ymin), dim=1)


def merge(img_ids, eval_imgs):
    all_img_ids = all_gather(img_ids)
    all_eval_imgs = all_gather(eval_imgs)

    merged_img_ids = []
    for p in all_img_ids:
        merged_img_ids.extend(p)

    merged_eval_imgs = []
    for p in all_eval_imgs:
        merged_eval_imgs.append(p)

    merged_img_ids = np.array(merged_img_ids)
    merged_eval_imgs = np.concatenate(merged_eval_imgs, 2)

    # keep only unique (and in sorted order) images
    merged_img_ids, idx = np.unique(merged_img_ids, return_index=True)
    merged_eval_imgs = merged_eval_imgs[..., idx]

    return merged_img_ids, merged_eval_imgs


def create_common_coco_eval(coco_eval, img_ids, eval_imgs):
    img_ids, eval_imgs = merge(img_ids, eval_imgs)
    img_ids = list(img_ids)
    eval_imgs = list(eval_imgs.flatten())

    coco_eval.evalImgs = eval_imgs
    coco_eval.params.imgIds = img_ids
    coco_eval._paramsEval = copy.deepcopy(coco_eval.params)


#################################################################
# From pycocotools, just removed the prints and fixed
# a Python3 bug about unicode not defined
#################################################################


def evaluate(self):
    """
    Run per image evaluation on given images and store results (a list of dict) in self.evalImgs
    :return: None
    """
    # tic = time.time()
    # print('Running per image evaluation...')
    p = self.params
    # add backward compatibility if useSegm is specified in params
    if p.useSegm is not None:
        p.iouType = "segm" if p.useSegm == 1 else "bbox"
        print("useSegm (deprecated) is not None. Running {} evaluation".format(p.iouType))
    # print('Evaluate annotation type *{}*'.format(p.iouType))
    p.imgIds = list(np.unique(p.imgIds))
    if p.useCats:
        p.catIds = list(np.unique(p.catIds))
    p.maxDets = sorted(p.maxDets)
    self.params = p

    self._prepare()
    # loop through images, area range, max detection number
    catIds = p.catIds if p.useCats else [-1]

    if p.iouType == "segm" or p.iouType == "bbox":
        computeIoU = self.computeIoU
    elif p.iouType == "keypoints":
        computeIoU = self.computeOks
    self.ious = {
        (imgId, catId): computeIoU(imgId, catId)
        for imgId in p.imgIds
        for catId in catIds}

    evaluateImg = self.evaluateImg
    maxDet = p.maxDets[-1]
    evalImgs = [
        evaluateImg(imgId, catId, areaRng, maxDet)
        for catId in catIds
        for areaRng in p.areaRng
        for imgId in p.imgIds
    ]
    # this is NOT in the pycocotools code, but could be done outside
    evalImgs = np.asarray(evalImgs).reshape(len(catIds), len(p.areaRng), len(p.imgIds))
    self._paramsEval = copy.deepcopy(self.params)
    # toc = time.time()
    # print('DONE (t={:0.2f}s).'.format(toc-tic))
    return p.imgIds, evalImgs


#################################################################
# end of straight copy from pycocotools, just removing the prints
#################################################################
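A minimal usage sketch of the evaluator above. The enclosing class is defined earlier in this file; it is referred to here as CocoGroundingEvaluator on assumption, and the annotation path and `predictions` dict are placeholders (`predictions` maps each COCO image id to a dict with xyxy "boxes", "scores" and "labels" tensors):

from pycocotools.coco import COCO

coco_gt = COCO("annotations/instances_val2017.json")              # placeholder annotation file
evaluator = CocoGroundingEvaluator(coco_gt, iou_types=["bbox"])   # assumed name of the class above

evaluator.update(predictions)               # predictions: {image_id: {"boxes", "scores", "labels"}}
evaluator.synchronize_between_processes()   # gathers per-image results across workers via all_gather
evaluator.accumulate()
evaluator.summarize()                       # prints the standard COCO AP/AR table per IoU type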

View File

@@ -206,6 +206,21 @@ class GroundingDINO(nn.Module):
nn.init.xavier_uniform_(proj[0].weight, gain=1)
nn.init.constant_(proj[0].bias, 0)
def set_image_tensor(self, samples: NestedTensor):
if isinstance(samples, (list, torch.Tensor)):
samples = nested_tensor_from_tensor_list(samples)
self.features, self.poss = self.backbone(samples)
def unset_image_tensor(self):
if hasattr(self, 'features'):
del self.features
if hasattr(self,'poss'):
del self.poss
def set_image_features(self, features , poss):
self.features = features
self.poss = poss
def init_ref_points(self, use_num_queries):
self.refpoint_embed = nn.Embedding(use_num_queries, self.query_dim)
@@ -228,7 +243,6 @@ class GroundingDINO(nn.Module):
captions = kw["captions"]
else:
captions = [t["caption"] for t in targets]
len(captions)
# encoder texts
tokenized = self.tokenizer(captions, padding="longest", return_tensors="pt").to(
@@ -283,14 +297,14 @@ class GroundingDINO(nn.Module):
}
# import ipdb; ipdb.set_trace()
if isinstance(samples, (list, torch.Tensor)):
samples = nested_tensor_from_tensor_list(samples)
features, poss = self.backbone(samples)
if not hasattr(self, 'features') or not hasattr(self, 'poss'):
self.set_image_tensor(samples)
srcs = []
masks = []
for l, feat in enumerate(features):
for l, feat in enumerate(self.features):
src, mask = feat.decompose()
srcs.append(self.input_proj[l](src))
masks.append(mask)
@@ -299,7 +313,7 @@ class GroundingDINO(nn.Module):
_len_srcs = len(srcs)
for l in range(_len_srcs, self.num_feature_levels):
if l == _len_srcs:
src = self.input_proj[l](features[-1].tensors)
src = self.input_proj[l](self.features[-1].tensors)
else:
src = self.input_proj[l](srcs[-1])
m = samples.mask
@@ -307,11 +321,11 @@ class GroundingDINO(nn.Module):
pos_l = self.backbone[1](NestedTensor(src, mask)).to(src.dtype)
srcs.append(src)
masks.append(mask)
poss.append(pos_l)
self.poss.append(pos_l)
input_query_bbox = input_query_label = attn_mask = dn_meta = None
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
srcs, masks, input_query_bbox, poss, input_query_label, attn_mask, text_dict
srcs, masks, input_query_bbox, self.poss, input_query_label, attn_mask, text_dict
)
# deformable-detr-like anchor update
@@ -345,7 +359,9 @@ class GroundingDINO(nn.Module):
# interm_class = self.transformer.enc_out_class_embed(hs_enc[-1], text_dict)
# out['interm_outputs'] = {'pred_logits': interm_class, 'pred_boxes': interm_coord}
# out['interm_outputs_for_matching_pre'] = {'pred_logits': interm_class, 'pred_boxes': init_box_proposal}
unset_image_tensor = kw.get('unset_image_tensor', True)
if unset_image_tensor:
self.unset_image_tensor() ## If necessary
return out
@torch.jit.unused
@@ -393,3 +409,4 @@ def build_groundingdino(args):
)
return model
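The set_image_tensor / unset_image_tensor hooks added above let the image backbone run once while several text prompts are scored against the cached features. A minimal sketch under assumptions: the config and checkpoint paths are placeholders, load_model is assumed to accept a device argument, and the keyword names (captions, unset_image_tensor) follow the forward code shown above.

import torch
from groundingdino.util.inference import load_model
from groundingdino.util.misc import nested_tensor_from_tensor_list

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth", device="cpu")   # placeholder paths
samples = nested_tensor_from_tensor_list([torch.randn(3, 800, 800)])      # stand-in image batch

model.set_image_tensor(samples)                        # backbone runs once, features are cached
with torch.no_grad():
    for caption in ["a dog .", "a person ."]:
        out = model(samples, captions=[caption], unset_image_tensor=False)
model.unset_image_tensor()                             # release the cached features when done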

View File

@@ -1,5 +1,5 @@
from transformers import AutoTokenizer, BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast
import os
def get_tokenlizer(text_encoder_type):
if not isinstance(text_encoder_type, str):
@@ -8,6 +8,8 @@ def get_tokenlizer(text_encoder_type):
text_encoder_type = text_encoder_type.text_encoder_type
elif text_encoder_type.get("text_encoder_type", False):
text_encoder_type = text_encoder_type.get("text_encoder_type")
elif os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type):
pass
else:
raise ValueError(
"Unknown type of text_encoder_type: {}".format(type(text_encoder_type))
@@ -19,8 +21,9 @@ def get_tokenlizer(text_encoder_type):
def get_pretrained_language_model(text_encoder_type):
if text_encoder_type == "bert-base-uncased":
if text_encoder_type == "bert-base-uncased" or (os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type)):
return BertModel.from_pretrained(text_encoder_type)
if text_encoder_type == "roberta-base":
return RobertaModel.from_pretrained(text_encoder_type)
raise ValueError("Unknown text_encoder_type {}".format(text_encoder_type))

View File

@@ -6,6 +6,7 @@ import supervision as sv
import torch
from PIL import Image
from torchvision.ops import box_convert
import bisect
import groundingdino.datasets.transforms as T
from groundingdino.models import build_model
@@ -55,7 +56,8 @@ def predict(
caption: str,
box_threshold: float,
text_threshold: float,
device: str = "cuda"
device: str = "cuda",
remove_combined: bool = False
) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
caption = preprocess_caption(caption=caption)
@@ -74,17 +76,40 @@ def predict(
tokenizer = model.tokenizer
tokenized = tokenizer(caption)
phrases = [
get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
for logit
in logits
]
if remove_combined:
sep_idx = [i for i in range(len(tokenized['input_ids'])) if tokenized['input_ids'][i] in [101, 102, 1012]]
phrases = []
for logit in logits:
max_idx = logit.argmax()
insert_idx = bisect.bisect_left(sep_idx, max_idx)
right_idx = sep_idx[insert_idx]
left_idx = sep_idx[insert_idx - 1]
phrases.append(get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer, left_idx, right_idx).replace('.', ''))
else:
phrases = [
get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
for logit
in logits
]
return boxes, logits.max(dim=1)[0], phrases
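The remove_combined branch above clips each positive map to the prompt phrase that contains its highest-scoring token, using BERT's special token ids (101 = [CLS], 102 = [SEP], 1012 = '.') as phrase boundaries. A small sketch of that boundary lookup with hypothetical token ids for the caption "chair . person . dog .":

import bisect

input_ids = [101, 4521, 1012, 2711, 1012, 3899, 1012, 102]                # hypothetical tokenization
sep_idx = [i for i, t in enumerate(input_ids) if t in (101, 102, 1012)]   # -> [0, 2, 4, 6, 7]
max_idx = 5                                    # position of the logit's highest-scoring token ("dog")
insert_idx = bisect.bisect_left(sep_idx, max_idx)
left_idx, right_idx = sep_idx[insert_idx - 1], sep_idx[insert_idx]
print(left_idx, right_idx)                     # -> 4 6, so only the "dog" span survives the clamp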
def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor, phrases: List[str]) -> np.ndarray:
"""
This function annotates an image with bounding boxes and labels.
Parameters:
image_source (np.ndarray): The source image to be annotated.
boxes (torch.Tensor): A tensor containing bounding box coordinates.
logits (torch.Tensor): A tensor containing confidence scores for each bounding box.
phrases (List[str]): A list of labels for each bounding box.
Returns:
np.ndarray: The annotated image.
"""
h, w, _ = image_source.shape
boxes = boxes * torch.Tensor([w, h, w, h])
xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
@@ -96,9 +121,11 @@ def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor
in zip(phrases, logits)
]
box_annotator = sv.BoxAnnotator()
bbox_annotator = sv.BoxAnnotator(color_lookup=sv.ColorLookup.INDEX)
label_annotator = sv.LabelAnnotator(color_lookup=sv.ColorLookup.INDEX)
annotated_frame = cv2.cvtColor(image_source, cv2.COLOR_RGB2BGR)
annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
annotated_frame = bbox_annotator.annotate(scene=annotated_frame, detections=detections)
annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
return annotated_frame
@@ -153,7 +180,8 @@ class Model:
image=processed_image,
caption=caption,
box_threshold=box_threshold,
text_threshold=text_threshold)
text_threshold=text_threshold,
device=self.device)
source_h, source_w, _ = image.shape
detections = Model.post_process_result(
source_h=source_h,
@@ -188,14 +216,15 @@ class Model:
box_annotator = sv.BoxAnnotator()
annotated_image = box_annotator.annotate(scene=image, detections=detections)
"""
caption = ", ".join(classes)
caption = ". ".join(classes)
processed_image = Model.preprocess_image(image_bgr=image).to(self.device)
boxes, logits, phrases = predict(
model=self.model,
image=processed_image,
caption=caption,
box_threshold=box_threshold,
text_threshold=text_threshold)
text_threshold=text_threshold,
device=self.device)
source_h, source_w, _ = image.shape
detections = Model.post_process_result(
source_h=source_h,
@@ -235,8 +264,10 @@ class Model:
def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
class_ids = []
for phrase in phrases:
try:
class_ids.append(classes.index(phrase))
except ValueError:
for class_ in classes:
if class_ in phrase:
class_ids.append(classes.index(class_))
break
else:
class_ids.append(None)
return np.array(class_ids)
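A short example of the matching logic above: an exact class name resolves first, otherwise the first class that appears as a substring of the predicted phrase is used, and a phrase matching nothing maps to None. Inputs below are hypothetical; Model.phrases2classes is the static method shown above.

from groundingdino.util.inference import Model

classes = ["person", "dog"]                      # hypothetical prompt classes
phrases = ["dog", "black dog", "chair"]          # hypothetical predicted phrases
class_ids = Model.phrases2classes(phrases=phrases, classes=classes)
print(class_ids)                                 # -> [1 1 None]: exact hit, substring hit, no match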

View File

@@ -2,6 +2,7 @@
# Modified from mmcv
# ==========================================================
import ast
import os
import os.path as osp
import shutil
import sys
@@ -80,6 +81,8 @@ class SLConfig(object):
with tempfile.TemporaryDirectory() as temp_config_dir:
temp_config_file = tempfile.NamedTemporaryFile(dir=temp_config_dir, suffix=".py")
temp_config_name = osp.basename(temp_config_file.name)
if os.name == 'nt':
temp_config_file.close()
shutil.copyfile(filename, osp.join(temp_config_dir, temp_config_name))
temp_module_name = osp.splitext(temp_config_name)[0]
sys.path.insert(0, temp_config_dir)
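The os.name == 'nt' guard above exists because Windows keeps an exclusive handle on a NamedTemporaryFile, so shutil.copyfile cannot reopen the same path until the handle is closed (on Unix it can). A simplified standalone illustration of the same pattern, using delete=False rather than the default and a placeholder source file:

import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as temp_dir:
    tmp = tempfile.NamedTemporaryFile(dir=temp_dir, suffix=".py", delete=False)
    if os.name == "nt":
        tmp.close()                               # Windows: release the handle before reusing the path
    shutil.copyfile("my_config.py", tmp.name)     # placeholder config file being staged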

View File

@@ -597,10 +597,12 @@ def targets_to(targets: List[Dict[str, Any]], device):
def get_phrases_from_posmap(
posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer
posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer, left_idx: int = 0, right_idx: int = 255
):
assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
if posmap.dim() == 1:
posmap[0: left_idx + 1] = False
posmap[right_idx:] = False
non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
token_ids = [tokenized["input_ids"][i] for i in non_zero_idx]
return tokenizer.decode(token_ids)
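The new left_idx/right_idx parameters clamp the positive map before decoding, so tokens that light up outside the selected phrase window are dropped. A tiny illustration with a hand-made boolean posmap (values are hypothetical):

import torch

posmap = torch.tensor([False, True, False, True, False, False])   # hits in two different phrases
left_idx, right_idx = 2, 4                                         # window of the phrase being kept
posmap[: left_idx + 1] = False
posmap[right_idx:] = False
print(posmap.nonzero(as_tuple=True)[0].tolist())                   # -> [3]; only the in-window token remains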

View File

@@ -1 +0,0 @@
__version__ = "0.1.0"

View File

@@ -6,5 +6,5 @@ yapf
timm
numpy
opencv-python
supervision==0.4.0
pycocotools
supervision>=0.22.0
pycocotools

View File

@@ -24,6 +24,18 @@ import glob
import os
import subprocess
import subprocess
import sys
def install_torch():
try:
import torch
except ImportError:
subprocess.check_call([sys.executable, "-m", "pip", "install", "torch"])
# Call the function to ensure torch is installed
install_torch()
import torch
from setuptools import find_packages, setup
from torch.utils.cpp_extension import CUDA_HOME, CppExtension, CUDAExtension
@@ -70,7 +82,7 @@ def get_extensions():
extra_compile_args = {"cxx": []}
define_macros = []
if torch.cuda.is_available() and CUDA_HOME is not None:
if CUDA_HOME is not None and (torch.cuda.is_available() or "TORCH_CUDA_ARCH_LIST" in os.environ):
print("Compiling with CUDA")
extension = CUDAExtension
sources += source_cuda
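With the relaxed condition above, the CUDA extension can be compiled on a machine where torch.cuda.is_available() returns False at build time (for example inside docker build), as long as CUDA_HOME points at a toolkit and the target architectures are declared. A hedged sketch; the architecture list is a placeholder for your GPUs:

import os
import subprocess
import sys

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6"     # placeholder compute capabilities
# CUDA_HOME must also be set, e.g. os.environ["CUDA_HOME"] = "/usr/local/cuda"
subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "."])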

test.ipynb 100644 (new file, +114 lines)
View File

@@ -0,0 +1,114 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"final text_encoder_type: bert-base-uncased\n"
]
},
{
"data": {
"application/json": {
"ascii": false,
"bar_format": null,
"colour": null,
"elapsed": 0.014210224151611328,
"initial": 0,
"n": 0,
"ncols": null,
"nrows": null,
"postfix": null,
"prefix": "Downloading model.safetensors",
"rate": null,
"total": 440449768,
"unit": "B",
"unit_divisor": 1000,
"unit_scale": true
},
"application/vnd.jupyter.widget-view+json": {
"model_id": "5922f34578364d36afa13de9f01254bd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading model.safetensors: 0%| | 0.00/440M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/root/miniconda3/lib/python3.8/site-packages/transformers/modeling_utils.py:881: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.\n",
" warnings.warn(\n",
"/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None\n",
" warnings.warn(\"None of the inputs have requires_grad=True. Gradients will be None\")\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from groundingdino.util.inference import load_model, load_image, predict, annotate\n",
"import cv2\n",
"\n",
"model = load_model(\"groundingdino/config/GroundingDINO_SwinT_OGC.py\", \"../04-06-segment-anything/weights/groundingdino_swint_ogc.pth\")\n",
"IMAGE_PATH = \".asset/cat_dog.jpeg\"\n",
"TEXT_PROMPT = \"chair . person . dog .\"\n",
"BOX_TRESHOLD = 0.35\n",
"TEXT_TRESHOLD = 0.25\n",
"\n",
"image_source, image = load_image(IMAGE_PATH)\n",
"\n",
"boxes, logits, phrases = predict(\n",
" model=model,\n",
" image=image,\n",
" caption=TEXT_PROMPT,\n",
" box_threshold=BOX_TRESHOLD,\n",
" text_threshold=TEXT_TRESHOLD\n",
")\n",
"\n",
"annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)\n",
"cv2.imwrite(\"annotated_image.jpg\", annotated_frame)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}