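Vector Search

1. Introduction

Some vertical-domain recognition tasks (for example vehicles or retail products) have a very large number of classes, so they are usually handled by retrieval: a fast nearest-neighbor search of the query vector against the gallery vectors returns the matching predicted class. The vector search module provides a basic approximate nearest neighbor search algorithm. It is based on Möbius, a graph-based approximate nearest neighbor search algorithm developed at Baidu for maximum inner product search (MIPS). The module exposes a Python interface, accepts numpy and tensor vectors, and supports L2 and Inner Product distance computation.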
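To make the two distance types concrete, here is a minimal exact (brute-force) search in numpy. It is only a reference for what an approximate search is trying to reproduce, not the module's algorithm; `brute_force_search` is a hypothetical helper that does not exist in this folder.

```python
import numpy as np


def brute_force_search(gallery, query, k=10, dist_type="IP"):
    """Exact top-k search over the gallery (hypothetical helper, not part of the module)."""
    if dist_type == "IP":
        # Inner product: a larger score means a better match.
        scores = gallery @ query
        order = np.argsort(-scores)[:k]
    elif dist_type == "L2":
        # Euclidean distance: a smaller score means a better match.
        scores = np.linalg.norm(gallery - query, axis=1)
        order = np.argsort(scores)[:k]
    else:
        raise ValueError("dist_type must be 'IP' or 'L2'")
    return scores[order], order


rng = np.random.default_rng(0)
gallery = rng.random((1000, 128), dtype=np.float32)
query = rng.random(128, dtype=np.float32)

ip_scores, ip_ids = brute_force_search(gallery, query, k=5, dist_type="IP")
l2_scores, l2_ids = brute_force_search(gallery, query, k=5, dist_type="L2")
print(ip_ids, l2_ids)
```

Exact search scans every gallery vector for each query, which is the cost that a graph-based approximate method aims to avoid on large galleries.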

For details of the Möbius algorithm, see the paper (Möbius Transformation for Fast Inner Product Search on Graph, Code).

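2. Installation

2.1 Using the provided library files

This folder contains a prebuilt index.so (compiled with gcc 8.2.0, for Linux) and a prebuilt index.dll (compiled with gcc 10.3.0, for Windows). If they work in your environment, you can skip Sections 2.2 and 2.3 and use them directly.

If the library files cannot be used because your gcc version is too old or your environment is otherwise incompatible, you need to compile them manually on your platform.

Note: make sure your C++ compiler supports the C++11 standard.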
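One way to check whether a prebuilt library is usable before deciding to recompile is to try loading it with `ctypes`. The sketch below assumes it is run from this folder; `library_filename` and `can_load` are hypothetical helpers, and how interface.py itself loads the library is not shown here.

```python
import ctypes
import os
import platform


def library_filename(system=None):
    """Return the prebuilt library name for an OS name (hypothetical helper)."""
    system = system or platform.system()
    # Assumption: index.dll on Windows, index.so everywhere else.
    return "index.dll" if system == "Windows" else "index.so"


def can_load(path):
    """Return True if the dynamic loader accepts the file at `path`."""
    if not os.path.exists(path):
        return False
    try:
        ctypes.CDLL(os.path.abspath(path))
    except OSError:
        # Raised, for example, when the file was built for another platform.
        return False
    return True


lib_name = library_filename()
print(lib_name, "loadable:", can_load(lib_name))
```

If this prints `loadable: False`, compiling the library as described in Section 2.2 or 2.3 is the likely next step.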

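2.2 Building the library on Linux

Run the following commands to install gcc and g++.

sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get install build-essential gcc g++

You can check the gcc version with the command gcc -v.

Enter this folder and run make. To regenerate index.so, first run make clean to remove the previously generated files, then run make again to build the updated library.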

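2.3 Building the library on Windows

On Windows you first need to install a gcc toolchain. TDM-GCC is recommended: on its official website you can choose a suitable version to download. The recommended download is tdm64-gcc-10.3.0-2.exe.

After the download finishes, follow the default installation steps. There are three points to note:

- The vector search module depends on openmp, so at the "choose components" step of the installer you must tick the openmp option. Otherwise compilation later fails with the error libgomp.spec: No such file or directory (see the reference link).
- The installer asks whether to add gcc to the system environment variables. Ticking this option is recommended; otherwise you have to add it to the system environment variables manually before use.
- The build command is make on Linux but mingw32-make on Windows, so take care to use the right one.

After installation, open a command-line terminal and check the gcc version with the command gcc -v.

In this folder, run mingw32-make to generate the index.dll library. To regenerate index.dll, first run mingw32-make clean to remove the previously generated files, then run mingw32-make again to build the updated library.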
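The clean-then-build sequence from Sections 2.2 and 2.3 can be wrapped in a small script that picks the right make command for the current platform. This is only a convenience sketch: `make_command` and `rebuild` are hypothetical helpers, and the script assumes the make tool is already on the PATH.

```python
import platform
import subprocess


def make_command(system=None):
    """Return the make tool name for an OS name (hypothetical helper)."""
    system = system or platform.system()
    return "mingw32-make" if system == "Windows" else "make"


def rebuild(folder="."):
    """Run `clean` and then the default target in `folder`."""
    make = make_command()
    # check=True raises CalledProcessError if either step fails.
    subprocess.run([make, "clean"], cwd=folder, check=True)
    subprocess.run([make], cwd=folder, check=True)


print("build tool for this platform:", make_command())
```

Calling `rebuild()` from this folder would then run the same two commands described above.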

3. Quick Start

import numpy as np
from interface import Graph_Index

# Generate random samples
index_vectors = np.random.rand(100000,128).astype(np.float32)
query_vector = np.random.rand(128).astype(np.float32)
index_docs = ["ID_"+str(i) for i in range(100000)]

# Initialize the index structure
indexer = Graph_Index(dist_type="IP") # "IP" and "L2" are supported
indexer.build(gallery_vectors=index_vectors, gallery_docs=index_docs, pq_size=100, index_path='test')

# Query
scores, docs = indexer.search(query=query_vector, return_k=10, search_budget=100)
print(scores)
print(docs)

# Save and load
indexer.dump(index_path="test")
indexer.load(index_path="test")
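Because the search is approximate, it can be useful to measure how many of the exact top-k results it actually returns. The sketch below computes recall for one query from two lists of document IDs. It assumes that `docs` returned by `search` is a sequence of the ID strings passed as `gallery_docs`, which the code above suggests but does not confirm; `recall_at_k` is a hypothetical helper, and the two lists are made-up illustration data.

```python
def recall_at_k(approx_docs, exact_docs):
    """Fraction of the exact top-k results that the approximate search also returned."""
    exact = set(exact_docs)
    if not exact:
        return 0.0
    return len(exact & set(approx_docs)) / len(exact)


# Illustration data only: IDs from an approximate search and from an exact search.
approx = ["ID_3", "ID_7", "ID_9", "ID_1"]
exact = ["ID_3", "ID_7", "ID_2", "ID_1"]
print(recall_at_k(approx, exact))  # 0.75
```

In practice the exact list would come from a brute-force search over the same gallery, and the recall would be averaged over many queries.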
