to download the data to subdirectory deep1b/, then concatenate the
database files to base.fvecs and the training files to learn.fvecs
### Running the experiments
These experiments are quite long. To support resuming, the script
stores the result of raining to a temporary directory, `/tmp/bench_polysemous`.
The script `bench_polysemous_1bn.py` takes at leas two arguments:
- the dataset name: SIFT1000M (aka SIFT1B, aka BIGANN) or Deep1B. SIFT1M, SIFT2M,... are also supported to make subsets of for small experiments (note that SIFT1M is not the same as the SIFT1M above)
- the type of index to build, which should be a valid [index_factory key](https://github.com/facebookresearch/faiss/wiki/High-level-interface-and-auto-tuning#index-factory) (see below for examples)
- the remaining arguments are parsed as search-time parameters.
### Experiments of Table 2
The `IMI*+PolyD+ADC` results in Table 2 can be reproduced with (for 16 bytes):
The result reported in table 2 is the one for which the %pass (percentage of code comparisons that pass the Hamming check) is around 20%, which occurs for Hamming threshold `ht=48`.
The 8-byte results can be reproduced with the factory key `IMI2x12,PQ8`
### Experiments of the appendix
The experiments in the appendix are only in the ArXiv version of the paper (table 3).
The benchmarks below run 1 or 4 Titan X GPUs and reproduce the results of the "GPU paper". They are also a good starting point on how to use GPU Faiss.
See above on how to get SIFT1M into subdirectory sift1M/. The script [`bench_gpu_sift1m.py`](bench_gpu_sift1m.py) reproduces the "exact k-NN time" plot in the ArXiv paper, and the SIFT1M numbers.
The output is:
```
============ Exact search
add vectors to index
warmup
benchmark
k=1 0.715 s, R@1 0.9914
k=2 0.729 s, R@1 0.9935
k=4 0.731 s, R@1 0.9935
k=8 0.732 s, R@1 0.9935
k=16 0.742 s, R@1 0.9935
k=32 0.737 s, R@1 0.9935
k=64 0.753 s, R@1 0.9935
k=128 0.761 s, R@1 0.9935
k=256 0.799 s, R@1 0.9935
k=512 0.975 s, R@1 0.9935
k=1024 1.424 s, R@1 0.9935
============ Approximate search
train
WARNING clustering 100000 points to 4096 centroids: please provide at least 159744 training points
add vectors to index
WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 256000000 B, highwater 256000000 B)
warmup
benchmark
nprobe= 1 0.043 s recalls= 0.3909 0.4312 0.4312
nprobe= 2 0.040 s recalls= 0.5041 0.5636 0.5636
nprobe= 4 0.048 s recalls= 0.6048 0.6897 0.6897
nprobe= 8 0.064 s recalls= 0.6879 0.8028 0.8028
nprobe= 16 0.088 s recalls= 0.7534 0.8940 0.8940
nprobe= 32 0.134 s recalls= 0.7957 0.9549 0.9550
nprobe= 64 0.224 s recalls= 0.8125 0.9833 0.9834
nprobe= 128 0.395 s recalls= 0.8205 0.9953 0.9954
nprobe= 256 0.717 s recalls= 0.8227 0.9993 0.9994
nprobe= 512 1.348 s recalls= 0.8228 0.9999 1.0000
```
The run produces two warnings:
- the clustering complains that it does not have enough training data, there is not much we can do about this.
- the add() function complains that there is an inefficient memory allocation, but this is a concern only when it happens often, and we are not benchmarking the add time anyways.
To index small datasets, it is more efficient to use a `GpuIVFFlat`, which just stores the full vectors in the inverted lists. We did not mention this in the the paper because it is not as scalable. To experiment with this setting, change the `index_factory` string from "IVF4096,PQ64" to "IVF16384,Flat". This gives:
To get the "infinite MNIST dataset", follow the instructions on [Léon Bottou's website](http://leon.bottou.org/projects/infimnist). The script assumes the file `mnist8m-patterns-idx3-ubyte` is in subdirectory `mnist8m`
The script [`kmeans_mnist.py`](kmeans_mnist.py) produces the following output:
```
python kmeans_mnist.py 1 256
...
Clustering 8100000 points in 784D to 256 clusters, redo 1 times, 20 iterations
The script [`bench_gpu_1bn.py`](bench_gpu_1bn.py) runs multi-gpu searches on the two 1-billion vector datasets we considered. It is more complex than the previous scripts, because it supports many search options and decomposes the dataset build process in Python to exploit the best possible CPU/GPU parallelism and GPU distribution.