# Distributed on-disk index for 1T-scale datasets

This is the code corresponding to the description in [Indexing 1T vectors](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors).

All the code is in Python 3 (it is not compatible with Python 2).

The current code uses the Deep1B dataset for demonstration purposes, but it can scale to datasets 1000x larger.

To run it, download the Deep1B dataset as explained [here](../#getting-deep1b), and edit the paths to the dataset in the scripts.

The cluster commands are written for the Slurm batch scheduling system.
Changing to another type of scheduler should hopefully be quite straightforward.
## Distributed k-means

To cluster 500M vectors to 10M centroids, it is useful to have a distributed k-means implementation.
The distribution simply consists of splitting the training vectors across machines (servers), which perform the assignment to centroids.
The master (the client) then aggregates the results and updates the centroids.
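To make the division of work concrete, below is a minimal single-machine sketch of one k-means iteration that separates the assignment step (what the servers do) from the centroid update (what the client does). It is only an illustration; the actual logic lives in `distributed_kmeans.py`, described below.

```python
import numpy as np
import faiss
from scipy import sparse

def kmeans_iteration(xt, centroids):
    """One k-means iteration (illustration only, not the code of
    distributed_kmeans.py). xt and centroids are float32 numpy arrays."""
    n, d = xt.shape
    k = centroids.shape[0]

    # assignment step: this is the part the servers perform,
    # each on its own shard of xt
    index = faiss.IndexFlatL2(d)
    index.add(centroids)
    _, assign = index.search(xt, 1)

    # update step: this is what the client does after gathering the results.
    # A sparse k x n assignment matrix sums the vectors of each cluster.
    A = sparse.csr_matrix(
        (np.ones(n, dtype='float32'), (assign.ravel(), np.arange(n))),
        shape=(k, n))
    sums = np.asarray(A @ xt)                   # per-cluster sums
    counts = np.asarray(A.sum(axis=1)).ravel()  # per-cluster sizes
    nonempty = counts > 0
    new_centroids = centroids.copy()
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids
```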
The distributed k-means implementation here is based on the following files:

- [`distributed_kmeans.py`](distributed_kmeans.py) contains the k-means implementation.
  The main loop of k-means is re-implemented in Python but follows the Faiss C++ implementation closely, so it should not be significantly less efficient.
  It relies on a `DatasetAssign` object that performs the assignment to centroids, which is the bulk of the computation.
  The object can be a Faiss CPU index, a GPU index, or a set of remote CPU or GPU indexes.

- [`run_on_cluster.bash`](run_on_cluster.bash) contains the shell code to run the distributed k-means on a cluster.
The distributed k-means works with a Python install that contains Faiss and scipy (for sparse matrices).
It clusters the training data of Deep1B; this can easily be changed to any file in fvecs, bvecs or npy format that contains the training set.
The training vectors may be too large to fit in RAM, but they are memory-mapped so that should not be a problem.
The file is also assumed to be accessible from all the server machines, e.g. via a distributed file system.
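For reference, memory-mapping the training file with numpy could look like the sketch below (the file names are placeholders; the helper functions in the scripts may differ):

```python
import numpy as np

def mmap_fvecs(fname):
    """Memory-map an .fvecs file: each row stores the dimension d as an
    int32 followed by d float32 values."""
    x = np.memmap(fname, dtype='int32', mode='r')
    d = x[0]
    return x.view('float32').reshape(-1, d + 1)[:, 1:]

# .npy files can be memory-mapped directly
xt = np.load('deep1b_learn.npy', mmap_mode='r')   # placeholder path
# xt = mmap_fvecs('deep1b/learn.fvecs')           # placeholder path
```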
### Local tests

Edit `distributed_kmeans.py` to point `testdata` to your local copy of the dataset.

Then, 4 levels of sanity check can be run:
```bash
# reference Faiss C++ run
python distributed_kmeans.py --test 0
# using the Python implementation
python distributed_kmeans.py --test 1
# use the dispatch object (on local datasets)
python distributed_kmeans.py --test 2
# same, with GPUs
python distributed_kmeans.py --test 3
```
The output should look like [this gist](https://gist.github.com/mdouze/ffa01fe666a9325761266fe55ead72ad).
### Distributed sanity check

To run the distributed k-means, `distributed_kmeans.py` has to be run both on the server side (`--server` option) and on the client side (`--client` option).
Edit the top of `run_on_cluster.bash` to set the path of the data to cluster.

Sanity checks can be run with
```bash
# non-distributed baseline
bash run_on_cluster.bash test_kmeans_0
# using all the machine's GPUs
bash run_on_cluster.bash test_kmeans_1
# distributed run, with one local server per GPU
bash run_on_cluster.bash test_kmeans_2
```
The test `test_kmeans_2` simulates a distributed run on a single machine by starting one server process per GPU and connecting to the servers via the RPC protocol.
The output should look like [this gist](https://gist.github.com/mdouze/5b2dc69b74579ecff04e1686a277d32e).
### Distributed run

The way the script can be distributed depends on the cluster's scheduling system.
Here we use Slurm, but it should be relatively easy to adapt to any scheduler that can allocate a set of machines and start the same executable on all of them.

The command

```bash
bash run_on_cluster.bash slurm_distributed_kmeans
```
asks Slurm for 5 machines with 4 GPUs each, via the `srun` command.
All 5 machines run the script with the `slurm_within_kmeans_server` option.
They determine the number of servers and their own server id via the `SLURM_NPROCS` and `SLURM_PROCID` environment variables.

All machines start `distributed_kmeans.py` in server mode for the slice of the dataset they are responsible for.

In addition, machine #0 also starts the client.
The client finds the other servers via the `SLURM_JOB_NODELIST` environment variable.
It connects to all the servers and performs the clustering.
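As a sketch of how this works (not the actual logic of `run_on_cluster.bash` or `distributed_kmeans.py`), each process can derive its role from these Slurm variables:

```python
import os
import subprocess

nproc = int(os.environ['SLURM_NPROCS'])   # total number of server processes
rank = int(os.environ['SLURM_PROCID'])    # id of this process, 0 .. nproc-1

# expand the compact node list (e.g. "node[12-16]") into hostnames
hosts = subprocess.check_output(
    ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']]
).decode().split()

print(f'I am server {rank} of {nproc}; the servers run on {hosts}')
# every rank starts the k-means server for its slice of the data;
# rank 0 additionally starts the client that connects to all the servers.
```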
The output of the distributed run should look like [this gist](https://gist.github.com/mdouze/8d25e89fb4af5093057cae0f917da6cd).
### Run used for Deep1B

For the real run, we cluster 50M vectors to 1M centroids.
This is just a matter of using as many machines / GPUs as possible and setting the output centroids file with the `--out filename` option.
Then run

```bash
bash run_on_cluster.bash deep1b_clustering
```
The last lines of the output read like:

```bash
Iteration 19 (898.92 s, search 875.71 s): objective=1.33601e+07 imbalance=1.303 nsplit=0
0: writing centroids to /checkpoint/matthijs/ondisk_distributed/1M_centroids.npy
```

This means that the total training time was 899 s, of which 876 s were used for computation.
However, the computation time includes the I/O overhead to the assignment servers.
In this implementation, the overhead of transmitting the data is non-negligible, and so is the centroid computation stage.
This is due to the inefficient Python implementation and to the RPC protocol, which is not optimized for broadcast / gather operations (unlike MPI).
However, it is a simple implementation that should run on most clusters.
## Making the trained index

After the centroids are obtained, an empty trained index must be constructed.
This is done by:

- applying a pre-processing stage (a random rotation) to balance the dimensions of the vectors. This can be done after clustering; the clusters are just rotated as well.

- wrapping the centroids into an HNSW index to speed up the CPU-based assignment of vectors.

- training the 6-bit scalar quantizer used to encode the vectors.

This is performed by the script [`make_trained_index.py`](make_trained_index.py).
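For orientation, the construction could look roughly like the sketch below. The parameters and paths are placeholders, and the actual `make_trained_index.py` may differ in its details.

```python
import numpy as np
import faiss

d = 96               # Deep1B vectors are 96-dimensional
nlist = 1_000_000    # number of k-means centroids (placeholder)

centroids = np.load('1M_centroids.npy').astype('float32')   # placeholder path
xt = np.load('train_sample.npy').astype('float32')          # placeholder path

# pre-processing: random rotation to balance the dimensions
rrot = faiss.RandomRotationMatrix(d, d)
rrot.init(1234)                               # arbitrary seed

# HNSW index over the rotated centroids, used as the coarse quantizer
quantizer = faiss.IndexHNSWFlat(d, 32)
quantizer.add(rrot.apply_py(centroids))

# IVF index that encodes the vectors with a 6-bit scalar quantizer
index_ivf = faiss.IndexIVFScalarQuantizer(
    quantizer, d, nlist, faiss.ScalarQuantizer.QT_6bit)
index_ivf.train(rrot.apply_py(xt))            # trains only the scalar quantizer

index = faiss.IndexPreTransform(rrot, index_ivf)
faiss.write_index(index, 'trained.faissindex')
```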
## Building the index by slices

We call the slices "vslices" as they are vertical slices of the big matrix; see the explanation in the wiki section [Split across database partitions](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors#split-across-database-partitions).

The script [make_index_vslice.py](make_index_vslice.py) makes an index for a subset of the vectors of the input data and stores it as an independent index.
There are 200 slices of 5M vectors each for Deep1B.
The slices can be processed in a brute-force parallel fashion; there is no constraint on ordering.
To run the script in parallel on a Slurm cluster, use:
```bash
bash run_on_cluster.bash make_index_vslices
```
For a real dataset, the data would be read from a DBMS.
In that case, reading the data and indexing it in parallel is worthwhile because reading is very slow.
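Conceptually, indexing one vslice amounts to something like the following (paths and the slice number are placeholders; the actual `make_index_vslice.py` handles more details):

```python
import numpy as np
import faiss

slice_no = 17                 # placeholder: which of the 200 slices
slice_size = 5_000_000        # 5M vectors per slice for Deep1B
i0, i1 = slice_no * slice_size, (slice_no + 1) * slice_size

# start from a fresh copy of the empty trained index
index = faiss.read_index('trained.faissindex')            # placeholder path

# memory-mapped database vectors: only rows i0:i1 are actually read
xb = np.load('deep1b_base.npy', mmap_mode='r')            # placeholder path
xslice = np.ascontiguousarray(xb[i0:i1], dtype='float32')

# add the vectors with their global ids so the slices can be merged later
index.add_with_ids(xslice, np.arange(i0, i1).astype('int64'))
faiss.write_index(index, f'vslices/slice{slice_no}.faissindex')
```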
## Splitting across inverted lists

The 200 vertical slices need to be merged together.
This is done with the script [merge_to_ondisk.py](merge_to_ondisk.py), which memory-maps the 200 vertical slice indexes, extracts a subset of the inverted lists and writes them to a contiguous horizontal slice.
We slice the inverted lists into 50 horizontal slices.
This is run with
```bash
bash run_on_cluster.bash make_index_hslices
```
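As a reference point, the generic on-disk merge pattern from the Faiss demos is sketched below; `merge_to_ondisk.py` follows the same idea but restricts each output file to a contiguous range of inverted lists (one horizontal slice). The paths and the pre-transform handling are assumptions here.

```python
import faiss

# memory-map the vertical-slice indexes so that their codes stay on disk
ivfs = []
for i in range(200):
    index = faiss.read_index(f'vslices/slice{i}.faissindex',   # placeholder paths
                             faiss.IO_FLAG_MMAP)
    index_ivf = faiss.extract_index_ivf(index)
    index_ivf.own_invlists = False     # keep the invlists alive after GC
    ivfs.append(index_ivf.invlists)

# start from the empty trained index and attach on-disk inverted lists to it
index = faiss.read_index('trained.faissindex')
index_ivf = faiss.extract_index_ivf(index)
invlists = faiss.OnDiskInvertedLists(index_ivf.nlist, index_ivf.code_size,
                                     'merged.ivfdata')

# merge the inverted lists of all slices into the on-disk structure
ivf_vector = faiss.InvertedListsPtrVector()
for ivf in ivfs:
    ivf_vector.push_back(ivf)
ntotal = invlists.merge_from(ivf_vector.data(), ivf_vector.size())

index_ivf.replace_invlists(invlists)
index_ivf.ntotal = index.ntotal = ntotal
faiss.write_index(index, 'merged.faissindex')
```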
## Querying the index

At this point the index is ready.
The horizontal slices need to be loaded in the right order and combined into an index to be usable.
This is done in the [combined_index.py](combined_index.py) script.
It provides a `CombinedIndexDeep1B` object that contains an index object that can be searched.
To test it, run:

```bash
python combined_index.py
```

The output should look like:
```bash
(faiss_1.5.2) matthijs@devfair0144:~/faiss_versions/faiss_1Tcode/faiss/benchs/distributed_ondisk$ python combined_index.py
reading /checkpoint/matthijs/ondisk_distributed//hslices/slice49.faissindex
loading empty index /checkpoint/matthijs/ondisk_distributed/trained.faissindex
replace invlists
loaded index of size 1000000000
nprobe=1 1-recall@1=0.2904 t=12.35s
nprobe=10 1-recall@1=0.6499 t=17.67s
nprobe=100 1-recall@1=0.8673 t=29.23s
nprobe=1000 1-recall@1=0.9132 t=129.58s
```

i.e. searching is a lot slower than from RAM.
## Distributed query

To reduce the bandwidth required from the machine that performs the queries, it is possible to split the search across several search servers.
This way, only the final results are returned to the main machine.

The search client and server are implemented in [`search_server.py`](search_server.py).
It can be used as a script to start a search server for `CombinedIndexDeep1B`, or as a module to load the clients.
The search servers can be started with

```bash
bash run_on_cluster.bash run_search_servers
```

(adjust to the number of servers that can be used).
An example of a search client is [`distributed_query_demo.py`](distributed_query_demo.py).
It connects to the servers and assigns each of them a subset of inverted lists to visit.
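The split of the inverted lists can be as simple as assigning contiguous ranges, for example (a hypothetical sketch, not necessarily what `distributed_query_demo.py` does):

```python
# hypothetical: split nlist inverted lists into contiguous ranges, one per server
nlist = 1_000_000
nserver = 4           # placeholder: number of search servers

ranges = [(nlist * i // nserver, nlist * (i + 1) // nserver)
          for i in range(nserver)]
# each server visits only the inverted lists in its (l0, l1) range and returns
# its local top-k results; the client merges the per-server result lists.
```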
A typical output of this demo is [this gist](https://gist.github.com/mdouze/1585b9854a9a2437d71f2b2c3c05c7c5).
The number in MiB indicates the amount of data that is read from disk to perform the search.
In this case, the scale of the dataset is too small for the distributed search to have much impact, but on datasets more than 10x larger the difference becomes more significant.
## Conclusion

This code contains the core components to build an index that scales up to 1T vectors.
There are a few simplifications with respect to the index that was effectively used in [Indexing 1T vectors](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors).