11 Commits

Author SHA1 Message Date
Michael Norris
835b3ea1bd Fix IVF quantizer centroid sharding so IDs are generated (#4197)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4197

Ivan and I discussed 2 problems:
1. We may want to try to offload/shard PQ or SQ table data if there is a big enough win (pending)
2. IDs seem to be random after sharding.

This diff solves 2.

Root cause is that we add to quantizer without IDs.

Instead, we wrap in IndexIDMap2 (which provides reconstruction, whereas IndexIDMap does not).

Laser's quantizers are Flat and HNSW, so we can wrap like this.

Reviewed By: ivansopin

Differential Revision: D69832788

fbshipit-source-id: 331b6d1cf52666f5dac61e2b52302d46b0a83708
2025-02-24 16:01:08 -08:00
Michael Norris
aff6bfcd80 Add sharding convenience function for IVF indexes (#4150)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/4150

Creates a sharding convenience function for IVF indexes.
- The __**centroids on the quantizer**__ are sharded based on the given sharding function. (not the data, as data sharding by ids is already implemented by copy_subuset_to, https://github.com/facebookresearch/faiss/blob/main/faiss/IndexIVF.h#L408)
- The output is written to files based on the template filename generator param.
- The default sharding function is simply the ith vector mod the total shard count.

This would called by Laser here: https://www.internalfb.com/code/fbsource/[ce1f2e028e79]/fbcode/fblearner/flow/projects/laser/laser_sim_search/knn_trainer.py?lines=295-296. This convenience function will do the file writing, and return the created file names.

There's a few key required changes in FAISS:
1. Allow `std::vector<std::string>` to be used. Updates swigfaiss.swig and array_conversions.py to accommodate. These have to be numpy dtype of `object` instead of the more correct `unicode`, because unicode dtype is fixed length. I couldn't figure out how to create a numpy array with each of the output file names where they have different dtypes. (Say the file names are like file1, file11, file111. The dtype would need to be U5, U6, U7 respectively, as the dtype for unicode contains the length). I tried structured arrays : this does not work either, as numpy makes it into a matrix instead: the `file1 file11 file111` example with explicit setting of U5, U6, U7 turns into `[[file1 file1 file1], [file1 file11 file11], [file1 file11 file111]]`, which we do not want. If someone knows the right syntax, please yell at me
2. Create Python callbacks for sharding and template filename: `PyCallbackFilenameTemplateGenerator` and `PyCallbackShardingFunction`. Users of this function would inherit from the FilenameTemplateGenerator or ShardingFunction in C++ to pass to `shard_ivf_index_centroids`. See the other examples in python_callbacks.cpp. This is required because Python functions cannot be passed through SWIG to C++ (i.e. no std::function or function pointers), so we have to use this approach. This approach allows it to be called from both C++ and Python. test_sharding.py shows the Python calling, test_utils.cpp shows the C++ calling.

Reviewed By: asadoughi

Differential Revision: D68534991

fbshipit-source-id: b857e20c6cc4249a2ab7792db4c93dd4fb8403fd
2025-02-07 11:39:59 -08:00
Michael Norris
eff0898a13 Enable linting: lint config changes plus arc lint command (#3966)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3966

This actually enables the linting.

Manual changes:
- tools/arcanist/lint/fbsource-licenselint-config.toml
- tools/arcanist/lint/fbsource-lint-engine.toml

Automated changes:
`arc lint --apply-patches --take LICENSELINT --paths-cmd 'hg files faiss'`

Reviewed By: asadoughi

Differential Revision: D64484165

fbshipit-source-id: 4f2f6e953c94ef6ebfea8a5ae035ccfbea65ed04
2024-10-22 09:46:48 -07:00
Xiao Fu
5e452ed52a Cleaning up more unnecessary print (#3455)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3455

Code quality control by reducing the number of prints

Reviewed By: junjieqi

Differential Revision: D57502194

fbshipit-source-id: a6cd65ed4cc49590ce73d2978d41b640b5259c17
2024-05-17 16:59:36 -07:00
Matthijs Douze
8fc3775472 building blocks for hybrid CPU / GPU search (#2638)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/2638

This diff is a more streamlined way of searching IVF indexes with precomputed clusters.
This will be used for experiments with hybrid CPU / GPU search.

Reviewed By: algoriddle

Differential Revision: D41301032

fbshipit-source-id: a1d645fd0f2bf806454dfd04971edc0a6200d20d
2023-01-12 13:34:44 -08:00
Check Deng
55c93f3cde Handle the situation where nprobe > nlist in IndexBinaryIVF (#1695)
Summary:
## Description

It is the same as https://github.com/facebookresearch/faiss/pull/1673 but for `IndexBinaryIVF`. Ensure that `nprobe` is no more than `nlist`.

## Changes
1. Replace `nprobe` with `min(nprobe, nlist)`
2. Replace `long` with `idx_t` in `IndexBinaryIVF.cpp`
3. Add a unit test
4. Fix a small bug in https://github.com/facebookresearch/faiss/pull/1673, `index` should be replaced by `gt_index`

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1695

Reviewed By: wickedfoo

Differential Revision: D26603278

Pulled By: mdouze

fbshipit-source-id: a4fb79bdeb975e9d8ec507177596c36da1195646
2021-02-23 12:20:37 -08:00
Chengqi Deng
b4a0a9c617 Handle the situation where nprobe > nlist in IndexIVF (#1673)
Summary:
## Description

Fix the bug mentioned in https://github.com/facebookresearch/faiss/issues/1010. When `nprobe` is greater than `nlist` in `IndexIVF`, the program will crash because the index will ask the quantizer to return more centroids than it owns.

## Changes:
1. Set `nprobe` as `nlist` if it is greater than `nlist` during searching.
2. Add one test to detect this bug.
3. Fix typo in `IndexPQ.cpp`.

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1673

Reviewed By: wickedfoo

Differential Revision: D26454420

Pulled By: mdouze

fbshipit-source-id: d1d0949e30802602e975a94ba873f9db29abd5ab
2021-02-16 09:54:23 -08:00
Matthijs Douze
5ad630635c expose threat-safe stats (#1438)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/1438

This diff changes Faiss and the `combined_index.py` to propagate thread-safe stats to handler.py

Reviewed By: MDSilber

Differential Revision: D24082543

fbshipit-source-id: 944e6b7630daeede5eb9501b81557a6fe5afec44
2020-10-03 23:26:36 -07:00
Lucas Hosseini
22b7876ef5
Facebook sync (2020-03-10) (#1136) 2020-03-10 14:24:07 +01:00
Lucas Hosseini
a8118acbc5
Facebook sync (May 2019) + relicense (#838)
Changelog:

- changed license: BSD+Patents -> MIT
- propagates exceptions raised in sub-indexes of IndexShards and IndexReplicas
- support for searching several inverted lists in parallel (parallel_mode != 0)
- better support for PQ codes where nbit != 8 or 16
- IVFSpectralHash implementation: spectral hash codes inside an IVF
- 6-bit per component scalar quantizer (4 and 8 bit were already supported)
- combinations of inverted lists: HStackInvertedLists and VStackInvertedLists
- configurable number of threads for OnDiskInvertedLists prefetching (including 0=no prefetch)
- more test and demo code compatible with Python 3 (print with parentheses)
- refactored benchmark code: data loading is now in a single file
2019-05-28 16:17:22 +02:00
Lucas Hosseini
323dbf3be3
Facebook sync (Dec 2018). (#660)
* Add GpuIndexBinaryFlat
* Add IndexBinaryHNSW
2018-12-19 17:48:35 +01:00