Commit Graph

9 Commits (21dfdbaaa0e30f2e16ad98ae4f94c2952e7178ce)

Author SHA1 Message Date
Kumar Saurabh Arora 37f6b76fe1 Adding support for index builder (#3800)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3800

In this diff,
1. A codec can be referenced using either its descriptor name or a remote path in IndexFromCodec.
2. Expose serialization of the full index through BuildOperator.
3. Rename get_local_filename to get_local_filepath.

Reviewed By: satymish

Differential Revision: D61813717

fbshipit-source-id: ed422751a1d3712565efa87ecf615620799cb8eb
2024-08-27 10:02:15 -07:00
Kumar Saurabh Arora 80a2462483 Fixing initialization of dictionary in dataclass (#3749)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3749

same as title
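The usual pitfall behind a fix like this (a minimal sketch assuming the standard `default_factory` idiom; the class and field names here are hypothetical, not the actual bench_fw definitions):

```python
from dataclasses import dataclass, field

# `params: dict = {}` would raise ValueError at class-definition time
# (mutable defaults are disallowed in dataclasses), and a dict shared
# across instances would be a bug anyway. The standard fix is a
# per-instance default_factory:
@dataclass
class Descriptor:  # hypothetical name, for illustration only
    name: str = ""
    params: dict = field(default_factory=dict)

a = Descriptor()
b = Descriptor()
a.params["nprobe"] = 8  # does not leak into b.params
```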

Reviewed By: satymish

Differential Revision: D61133788

fbshipit-source-id: 5761e6347365f7701ee0600a9d895b8bd1f0a6b8
2024-08-12 17:49:43 -07:00
Kumar Saurabh Arora 290464f23b Adding embedding column to dataset descriptor (#3736)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3736

Nit - adding an embedding column to the dataset descriptor
Nit - initializing cached_ds as part of the class instead of in __post_init__
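A sketch of what the descriptor might look like after this change (field names are illustrative, not the actual bench_fw definitions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetDescriptor:  # hypothetical shape, for illustration only
    name: Optional[str] = None
    embedding_column: Optional[str] = None  # newly added column selector
    # cached_ds declared as a regular dataclass field with a default
    # factory, instead of being assigned in __post_init__:
    cached_ds: dict = field(default_factory=dict)
```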

Reviewed By: satymish

Differential Revision: D60858496

fbshipit-source-id: 3358d866a0668424cd6895bc7a5c620ff97e72fa
2024-08-09 17:07:36 -07:00
Kumar Saurabh Arora da75d03442 Refactor bench_fw to support train, build & search in parallel (#3527)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3527

**Context**
Design Doc: [Faiss Benchmarking](https://docs.google.com/document/d/1c7zziITa4RD6jZsbG9_yOgyRjWdyueldSPH6QdZzL98/edit)

**In this diff**
1. Codecs and indices can be referenced from the blobstore (bucket & path) outside the experiment.
2. To support #1, naming is moved to descriptors.
3. Built indices can be written as well.
4. You can run a benchmark with training, then reference the result when building an index, then reference the built index in a k-NN search. Index serialization is optional, though not yet exposed through the index descriptor.
5. A benchmark can evaluate an index across different dataset sizes.
6. Working with varying dataset sizes now supports multiple ground truths. Small fixes may be needed before this can be used.
7. Added targets for bench_fw_range, ivf, codecs and optimize.
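The chaining in points 2-4 can be pictured as follows (a minimal sketch with made-up descriptor names; the real framework uses descriptor objects, not plain dicts):

```python
# Each stage names its output, and later stages refer to earlier outputs
# by that name, so train, build and search can be scheduled as separate
# jobs that run in parallel where the dependencies allow it.
train_job = {"op": "train", "output": "codec:pq32"}
build_job = {"op": "build", "codec": "codec:pq32",
             "output": "index:ivf1024_pq32"}
search_job = {"op": "search", "index": "index:ivf1024_pq32", "k": 10}
```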

**Analysis of ivf result**: D58823037

Reviewed By: algoriddle

Differential Revision: D57236543

fbshipit-source-id: ad03b28bae937a35f8c20f12e0a5b0a27c34ff3b
2024-06-21 13:04:09 -07:00
Gergely Szilvasy 1d0e8d489f index optimizer (#3154)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3154

Uses the benchmark to find Pareto-optimal indices, with BigANN as an example.

Separately optimize the coarse quantizer and the vector codec and use Pareto optimal configurations to construct IVF indices, which are then retested at various scales. See `optimize()` in `optimize.py` as the main function driving the process.

The results can be interpreted with `bench_fw_notebook.ipynb`, which allows:
* filtering by maximum code size
* filtering by maximum time
* filtering by minimum accuracy
* restricting to space- or time-Pareto-optimal options
and visualizes the results and outputs them as a table.

This version is intentionally limited to IVF(Flat|HNSW),PQ|SQ indices...
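The Pareto selection itself can be sketched as follows (illustrative data and a naive filter, not the code in `optimize.py`):

```python
# Keep only operating points not dominated by another point that is
# both faster and more accurate. The measurements below are made up.
points = [
    {"name": "PQ16", "time": 1.0, "accuracy": 0.70},
    {"name": "PQ32", "time": 1.8, "accuracy": 0.85},
    {"name": "SQ8",  "time": 2.5, "accuracy": 0.80},  # dominated by PQ32
]

def pareto(points):
    return [
        p for p in points
        if not any(q["time"] < p["time"] and q["accuracy"] > p["accuracy"]
                   for q in points)
    ]
```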

Reviewed By: mdouze

Differential Revision: D51781670

fbshipit-source-id: 2c0f800d374ea845255934f519cc28095c00a51f
2024-01-30 10:58:13 -08:00
Gergely Szilvasy beef6107fc faiss paper benchmarks (#3189)
Summary:
- IVF benchmarks: `bench_fw_ivf.py bigann /checkpoint/gsz/bench_fw/ivf`
- Codec benchmarks: `bench_fw_codecs.py contriever /checkpoint/gsz/bench_fw/codecs` and `bench_fw_codecs.py deep1b /checkpoint/gsz/bench_fw/codecs`
- A range codec evaluation: `bench_fw_range.py ssnpp /checkpoint/gsz/bench_fw/range`
- Visualize with `bench_fw_notebook.ipynb`
- Support for running on a cluster

Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3189

Reviewed By: mdouze

Differential Revision: D52544642

Pulled By: algoriddle

fbshipit-source-id: 21dcdfd076aef6d36467c908e6be78ef851b0e98
2024-01-05 09:27:04 -08:00
Gergely Szilvasy 9519a19f42 benchmark refactor
Summary:
1. Support for index construction parameters outside of the factory string (arbitrary depth of quantizers).
2. A refactor providing an index wrapper, a prerequisite for the optimizer, which will generate indices from pre-optimized components (particularly quantizers).
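Point 1 can be illustrated as follows (a hypothetical spec; the factory string and the per-level parameter names are made up for illustration, not the actual wrapper API):

```python
# Construction parameters that do not fit in the factory string are
# carried alongside it, one dict per quantizer level, so a wrapper can
# configure arbitrarily nested quantizers.
index_spec = {
    "factory": "IVF4096(HNSW32),Flat",
    "construction_params": [
        {"nprobe": 16},          # top-level IVF (illustrative)
        {"efConstruction": 200}  # nested HNSW coarse quantizer
    ],
}
```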

Reviewed By: mdouze

Differential Revision: D51427452

fbshipit-source-id: 014d05dd798d856360f2546963e7cad64c2fcaeb
2023-12-04 05:53:17 -08:00
Gergely Szilvasy c3b9374984 bench_fw - fixes & nits for oss (#3102)
Summary: Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3102

Reviewed By: pemazare

Differential Revision: D50426528

Pulled By: algoriddle

fbshipit-source-id: 886960b8b522318967fc5ec305666871b496cae8
2023-10-20 07:53:56 -07:00
Gergely Szilvasy 0a00d8137a offline index evaluation (#3097)
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3097

A framework for evaluating indices offline.

Long term objectives:
1. Generate offline similarity index performance data with test datasets both for existing indices and automatically generated alternatives. That is, given a dataset and some constraints this workflow should automatically discover optimal index types and parameter choices as well as evaluate the performance of existing production indices and their parameters.
2. Allow researchers, platform owners (Laser, Unicorn) and product teams to understand how different index types perform on their datasets and make optimal choices w.r.t. their objectives; longer term, to enable automatic decision-making/auto-tuning.

Constraints, design choices:
1. I want to run the same evaluation on Meta-internal infrastructure (fblearner, data from Hive and Manifold) or on a local machine + research cluster (data on local disk or NFS) via OSS Faiss. Via fblearner, this should work in a way that it can be turned into a service and plugged into Unicorn or Laser, while the core Faiss part can be used/referred to in our research and to update the wiki with the latest results/recommendations for public datasets.
2. It must support a range of metrics for KNN and range search, and it should be easy to add new ones. Cost metrics need to be fine-grained to allow extrapolation.
3. It should automatically sweep all query-time params (e.g. nprobe, polysemous code hamming distance, params of quantizers), using `OperatingPointsWithRanges` to cut down the optimal-param search space. (For now, it sweeps nprobe only.)
4. [FUTURE] It will generate/sweep index creation hyperparams (factory strings, quantizer sizes, quantizer params), using heuristics.
5. [FUTURE] It will sweep the dataset size: start small test with e.g. 100K db vectors and go up to millions, billions potentially, while narrowing down the index+param choices at each step.
6. [FUTURE] Extrapolate perf metrics (cost and accuracy)
7. Intermediate results must be saved (to disk, to manifold) throughout, and reused as much as possible to cut down on overall runtime and enable faster iteration during development.

For range search, this diff supports the metric proposed in https://docs.google.com/document/d/1v5OOj7kfsKJ16xzaEHuKQj12Lrb-HlWLa_T2ct0LJiw/edit?usp=sharing. I also added support for the classical case where the scoring function steps from 1 to 0 at some arbitrary threshold.

For KNN, I added knn_intersection, but other metrics, particularly recall@1 will also be interesting. I also added the distance_ratio metric, which we previously discussed as an interesting alternative, since it shows how much the returned results approximate the ground-truth nearest-neighbours in terms of distances.
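A sketch of the knn_intersection idea (the exact definition in the framework may differ): the average fraction of the k retrieved IDs that also appear in the ground-truth top-k.

```python
import numpy as np

def knn_intersection(I, I_gt):
    # I, I_gt: (nq, k) arrays of retrieved / ground-truth neighbor IDs.
    k = I.shape[1]
    return float(np.mean([
        len(set(row) & set(gt_row[:k])) / k
        for row, gt_row in zip(I, I_gt)
    ]))

I = np.array([[1, 2, 3], [4, 5, 6]])
I_gt = np.array([[1, 2, 9], [4, 7, 8]])
score = knn_intersection(I, I_gt)  # (2/3 + 1/3) / 2
```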

In the test case, I evaluated three current production indices for VCE with 1M vectors in the database and 10K queries. Each index is tested at various operating points (nprobes), which are shown on the charts. The results are not extrapolated to the true scale of these indices.

Reviewed By: yonglimeta

Differential Revision: D49958434

fbshipit-source-id: f7f567b299118003955dc9e2d9c5b971e0940fc5
2023-10-17 13:56:02 -07:00