Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3800
In this diff,
1. A codec can be referenced either by its descriptor name or by a remote path in IndexFromCodec (see the sketch below).
2. Expose serialization of the full index through BuildOperator.
3. Rename get_local_filename to get_local_filepath.
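A minimal, hedged sketch of the two reference styles in item 1. The helper `io.read_index` and the signature of `get_local_filepath` (renamed in item 3) are illustrative assumptions, not the actual IndexFromCodec API:
```python
import faiss

def load_codec(io, desc_name=None, bucket=None, path=None):
    if desc_name is not None:
        # resolve by descriptor name within the experiment
        return io.read_index(desc_name)  # hypothetical helper
    # resolve by remote location (bucket & path), fetched to a local file;
    # signature of get_local_filepath is assumed for illustration
    local = io.get_local_filepath(bucket, path)
    return faiss.read_index(local)
```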
Reviewed By: satymish
Differential Revision: D61813717
fbshipit-source-id: ed422751a1d3712565efa87ecf615620799cb8eb
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3527
**Context**
Design Doc: [Faiss Benchmarking](https://docs.google.com/document/d/1c7zziITa4RD6jZsbG9_yOgyRjWdyueldSPH6QdZzL98/edit)
**In this diff**
1. Codecs and indices can be referenced from the blobstore (bucket & path) outside the experiment.
2. To support #1, naming is moved to the descriptors.
3. The built index can be written out as well.
4. A benchmark can run training, reference the trained codec when building the index, and then reference the built index in kNN search (see the sketch after this list). Index serialization is optional, although not yet exposed through the index descriptor.
5. A benchmark can evaluate an index at different dataset sizes.
6. Working with varying dataset sizes now supports multiple ground truths. Small fixes may be needed before this is usable.
7. Added targets for bench_fw_range, ivf, codecs and optimize.
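A toy, hedged sketch of the train -> build -> search chaining in item 4. The registry, function names, and factory string are illustrative assumptions; the framework stores artifacts in the blobstore under descriptor names rather than in an in-memory dict:
```python
import faiss
import numpy as np

registry = {}  # descriptor name -> artifact; the framework uses the blobstore

def train(name, factory, xt):
    codec = faiss.index_factory(xt.shape[1], factory)
    codec.train(xt)
    registry[name] = codec
    return name

def build(name, codec_name, xb):
    index = faiss.clone_index(registry[codec_name])  # start from the trained codec
    index.add(xb)
    registry[name] = index  # writing it out (faiss.write_index) is optional
    return name

def search(index_name, xq, k=10):
    return registry[index_name].search(xq, k)

rng = np.random.default_rng(123)
xt, xb, xq = (rng.random((n, 32), dtype=np.float32) for n in (10_000, 50_000, 100))
codec_name = train("ivf256_pq8", "IVF256,PQ8", xt)
index_name = build("ivf256_pq8_index", codec_name, xb)
D, I = search(index_name, xq, k=10)
```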
**Analysis of ivf result**: D58823037
Reviewed By: algoriddle
Differential Revision: D57236543
fbshipit-source-id: ad03b28bae937a35f8c20f12e0a5b0a27c34ff3b
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3154
Using the benchmark to find Pareto-optimal indices, with BigANN as an example.
Separately optimize the coarse quantizer and the vector codec and use Pareto optimal configurations to construct IVF indices, which are then retested at various scales. See `optimize()` in `optimize.py` as the main function driving the process.
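As a hedged illustration of that composition step (not the actual `optimize()` driver), a coarse quantizer chosen on its own can be assembled with a separately chosen PQ codec into an IVF index using plain faiss APIs:
```python
import faiss
import numpy as np

d, nlist = 64, 256
xt = np.random.default_rng(0).random((50_000, d), dtype=np.float32)

# Coarse quantizer optimized separately (flat here; could be HNSW).
quantizer = faiss.IndexFlatL2(d)

# IVF index whose per-list codec (PQ: 16 sub-quantizers, 8 bits each)
# was also chosen separately, e.g. from a Pareto-optimal configuration.
index = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)
index.train(xt)
```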
The results can be interpreted with `bench_fw_notebook.ipynb`, which allows:
* filtering by maximum code size
* filtering by maximum time
* filtering by minimum accuracy
* keeping only space- or time-Pareto-optimal options (see the sketch after this list)
* visualizing the results and outputting them as a table
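A minimal sketch of the Pareto filter applied in the notebook (the helper name is ours, not the notebook's code): keep only the (time, accuracy) operating points that no other point dominates.
```python
def pareto_frontier(points):
    """points: list of (search_time, accuracy); lower time and higher
    accuracy are better. Returns the non-dominated subset."""
    frontier = []
    best_acc = float("-inf")
    for t, acc in sorted(points):   # ascending time
        if acc > best_acc:          # strictly better than every faster point
            frontier.append((t, acc))
            best_acc = acc
    return frontier

print(pareto_frontier([(1.0, 0.80), (2.0, 0.78), (3.0, 0.95), (2.5, 0.90)]))
# [(1.0, 0.8), (2.5, 0.9), (3.0, 0.95)]
```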
This version is intentionally limited to IVF(Flat|HNSW),PQ|SQ indices...
Reviewed By: mdouze
Differential Revision: D51781670
fbshipit-source-id: 2c0f800d374ea845255934f519cc28095c00a51f
Summary:
1. Support for index construction parameters outside of the factory string (arbitrary depth of quantizers); see the sketch below.
2. A refactor that provides an index wrapper, a prerequisite for the optimizer, which will generate indices from pre-optimized components (particularly quantizers).
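A hedged sketch of item 1 using plain faiss calls: a construction parameter of a nested quantizer that the factory string cannot express is set after construction (the wrapper in this diff generalizes this to arbitrary quantizer depth):
```python
import faiss

d = 64
index = faiss.index_factory(d, "IVF1024(HNSW32),PQ16")

# Reach one level down: the coarse quantizer is itself an HNSW index whose
# construction parameter efConstruction cannot be written in the factory string.
quantizer = faiss.downcast_index(faiss.extract_index_ivf(index).quantizer)
quantizer.hnsw.efConstruction = 200
```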
Reviewed By: mdouze
Differential Revision: D51427452
fbshipit-source-id: 014d05dd798d856360f2546963e7cad64c2fcaeb
Summary:
Pull Request resolved: https://github.com/facebookresearch/faiss/pull/3097
A framework for evaluating indices offline.
Long term objectives:
1. Generate offline similarity index performance data on test datasets, both for existing indices and for automatically generated alternatives. That is, given a dataset and some constraints, this workflow should automatically discover optimal index types and parameter choices, as well as evaluate the performance of existing production indices and their parameters.
2. Allow researchers, platform owners (Laser, Unicorn) and product teams to understand how different index types perform on their datasets and make optimal choices with respect to their objectives. Longer term, enable automatic decision-making/auto-tuning.
Constraints, design choices:
1. I want to run the same evaluation both Meta-internally (fblearner, data from Hive and Manifold) and on a local machine + research cluster (data on local disk or NFS) via OSS Faiss. Via fblearner, I want this to work in a way that can be turned into a service and plugged into Unicorn or Laser, while the core Faiss part can be used/referred to in our research and to update the wiki with the latest results/recommendations for public datasets.
2. It must support a range of metrics for KNN and range search, and it should be easy to add new ones. Cost metrics need to be fine-grained to allow extrapolation.
3. It should automatically sweep all query-time params (e.g. nprobe, polysemous code hamming distance, params of quantizers), using `OperatingPointsWithRanges` to cut down the optimal param search space. (For now, it sweeps nprobe only; see the sketch after this list.)
4. [FUTURE] It will generate/sweep index creation hyperparams (factory strings, quantizer sizes, quantizer params), using heuristics.
5. [FUTURE] It will sweep the dataset size: start with a small test, e.g. 100K db vectors, and go up to millions, potentially billions, while narrowing down the index+param choices at each step.
6. [FUTURE] Extrapolate perf metrics (cost and accuracy)
7. Intermediate results must be saved (to disk, to manifold) throughout, and reused as much as possible to cut down on overall runtime and enable faster iteration during development.
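A hedged sketch of the query-time sweep in item 3, restricted to nprobe as in the current diff. `faiss.ParameterSpace` is the real faiss helper; the framework's `OperatingPointsWithRanges` additionally prunes the parameter space:
```python
import time
import faiss

def sweep_nprobe(index, xq, k=10, nprobes=(1, 2, 4, 8, 16, 32, 64)):
    """One operating point per nprobe value: (nprobe, search_time, result ids)."""
    ps = faiss.ParameterSpace()
    points = []
    for nprobe in nprobes:
        ps.set_index_parameter(index, "nprobe", nprobe)
        t0 = time.perf_counter()
        _, I = index.search(xq, k)
        points.append((nprobe, time.perf_counter() - t0, I))
    return points
```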
For range search, this diff supports the metric proposed in https://docs.google.com/document/d/1v5OOj7kfsKJ16xzaEHuKQj12Lrb-HlWLa_T2ct0LJiw/edit?usp=sharing. I also added support for the classical case, where the scoring function steps from 1 to 0 at some arbitrary threshold.
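That classical case reduces to a step scoring function, e.g.:
```python
def step_score(distance, threshold):
    # classical range-search scoring: full credit below the threshold, none above
    return 1.0 if distance < threshold else 0.0
```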
For KNN, I added knn_intersection, but other metrics, particularly recall@1, will also be interesting. I also added the distance_ratio metric, which we previously discussed as an interesting alternative, since it shows how closely the returned results approximate the ground-truth nearest neighbours in terms of distances.
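A minimal sketch of knn_intersection, assuming (nq, k) ID matrices for results and ground truth (the helper name and signature are ours); distance_ratio would instead compare result distances to ground-truth distances:
```python
import numpy as np

def knn_intersection(I, I_gt):
    """Mean fraction of the k retrieved ids that appear in the ground truth."""
    nq, k = I.shape
    return np.mean([len(set(I[q]) & set(I_gt[q])) / k for q in range(nq)])
```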
In the test case, I evaluated three current production indices for VCE, with 1M vectors in the database and 10K queries. Each index is tested at various operating points (nprobe values), which are shown on the charts. The results are not extrapolated to the true scale of these indices.
Reviewed By: yonglimeta
Differential Revision: D49958434
fbshipit-source-id: f7f567b299118003955dc9e2d9c5b971e0940fc5