Challenges and Practices of Running the pgvector Extension at Scale

https://simonwillison.net/atom/everything, two days ago, 04:40

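This article examines the challenges of running pgvector, the PostgreSQL vector indexing extension, at scale in real applications. It focuses on the difficulty of maintaining a large index under close-to-realtime updates with the IVFFlat or HNSW index types. The discussion of pre-filtering versus post-filtering is especially important, because that choice strongly affects both query performance and result accuracy. The article also cites the Discourse team's experience running pgvector in large-scale production: how quantization (16-bit float storage and binary vector indexes) keeps storage cost and performance manageable, and how pgvector powers Related Topics, tag suggestions, augmented search, and RAG for uploaded files.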

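💡 **Challenges of maintaining large indexes**: Keeping a large pgvector index built with IVFFlat or HNSW up to date under close-to-realtime updates is difficult, and the difficulty grows with data volume.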

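🚀 **Performance gap between pre-filtering and post-filtering**: For vector search over documents with metadata, whether Postgres filters on the metadata first (pre-filter) or runs the vector search first and then filters (post-filter) strongly affects both response time and result accuracy. The article describes it as the difference between queries that take 50ms and queries that take 5 seconds.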

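📊 **Quantization for storage and performance**: The Discourse team uses quantization extensively, with halfvec (16-bit floats) for storage and bit (binary vectors) for indexes. This makes storage cost and ongoing performance good enough to enable pgvector across all of their hosting.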

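📚 **pgvector in production at Discourse**: Embeddings stored in pgvector power Related Topics recommendations, tag and category suggestions when composing a new topic, augmented search, and retrieval-augmented generation (RAG) for uploaded files. They are leveraged in most of the page views Discourse serves.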

**The case against pgvector** (via). I wasn't keen on the title of this piece but the content is great: Alex Jacobs talks through lessons learned trying to run the popular pgvector PostgreSQL vector indexing extension at scale, in particular the challenges involved in maintaining a large index with close-to-realtime updates using the IVFFlat or HNSW index types.

The section on pre- vs. post-filtering is particularly useful:

> Okay but let's say you solve your index and insert problems. Now you have a document search system with millions of vectors. Documents have metadata---maybe they're marked as draft, published, or archived. A user searches for something, and you only want to return published documents.
>
> [...] should Postgres filter on status first (pre-filter) or do the vector search first and then filter (post-filter)?
>
> This seems like an implementation detail. It’s not. It’s the difference between queries that take 50ms and queries that take 5 seconds. It’s also the difference between returning the most relevant results and… not.
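
To make the accuracy half of that trade-off concrete, here is a minimal brute-force sketch in plain Python. It does not model pgvector or the Postgres planner: the corpus layout, the `make_docs` helper, and the `candidates` parameter are all assumptions made up for illustration. It only shows why filtering after a fixed-size nearest-neighbour fetch can return far fewer than `k` results when matching rows are rare.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def make_docs(n):
    """Hypothetical toy corpus: doc i sits at angle i * 0.001 rad,
    and only every 10th document is published."""
    return [
        {
            "id": i,
            "status": "published" if i % 10 == 0 else "draft",
            "vec": [math.cos(i * 0.001), math.sin(i * 0.001)],
        }
        for i in range(n)
    ]

def pre_filter_search(docs, query, k):
    """Filter on status first, then rank only the surviving documents."""
    published = [d for d in docs if d["status"] == "published"]
    published.sort(key=lambda d: cosine_distance(d["vec"], query))
    return published[:k]

def post_filter_search(docs, query, k, candidates):
    """Fetch the `candidates` nearest documents regardless of status,
    then drop the ones that are not published."""
    nearest = sorted(docs, key=lambda d: cosine_distance(d["vec"], query))
    nearest = nearest[:candidates]
    return [d for d in nearest if d["status"] == "published"][:k]

docs = make_docs(1000)
query = [1.0, 0.0]

pre = [d["id"] for d in pre_filter_search(docs, query, k=10)]
post_small = [d["id"] for d in post_filter_search(docs, query, k=10, candidates=10)]
post_large = [d["id"] for d in post_filter_search(docs, query, k=10, candidates=100)]

print("pre-filter:               ", pre)
print("post-filter, 10 candidates:", post_small)
print("post-filter, 100 candidates:", post_large)
```

In this toy setup the pre-filtered search returns ten published documents, while post-filtering a 10-candidate fetch keeps only one. Raising the candidate count to 100 recovers all ten, at the cost of ranking ten times as many rows, which is roughly the tension the quoted passage describes.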

The Hacker News thread for this article attracted a robust discussion, including some fascinating comments by Discourse developer Rafael dos Santos Silva (xfalcox) about how they are using pgvector at scale:

> We [run pgvector in production] at Discourse, in thousands of databases, and it's leveraged in most of the billions of page views we serve. [...]
>
> Also worth mentioning that we use quantization extensively:
>
> - halfvec (16bit float) for storage
> - bit (binary vectors) for indexes
>
> Which makes the storage cost and on-going performance good enough that we could enable this in all our hosting. [...]
>
> In Discourse embeddings power:
>
> - Related Topics, a list of topics to read next, which uses embeddings of the current topic as the key to search for similar ones
> - Suggesting tags and categories when composing a new topic
> - Augmented search
> - RAG for uploaded files
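
The quantization idea in that comment can be sketched in plain Python. This is not Discourse's code and does not call pgvector: the function names are hypothetical, the sign-based binarization is one common scheme, and the shortlist-then-rerank step is an assumption about how a binary index is typically paired with higher-precision stored vectors, not something the comment states.

```python
import struct

def float32_bytes(vec):
    """Raw payload of a vector stored as 32-bit floats (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def float16_bytes(vec):
    """Raw payload as 16-bit floats (2 bytes per dimension), the idea behind halfvec."""
    return struct.pack(f"<{len(vec)}e", *vec)

def binary_quantize(vec):
    """One bit per dimension: 1 if the component is positive, else 0.
    The bits are packed into a Python int, first dimension in the highest bit."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming_distance(a, b):
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")

def search(vectors, query, k, shortlist):
    """Assumed two-stage search: shortlist by Hamming distance on the binary
    codes, then rerank the shortlist by inner product on the full vectors."""
    q_bits = binary_quantize(query)
    coarse = sorted(
        range(len(vectors)),
        key=lambda i: hamming_distance(binary_quantize(vectors[i]), q_bits),
    )[:shortlist]
    reranked = sorted(
        coarse,
        key=lambda i: -sum(x * y for x, y in zip(vectors[i], query)),
    )
    return reranked[:k]

vec = [0.5, -1.25, 2.0, -0.125]
print(len(float32_bytes(vec)), "bytes as float32")
print(len(float16_bytes(vec)), "bytes as float16")
print(format(binary_quantize(vec), "04b"), "as a 4-bit binary code")

vectors = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [2, 2, 2, 2]]
print(search(vectors, [1, 1, 1, 1], k=2, shortlist=3))
```

Counting payload only and ignoring any per-value header, a vector shrinks from 4 bytes per dimension to 2 bytes per dimension as 16-bit floats, and to 1 bit per dimension as a binary code, which is the kind of saving that makes it plausible to enable the feature across every hosted database.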
