fix(core): L2-normalize FastEmbed vectors to satisfy unit-vector contract#843
Conversation
…ract sqlite_search_repository uses the formula `similarity = 1 - distance²/2` which is only correct for unit-normalized vectors (see comment on lines 65-67 of sqlite_search_repository.py). Models such as paraphrase-multilingual-MiniLM-L12-v2 return vectors with norm ≈ 2.9, causing cosine similarity scores to collapse near 0 for every query. Apply L2 normalization in `_embed_batch()` after converting to Python floats, so the contract holds regardless of which model is configured. Zero vectors are returned as-is to avoid division errors. Adds two unit tests: one that asserts the output norm is 1.0 for an unnormalized stub model, and one that verifies no exception is raised for zero vectors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: tk-pkm111 <133480534+tk-pkm111@users.noreply.github.com>
|
Hi @tk-pkm111 — thanks for the careful fix and the clear reproduction. The change looks solid: the unit-norm contract on The one blocker before we can merge is the CLA — the bot still shows it as unsigned for this PR. Could you sign it via the link in CLAassistant's comment above? Once that's in we can move toward landing this. |
Bring the LiteLLM provider in line with the unit-norm contract from sqlite_search_repository.py (lines 65-67): the cosine-similarity formula `1 - L²/2` is correct only for unit-normalized vectors. LiteLLM routes to many backends (Cohere, Vertex, Bedrock, etc.) that do not return normalized embeddings, so normalize at the provider boundary — same fix shape as the parallel FastEmbed change in basicmachines-co#843. Also align the response handling with OpenAIEmbeddingProvider: - attribute access on response items (item.index / item.embedding) - explicit duplicate-index guard Tests cover the three behaviors directly (unit norm, zero-vector pass-through, duplicate-index error) and the existing ordering test now reconstructs the expected normalized vectors so a normalization regression would be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
|
CLA is now signed — could you merge this when you get a chance? |
phernandez
left a comment
There was a problem hiding this comment.
Reviewed the FastEmbed normalization change against the sqlite_search_repository unit-vector contract. No blockers found.\n\nLocal verification on PR head 4ef062b:\n- uv run pytest tests/repository/test_fastembed_provider.py -> 8 passed\n- uv run ruff check src/basic_memory/repository/fastembed_provider.py tests/repository/test_fastembed_provider.py -> passed\n\nNote: the repo's full Tests workflow is push-only and did not run for this external PR branch, so this review is based on focused local verification rather than the full CI matrix.
|
Hi @phernandez — thanks again for the thorough review and approval! Just wanted to follow up: I've heard from other users who are running into the same normalization issue with multilingual models, so there's real-world impact beyond my own setup. Would you be able to merge this when you get a chance? No rush if you're in the middle of something — just wanted to make sure it stays on your radar. Thanks so much for your time! |
|
Final maintainer verification before merge:
The full Tests workflow is push-only and does not run automatically on this external fork branch, but this is a focused two-file FastEmbed provider fix with direct coverage for the unit-vector contract. Merging with the repo’s usual squash style now. Thanks again @tk-pkm111. |
|
@tk-pkm111 Hey, just merged this PR, so if you have a chance to test, that would be appreciated. Thanks. |
|
@phernandez — Thank you for merging! I'm ready to test right away. Could you cut a new release when you get a chance so I can install it and report back? |
Summary
sqlite_search_repository.pyconverts vector distance to similarity withsimilarity = 1 - distance²/2, a formula that assumes unit-normalized vectors (the requirement is explicitly stated in comments on lines 65-67 of that file).paraphrase-multilingual-MiniLM-L12-v2— return vectors with norm ≈ 2.9 rather than 1.0. This causes every similarity score to collapse near 0.0, making vector/hybrid search effectively non-functional for those models._embed_batch()after converting raw model output to Pythonfloats. Zero vectors are returned as-is to prevent division errors.Root cause
FastEmbed's default English model (
bge-small-en-v1.5) happens to return unit-normalized vectors, so the bug is invisible with the default config. Switching to any model that does not self-normalize (e.g., the multilingual MiniLM family) exposed the contract violation.Changes
src/basic_memory/repository/fastembed_provider.pyimport math; L2-normalize each vector in_embed_batch()tests/repository/test_fastembed_provider.pyTest plan
uv run --python 3.12 pytest tests/repository/test_fastembed_provider.py -vtest_fastembed_provider_l2_normalizes_output_vectorsconfirms output norm is 1.0 ± 1e-6 for a stub model that returns norm ≈ 2.9test_fastembed_provider_zero_vector_does_not_raiseconfirms noZeroDivisionErroron zero vector🤖 Generated with Claude Code