Remap heap graph ordinals on serialization#618
Conversation
Add the option to avoid ordinal holes in the serialized heap graph.
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
|
Does this pose a risk for stable ordinal mapping with external system which retain their own ordinal maps? We need to identify how the through line from non-vector data remains connected to the associated vectors/embeddings in jvector. If this is already handled effectively here, we should make it very clear in documentation with examples. |
|
Short answer: no, since the client application needs to opt-in to this by calling either Longer answer: Say the application creates a graph with some deleted ordinals. Then it does something like this: OnHeapGraphIndex graph = ...;
var writer = new GraphIndexWriter.Builder(graph, ...) (...) .build(); // no OrdinalMapper supplied
writer.write(...);
graph.save(...);This saves two files. One corresponding to the Additionally, since the heap graph serialization process does not provide an option to pass in a mapper, the application has no choice but to live with the ordinal mismatch and compensate for it by introducing additional code complexity. This PR doesn't attempt to change the disk graph's existing philosophy of ordinal re-numbering during write, it simply adds the same functionality to the heap graph serialization. The key difference is that the user has to "opt-in" to ordinal compaction to prevent nasty surprises in existing code. While I agree that the whole ordinal mapping system could do with more thorough documentation, that's not within the scope of this PR and should be handled separately. |
| public class OnHeapGraphIndex implements MutableGraphIndex { | ||
| // Used for saving and loading OnHeapGraphIndex | ||
| public static final int MAGIC = 0x75EC4012; // JVECTOR, with some imagination | ||
| public static final int SERIALIZE_VERSION = 4; |
There was a problem hiding this comment.
If this is JVector version, consider writing out the entire version (e.g. 4.0.0-rc4)?
MarkWolters
left a comment
There was a problem hiding this comment.
I think the version needs to be updated. If it's intentionally being set to a value < the current version used elsewhere explain why and I'll approve once that's settled.
|
|
||
| out.writeInt(OnHeapGraphIndex.MAGIC); // the magic number | ||
| out.writeInt(4); // The version | ||
| out.writeInt(SERIALIZE_VERSION); // The version |
There was a problem hiding this comment.
Our versioning is frankly a mess, and needs to be refactored completely. That said, we are currently up to version 6 , so this may need to be updated here as well.
| int version = in.readInt(); // The version | ||
| if (version != 4) { | ||
| throw new IOException("Unsupported version: " + version); | ||
| if (version != SERIALIZE_VERSION) { |
| * Does not alter existing ordinals even if the ordinals are not compact, | ||
| * which can happen if some nodes were deleted. | ||
| */ | ||
| @Experimental |
There was a problem hiding this comment.
I know this is neither the point of this PR, nor was it introduced by this PR but these @experimental and @deprecated annotations have been here since 10/10/25. It's nonsensical to combine the 2 annotations to begin with, and based on the fact that the code in question is still in use I would argue that it is neither experimental nor deprecated at this point. Can we just get rid of them?
There is currently no way to alter the ordinals of the vectors in an
OnHeapGraphIndex. This limitation extends to thesaveandloadmethods used for serializing and de-serializing the heap graph.In particular, this means that any ordinal "holes" created by deleting nodes from the
OnHeapGraphIndexcan never be compacted away. This is different from the behavior of the disk graph writer, which not only provides a way to remap ordinals, it automatically compacts ordinals by default.This PR adds the option to remap ordinals while serializing the graph, preventing the accumulation of ordinal holes.
Note that this is only useful for those systems which rely on repeatedly loading, mutating, and then saving the
OnHeapGraphIndex, which is arguably an anti-pattern.