Skip to content

Save hash and path grid informations when generating FK Tables#247

Open
kamillaurent wants to merge 4 commits into
mainfrom
hash_theory
Open

Save hash and path grid informations when generating FK Tables#247
kamillaurent wants to merge 4 commits into
mainfrom
hash_theory

Conversation

@kamillaurent
Copy link
Copy Markdown

@kamillaurent kamillaurent commented Apr 28, 2026

When a FK Table is generated, the hash information and path of the pineappl grid used for it are saved in the metadata of the FK Table.

These information can be read using the pinealppl-cli, for example:

pineappl read --show ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4

outputs, between the other informations, this lines:

grid_files: {"ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl": {"hash": "f1110fd0bb8ccddf28bf722f9f9c3dd5", "path": "/data/theorie/klaurent/FKTables/pineko/data/grids/41000000/ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4"}}

This PR is related to issue #225

Copy link
Copy Markdown
Member

@scarlehoff scarlehoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
(haven't tested it, but I trust the OP is a real-life test)

Comment thread src/pineko/evolve.py Outdated
if grid_path is not None:
grid_path_obj = pathlib.Path(grid_path)
grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "path": str(grid_path_obj.resolve())}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "path": str(grid_path_obj.resolve())}}
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "theory_folder": grid_path_obj.parent.name}}

Out of privacy concerns, I would just store the path starting from the theory_slim folder.

My suggestion is, since you have the grid name as the key, save just the theory folder name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed on the privacy, but I think I would keep the resolve(), because I think this way we can resolve symbolic links, right? i.e. .resolve().parent (or similar)

@scarlehoff
Copy link
Copy Markdown
Member

Hi @kamillaurent please do the small change requested and fix the issues found by pre-commit (usually just means you didn't run pre-commit, if you do the changes then git add the evolve.py file and run pre-commit it will be fixed.

Copy link
Copy Markdown
Contributor

@felixhekhorn felixhekhorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember the other things we discussed:

  • split in two simple keys "grid_path" and "grid_hash"
  • address FONLL magic

Comment thread src/pineko/evolve.py Outdated
if grid_path is not None:
grid_path_obj = pathlib.Path(grid_path)
grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "path": str(grid_path_obj.resolve())}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed on the privacy, but I think I would keep the resolve(), because I think this way we can resolve symbolic links, right? i.e. .resolve().parent (or similar)

@scarlehoff
Copy link
Copy Markdown
Member

@kamillaurent please address the changes and run pre-commit so that we can merge this

@scarlehoff
Copy link
Copy Markdown
Member

There's a few conflicts due to having merge @Radonirinaunimi's PR but they are in the documentation of the functions so it should be quick to fix. It would be good to have and use this PR for the next batch of fk tables so we have this information in 😅

Kamil Laurent added 2 commits May 26, 2026 09:57
@kamillaurent
Copy link
Copy Markdown
Author

Now I made the informations in the metadata more readable. For example, using the command
pineappl read --show ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4 one can find the informations:

...
grid_hash: f1110fd0bb8ccddf28bf722f9f9c3dd5
grid_path: /data/grids/41000000/ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4
grid_theory: 41000000
...

When we use the FONLL procedure, in the last step many FK-tables get merged into one. I am working on retaining the info about hash and theory of the original FKs into the merged one.

Comment thread src/pineko/evolve.py Outdated
Comment on lines +117 to +131
if grid_path is not None:
grid_path_obj = pathlib.Path(grid_path).resolve()
grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
grid_path_parts = grid_path_obj.parts
if "pineko" in grid_path_parts:
pineko_idx = grid_path_parts.index("pineko")
display_path = str(pathlib.Path("/", *grid_path_parts[pineko_idx + 1 :]))
elif "data" in grid_path_parts:
data_idx = grid_path_parts.index("data")
display_path = str(pathlib.Path("/", *grid_path_parts[data_idx:]))
else:
display_path = grid_path_obj.name
fktable.set_metadata("grid_hash", grid_hash)
fktable.set_metadata("grid_theory", grid_path_obj.parent.name)
fktable.set_metadata("grid_path", display_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This is far too much code duplication - this needs to be encapsulated
  2. do we really need grid_path? this looks fairly complicated ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the grids path part is a bit more complicated than needed, data/grids are not important.

Once you have the grid path, the "path of the grid itself" is grid_path.name and the theory folder grid_path.parent.name. You don't need anything else.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to simplify the metadata, using a function and calling it when the FK table is constructed (so there is no code duplication):

def set_fktable_metadata(fktable, theory_meta, grid_path=None):
    """Store the common FK metadata and, if available, grid provenance."""
    fktable.set_metadata("pineko_version", version.__version__)
    fktable.set_metadata("theory_card", json.dumps(theory_meta))
    if grid_path is None:
        return

    grid_path_obj = pathlib.Path(grid_path).resolve()
    grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
    fktable.set_metadata("grid_hash", grid_hash)
    fktable.set_metadata("grid_theory", grid_path_obj.parent.name)
    fktable.set_metadata("grid_name", grid_path_obj.name)

For consistency I moved into the function also pineko_version and theory_card. I removed the path since the relevant information is the name of the folder, which is also the theory ID.

@kamillaurent
Copy link
Copy Markdown
Author

@scarlehoff regarding the FONLL procedure, fonll.py merges all the FK-Tables into the $THEORY_ID_00 and retains informations only from this one. For example, pineappl read --show 41000000/HERA_NC_225GEV_EP_SIGMARED.pineappl.lz4 gives

...
eko_version: 0.15.3
fk_assumptions: Nf4Ind
fktable_id: HERACOMBNCEP460
grid_hash: 0a34218e5e6b254475243264d1ced40e
grid_name: HERA_NC_225GEV_EP_SIGMARED.pineappl.lz4
grid_theory: 4100000000
hepdata: 10.17182/hepdata.68951.v1/t4
nnpdf_id: HERACOMBNCEP460
...

Is this sufficient or we want informations about all the merged grids? In the case we want all the merged grids, the grid_name will be the same for all. The grid_hash and grid_theory will be different, should we store them into an array and store the array in the metadata of the merged FK Table?

@scarlehoff
Copy link
Copy Markdown
Member

I would say that grid theory should strip the extra 0s (so it should say just 41000000, since that covers all).
Grid name is fine.

And my preference for the hash would be simply a string of

00-0a34218e5e6b254475243264d1ced40e 01-hash 02-hash 03-...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants