Skip to content

Update normalization parameters and add estimator params validation#210

Open
avolkov-intel wants to merge 10 commits into
IntelPython:mainfrom
avolkov-intel:dev/anatolyv-normalize-fix
Open

Update normalization parameters and add estimator params validation#210
avolkov-intel wants to merge 10 commits into
IntelPython:mainfrom
avolkov-intel:dev/anatolyv-normalize-fix

Conversation

@avolkov-intel
Copy link
Copy Markdown
Collaborator

@avolkov-intel avolkov-intel commented May 6, 2026

Description

  • Add different normalization options
  • Remove implicit normalization from loaders
  • Add estimator parameters validation

Estimator parameters validation can be useful in case you want to override some parameter like n_jobs using -p algorithm:estimator_params:n_jobs=64 for the benchmarks. Currently it would fail since some estimators don't support n_jobs and you would need to run the benchmarks separately. In this approach this parameter will be simply ignored by estimators that don't support it and warning will be shown.


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

@avolkov-intel avolkov-intel changed the title Update normalization parameters Update normalization parameters and add estimator params validation May 6, 2026
@avolkov-intel avolkov-intel added bug Something isn't working extend Extend benchmarks labels May 6, 2026
Comment thread configs/regular/logreg.json Outdated
"data": {
"dataset": "hepmass",
"split_kwargs": { "train_size": 0.1, "test_size": null }
"split_kwargs": { "train_size": 0.1, "test_size": null },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is the only case where benchmark behavior changes - is it intended?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was done for a reason but let me check the convergence for both options

Copy link
Copy Markdown
Collaborator Author

@avolkov-intel avolkov-intel May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, there's no effect on the result: accuracy and number of iterations stays the same

Comment thread configs/weekly/dbscan.json Outdated
Comment on lines +7 to +8
"dataset" : "cifar",
"split_kwargs": { "ignore" : true },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"dataset" : "cifar",
"split_kwargs": { "ignore" : true },
"dataset": "cifar",
"split_kwargs": { "ignore": true },

Comment thread configs/regular/svm.json Outdated
Comment thread configs/common/knn.json
Comment thread requirements.txt
Comment thread sklbench/datasets/common.py Outdated
Copy link
Copy Markdown
Contributor

@ethanglaser ethanglaser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once David's comments are resolved. We should add a json linter step to CI and unify formatting in sklbench configs.

Copy link
Copy Markdown
Contributor

@david-cortes-intel david-cortes-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but could also change array.values to array.to_numpy().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working extend Extend benchmarks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants