@jwa7
Member

@jwa7 jwa7 commented Oct 20, 2025

The way that num_epochs is handled is currently broken. For example, say I run training with num_epochs = 1000, but the job runs out of time after 600 epochs. Restarting from the checkpoint at epoch 600 and keeping num_epochs = 1000 results in a total of 1600 epochs being run. This also messes up the cosine scheduling, which 'resets' after epoch 1000 and starts increasing again.
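
To make the scheduling issue concrete, here is a minimal sketch using the closed-form cosine annealing formula. The function name `cosine_lr` and the values `base_lr = 1.0` and `min_lr = 0.0` are illustrative assumptions, not the actual metatrain scheduler. It shows that once the epoch counter runs past num_epochs, the learning rate starts climbing again.

```python
import math


def cosine_lr(epoch: int, num_epochs: int, base_lr: float = 1.0, min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing, evaluated at an arbitrary epoch."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / num_epochs))


num_epochs = 1000

# Within the intended schedule the learning rate decays monotonically.
lr_at_600 = cosine_lr(600, num_epochs)
lr_at_1000 = cosine_lr(1000, num_epochs)

# If a restart at epoch 600 runs another 1000 epochs, the counter reaches 1600
# and the cosine is evaluated beyond its half-period: the rate goes back up.
lr_at_1600 = cosine_lr(1600, num_epochs)
```

In this toy setup the rate at epoch 1600 ends up higher than it was at epoch 600, which is the opposite of what a decaying schedule is meant to do.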

This PR fixes these issues, depending on the context:

  • restart -> the epoch attribute of the Trainer, when loaded from checkpoint, is the epoch at which the checkpoint was saved (i.e. 600), and training runs up to the max epoch number (i.e. 1000), then stops.
  • finetune -> the epoch attribute of the Trainer resets to 0 when loaded from checkpoint, and training runs from 0 to num_epochs.
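
The two behaviours above can be sketched as a small helper that picks the epoch range still to be run. The function `epoch_range` and its arguments are hypothetical names used for illustration only; the actual Trainer code in this PR may be organised differently, and whether the first resumed epoch is the stored one or the next one depends on how the checkpoint counts epochs.

```python
def epoch_range(checkpoint_epoch: int, num_epochs: int, context: str) -> range:
    """Return the epochs still to be trained after loading a checkpoint.

    ``context`` is either "restart" (continue an interrupted run) or
    "finetune" (start a fresh schedule from the loaded weights).
    """
    if context == "restart":
        # Continue from the stored epoch and stop at the configured maximum.
        start = checkpoint_epoch
    elif context == "finetune":
        # Ignore the stored epoch: fine-tuning runs a full new schedule.
        start = 0
    else:
        raise ValueError(f"unknown context: {context!r}")
    return range(start, num_epochs)


restart_epochs = epoch_range(checkpoint_epoch=600, num_epochs=1000, context="restart")
finetune_epochs = epoch_range(checkpoint_epoch=600, num_epochs=1000, context="finetune")
```

With a checkpoint at epoch 600 and num_epochs = 1000, the restart case leaves 400 epochs to run and the finetune case leaves the full 1000.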

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

Maintainer/Reviewer checklist

  • CHANGELOG updated with public API or any other important changes?
  • GPU tests passed (maintainer comment: "cscs-ci run")?

📚 Documentation preview 📚: https://metatrain--845.org.readthedocs.build/en/845/

@jwa7
Member Author

jwa7 commented Oct 21, 2025

cscs_ci run

@frostedoyster
Collaborator

cscs-ci run

@frostedoyster frostedoyster enabled auto-merge (squash) October 22, 2025 05:06
@frostedoyster
Collaborator

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

@jwa7 jwa7 disabled auto-merge October 22, 2025 10:58
@cesaremalosso
Contributor

cesaremalosso commented Oct 23, 2025

May I add a point... there's also a problem when you restart a finetuning run. The training restarts correctly from the model checkpoint, but the LR resets and goes back to the initial one...

For example, this was the last epoch logged by an interrupted finetuning run:

[2025-10-23 04:42:22][INFO] - Epoch:  298 | learning rate: 8.041e-05 | training loss: 5.060e+00 | training energy RMSE (per atom): 171.55 meV | training energy MAE (per atom): 53.823 meV | training forces RMSE:  622.8 meV/A | training forces MAE:  96.89 meV/A | training virial RMSE (per atom):  258.3 meV | training virial MAE (per atom):  42.35 meV | training non_conservative_forces RMSE (per atom): 144.47 meV/A | training non_conservative_forces MAE (per atom): 75.487 meV/A | training non_conservative_stress RMSE: 3.7582 meV/A^3 | training non_conservative_stress MAE: 1.6637 meV/A^3 | validation loss: 1.355e+00 | validation energy RMSE (per atom): 197.24 meV | validation energy MAE (per atom): 69.303 meV | validation forces RMSE:   384.9 meV/A | validation forces MAE:  76.58 meV/A | validation virial RMSE (per atom):  546.0 meV | validation virial MAE (per atom):  95.06 meV | validation non_conservative_forces RMSE (per atom): 179.58 meV/A | validation non_conservative_forces MAE (per atom):  84.34 meV/A | validation non_conservative_stress RMSE: 4.6708 meV/A^3 | validation non_conservative_stress MAE: 1.7440 meV/A^3

And if I restart it with: mtt train options_restart.yaml --restart outputs/2025-10-22/04-43-43/model_298.ckpt, it starts again with the warm-up, etc.:

[2025-10-23 13:56:27][INFO] - Epoch:  299 | learning rate: 1.000e-05 | training loss: 4.495e+00 | training energy RMSE (per atom): 170.54 meV | training energy MAE (per atom): 48.852 meV | training forces RMSE: 416.96 meV/A | training forces MAE: 54.963 meV/A | training virial RMSE (per atom): 420.24 meV | training virial MAE (per atom): 55.639 meV | training non_conservative_forces RMSE (per atom): 136.45 meV/A | training non_conservative_forces MAE (per atom): 69.923 meV/A | training non_conservative_stress RMSE: 3.6618 meV/A^3 | training non_conservative_stress MAE: 1.5968 meV/A^3 | validation loss: 1.317e+00 | validation energy RMSE (per atom): 197.28 meV | validation energy MAE (per atom): 66.730 meV | validation forces RMSE: 383.38 meV/A | validation forces MAE: 71.605 meV/A | validation virial RMSE (per atom): 567.69 meV | validation virial MAE (per atom): 93.602 meV | validation non_conservative_forces RMSE (per atom): 174.05 meV/A | validation non_conservative_forces MAE (per atom): 79.797 meV/A | validation non_conservative_stress RMSE: 4.5177 meV/A^3 | validation non_conservative_stress MAE: 1.6749 meV/A^3
[2025-10-23 14:01:12][INFO] - Epoch:  300 | learning rate: 2.000e-05 | training loss: 4.294e+00 | training energy RMSE (per atom): 170.49 meV | training energy MAE (per atom): 48.287 meV | training forces RMSE: 335.67 meV/A | training forces MAE: 42.394 meV/A | training virial RMSE (per atom): 512.40 meV | training virial MAE (per atom): 68.200 meV | training non_conservative_forces RMSE (per atom): 134.56 meV/A | training non_conservative_forces MAE (per atom): 69.049 meV/A | training non_conservative_stress RMSE: 3.6257 meV/A^3 | training non_conservative_stress MAE: 1.5812 meV/A^3 | validation loss: 1.317e+00 | validation energy RMSE (per atom): 197.53 meV | validation energy MAE (per atom): 67.144 meV | validation forces RMSE: 403.73 meV/A | validation forces MAE: 72.994 meV/A | validation virial RMSE (per atom): 542.08 meV | validation virial MAE (per atom): 92.326 meV | validation non_conservative_forces RMSE (per atom): 174.21 meV/A | validation non_conservative_forces MAE (per atom): 80.250 meV/A | validation non_conservative_stress RMSE: 4.5558 meV/A^3 | validation non_conservative_stress MAE: 1.6786 meV/A^3
[2025-10-23 14:05:56][INFO] - Epoch:  301 | learning rate: 3.000e-05 | training loss: 4.280e+00 | training energy RMSE (per atom): 170.49 meV | training energy MAE (per atom): 48.651 meV | training forces RMSE: 343.55 meV/A | training forces MAE: 42.557 meV/A | training virial RMSE (per atom): 504.71 meV | training virial MAE (per atom): 67.709 meV | training non_conservative_forces RMSE (per atom): 134.71 meV/A | training non_conservative_forces MAE (per atom): 69.403 meV/A | training non_conservative_stress RMSE: 3.6218 meV/A^3 | training non_conservative_stress MAE: 1.5841 meV/A^3 | validation loss: 1.324e+00 | validation energy RMSE (per atom): 197.02 meV | validation energy MAE (per atom): 67.744 meV | validation forces RMSE: 458.58 meV/A | validation forces MAE: 80.907 meV/A | validation virial RMSE (per atom): 475.74 meV | validation virial MAE (per atom): 82.243 meV | validation non_conservative_forces RMSE (per atom): 174.64 meV/A | validation non_conservative_forces MAE (per atom): 81.251 meV/A | validation non_conservative_stress RMSE: 4.5987 meV/A^3 | validation non_conservative_stress MAE: 1.6874 meV/A^3
[2025-10-23 14:10:40][INFO] - Epoch:  302 | learning rate: 4.000e-05 | training loss: 4.366e+00 | training energy RMSE (per atom): 170.49 meV | training energy MAE (per atom): 49.436 meV | training forces RMSE: 337.55 meV/A | training forces MAE: 43.380 meV/A | training virial RMSE (per atom): 510.38 meV | training virial MAE (per atom): 68.950 meV | training non_conservative_forces RMSE (per atom): 136.11 meV/A | training non_conservative_forces MAE (per atom): 70.397 meV/A | training non_conservative_stress RMSE: 3.6243 meV/A^3 | training non_conservative_stress MAE: 1.6005 meV/A^3 | validation loss: 1.350e+00 | validation energy RMSE (per atom): 198.99 meV | validation energy MAE (per atom): 71.101 meV | validation forces RMSE: 549.81 meV/A | validation forces MAE: 93.984 meV/A | validation virial RMSE (per atom): 406.02 meV | validation virial MAE (per atom): 75.842 meV | validation non_conservative_forces RMSE (per atom): 176.73 meV/A | validation non_conservative_forces MAE (per atom): 83.411 meV/A | validation non_conservative_stress RMSE: 4.6374 meV/A^3 | validation non_conservative_stress MAE: 1.7158 meV/A^3
[2025-10-23 14:15:23][INFO] - Epoch:  303 | learning rate: 5.000e-05 | training loss: 4.501e+00 | training energy RMSE (per atom): 170.75 meV | training energy MAE (per atom): 50.604 meV | training forces RMSE: 423.63 meV/A | training forces MAE: 55.782 meV/A | training virial RMSE (per atom): 414.68 meV | training virial MAE (per atom): 55.447 meV | training non_conservative_forces RMSE (per atom): 137.86 meV/A | training non_conservative_forces MAE (per atom): 71.577 meV/A | training non_conservative_stress RMSE: 3.6764 meV/A^3 | training non_conservative_stress MAE: 1.6147 meV/A^3 | validation loss: 1.341e+00 | validation energy RMSE (per atom): 197.98 meV | validation energy MAE (per atom): 68.589 meV | validation forces RMSE: 471.03 meV/A | validation forces MAE: 84.136 meV/A | validation virial RMSE (per atom): 457.35 meV | validation virial MAE (per atom): 82.413 meV | validation non_conservative_forces RMSE (per atom): 177.23 meV/A | validation non_conservative_forces MAE (per atom): 82.902 meV/A | validation non_conservative_stress RMSE: 4.5530 meV/A^3 | validation non_conservative_stress MAE: 1.6940 meV/A^3
[2025-10-23 14:20:06][INFO] - Epoch:  304 | learning rate: 6.000e-05 | training loss: 4.645e+00 | training energy RMSE (per atom): 170.95 meV | training energy MAE (per atom): 51.450 meV | training forces RMSE: 270.77 meV/A | training forces MAE: 38.802 meV/A | training virial RMSE (per atom): 609.81 meV | training virial MAE (per atom): 91.949 meV | training non_conservative_forces RMSE (per atom): 139.35 meV/A | training non_conservative_forces MAE (per atom): 72.574 meV/A | training non_conservative_stress RMSE: 3.6777 meV/A^3 | training non_conservative_stress MAE: 1.6277 meV/A^3 | validation loss: 1.370e+00 | validation energy RMSE (per atom): 198.07 meV | validation energy MAE (per atom): 70.612 meV | validation forces RMSE: 402.02 meV/A | validation forces MAE: 76.569 meV/A | validation virial RMSE (per atom): 565.93 meV | validation virial MAE (per atom): 96.267 meV | validation non_conservative_forces RMSE (per atom): 178.56 meV/A | validation non_conservative_forces MAE (per atom): 84.087 meV/A | validation non_conservative_stress RMSE: 4.6424 meV/A^3 | validation non_conservative_stress MAE: 1.7046 meV/A^3
[2025-10-23 14:24:49][INFO] - Epoch:  305 | learning rate: 7.000e-05 | training loss: 4.917e+00 | training energy RMSE (per atom): 171.00 meV | training energy MAE (per atom): 52.896 meV | training forces RMSE: 190.59 meV/A | training forces MAE: 35.255 meV/A | training virial RMSE (per atom): 751.25 meV | training virial MAE (per atom): 127.619 meV | training non_conservative_forces RMSE (per atom): 141.94 meV/A | training non_conservative_forces MAE (per atom): 74.304 meV/A | training non_conservative_stress RMSE: 3.7463 meV/A^3 | training non_conservative_stress MAE: 1.6476 meV/A^3 | validation loss: 1.346e+00 | validation energy RMSE (per atom): 198.44 meV | validation energy MAE (per atom): 69.522 meV | validation forces RMSE: 529.78 meV/A | validation forces MAE: 94.333 meV/A | validation virial RMSE (per atom): 388.73 meV | validation virial MAE (per atom): 77.196 meV | validation non_conservative_forces RMSE (per atom): 179.86 meV/A | validation non_conservative_forces MAE (per atom): 84.446 meV/A | validation non_conservative_stress RMSE: 4.6206 meV/A^3 | validation non_conservative_stress MAE: 1.7248 meV/A^3
[2025-10-23 14:29:33][INFO] - Epoch:  306 | learning rate: 8.000e-05 | training loss: 5.116e+00 | training energy RMSE (per atom): 171.18 meV | training energy MAE (per atom): 54.190 meV | training forces RMSE: 335.42 meV/A | training forces MAE: 50.667 meV/A | training virial RMSE (per atom): 513.58 meV | training virial MAE (per atom): 76.244 meV | training non_conservative_forces RMSE (per atom): 144.52 meV/A | training non_conservative_forces MAE (per atom): 75.925 meV/A | training non_conservative_stress RMSE: 3.7998 meV/A^3 | training non_conservative_stress MAE: 1.6714 meV/A^3 | validation loss: 1.403e+00 | validation energy RMSE (per atom): 201.20 meV | validation energy MAE (per atom): 79.121 meV | validation forces RMSE: 387.85 meV/A | validation forces MAE: 78.384 meV/A | validation virial RMSE (per atom): 540.25 meV | validation virial MAE (per atom): 98.269 meV | validation non_conservative_forces RMSE (per atom): 182.71 meV/A | validation non_conservative_forces MAE (per atom): 87.862 meV/A | validation non_conservative_stress RMSE: 4.7495 meV/A^3 | validation non_conservative_stress MAE: 1.8021 meV/A^3
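
One plausible reading of these logs is that the scheduler state is rebuilt from scratch on restart instead of being restored from the checkpoint. The sketch below uses a hypothetical `WarmupScheduler` class with a made-up warm-up of 1e-5 per epoch up to a peak of 1e-4 to illustrate the difference. It is not the metatrain implementation, and the real cause may differ.

```python
class WarmupScheduler:
    """Toy linear warm-up: the rate grows by ``step`` per epoch up to ``peak``."""

    def __init__(self, peak: float = 1e-4, step: float = 1e-5) -> None:
        self.peak = peak
        self.step = step
        self.epochs_done = 0

    def advance(self) -> float:
        """Move one epoch forward and return the learning rate for it."""
        self.epochs_done += 1
        return min(self.peak, self.step * self.epochs_done)

    def state_dict(self) -> dict:
        return {"epochs_done": self.epochs_done}

    def load_state_dict(self, state: dict) -> None:
        self.epochs_done = state["epochs_done"]


# First run: train long enough to leave the warm-up phase, then "checkpoint".
scheduler = WarmupScheduler()
for _ in range(20):
    scheduler.advance()
saved_state = scheduler.state_dict()

# Restart without restoring the state: the warm-up starts over.
fresh = WarmupScheduler()
lr_without_restore = fresh.advance()

# Restart with the saved state: the schedule continues where it stopped.
restored = WarmupScheduler()
restored.load_state_dict(saved_state)
lr_with_restore = restored.advance()
```

PyTorch optimizers and learning-rate schedulers expose `state_dict()` and `load_state_dict()` for this kind of round trip, so a fix along these lines would presumably store and reload both alongside the model weights.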

@abmazitov
Contributor

I think I have an idea how to fix it, let me add some changes

@abmazitov
Contributor

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

Contributor

@abmazitov abmazitov left a comment

LGTM

@frostedoyster
Collaborator

cscs-ci run

@frostedoyster frostedoyster enabled auto-merge (squash) October 31, 2025 16:32
@frostedoyster
Collaborator

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

@Luthaf
Member

Luthaf commented Nov 3, 2025

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

@frostedoyster
Collaborator

cscs-ci run

@jwa7
Member Author

jwa7 commented Nov 11, 2025

Maybe we need a 5-minute chat in person about this one, @abmazitov @frostedoyster, to sort it out 😅

@frostedoyster
Collaborator

@jwa7 I think @abmazitov and I would rather keep typing "cscs-ci run" until they pass

@frostedoyster
Collaborator

cscs-ci run

@Luthaf
Member

Luthaf commented Nov 18, 2025

The error in the distributed tests looks a lot like what I had to fix for #922, but it is weird that it did not show up there. It might just be a checkpoint update that is not doing what it needs to.

The tox test error looks relevant (trying to use matmul with int/float).

7 participants