Checkpoints - what, when, how and why?

@robertl seemed to imply in this thread that, when training is completed, PerceptiLabs tries to delete checkpoint files (presumably before writing a new one; but even with this approach it would be safer to write the new one first, and only delete the old one after the new file has been written successfully).
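Something like this ordering, as a minimal sketch (save_fn and the path layout are hypothetical here, not PerceptiLabs internals):

```python
import os

def rotate_checkpoint(save_fn, path):
    """Write the new checkpoint under a temporary name first; the old
    file is only replaced once the new one is safely on disk."""
    tmp_path = path + ".tmp"
    save_fn(tmp_path)           # new checkpoint fully written
    os.replace(tmp_path, path)  # atomic swap; old file disappears only now
```

That way a crash mid-save leaves the previous checkpoint untouched.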

OK, it’s not saving checkpoints after training for me, so maybe my info is incomplete, but currently a sample checkpoint folder contains only “checkpoint_model.json”.

If the checkpoint is a .json then I don’t see how that code deletes it… and currently the only checkpoint-named file I have is indeed a JSON file, ‘checkpoint_model.json’. Can you clarify?

(Especially: what could the OP’s system have found to delete? Training checkpoint files that aren’t JSON? That would imply some checkpoints (e.g. model) are JSON and others (e.g. training) aren’t, which would be confusing.)

Also: even if that’s what is happening, it seems like a missed opportunity. Why not just add a new timestamped checkpoint file and let the user choose which checkpoint (the latest by default) is used for other actions that rely on this file? Then the user could compare the latest against earlier ones.
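Roughly like this (a sketch with a hypothetical layout, not how PerceptiLabs currently stores anything):

```python
import glob
import os
import time

def save_timestamped(save_fn, folder):
    """Save under a new timestamped name instead of overwriting."""
    path = os.path.join(folder, f"checkpoint-{int(time.time())}.ckpt")
    save_fn(path)
    return path

def latest_checkpoint(folder):
    """Default to the newest checkpoint; the full list stays around so
    the user can pick an earlier one to compare against."""
    paths = sorted(glob.glob(os.path.join(folder, "checkpoint-*.ckpt")),
                   key=os.path.getmtime)
    return paths[-1] if paths else None
```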

Hi,

Agreed on writing the new checkpoints first and deleting the old ones afterwards; we will look at re-adjusting that order :slight_smile:
The idea is to at some point put these under version control, at which point it will also be possible to go back to a previously trained model.
Having multiple checkpoints might be a good first version, though. And we are looking into letting different versions of a model (different checkpoints) be compared to each other.

The checkpoint_model.json is not the checkpoints; the checkpoints for TensorFlow look something like this:
[screenshot: a directory of TensorFlow checkpoint files]
The checkpoint_model.json is there to freeze what the PerceptiLabs model that created those checkpoints looked like. This is so that you can modify the normal model.json as much as you want without having to worry about a mismatch with your latest trained model’s checkpoints.
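For reference, the standard TF2 way of producing those files (a generic sketch using the public tf.train API, not our internal code):

```python
import tensorflow as tf

# A tiny stand-in model, just so there are some variables to checkpoint.
model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
model(tf.zeros([1, 8]))  # build the layer so it has weights

ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, directory="./ckpts", max_to_keep=3)

manager.save()
# ./ckpts now contains e.g.:
#   checkpoint                    (small bookkeeping text file)
#   ckpt-1.index
#   ckpt-1.data-00000-of-00001

ckpt.restore(manager.latest_checkpoint)  # reload the newest one
```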

Maybe it’s just me, but I find that very confusing. There’s the structure of the model, and then there are the trained weights etc. checkpoint_model.json just records node relationships and properties; the other files are something else entirely. So there are structure checkpoints and training checkpoints?

Would it be possible to shift the version info into the filename rather than the extension?

I might, for example, associate an extension with Notepad++… as illustrated, there is no fixed (implied) file type.

Hmm, I don’t quite follow. The checkpoints are TensorFlow’s standard format. I do agree that they are a bit confusing though.

There is only one type of checkpoint, and it is produced automatically after you train a model.

(Hmm, I don’t follow.) ^2 :slight_smile:

The checkpoint_model.json is not the checkpoints

Um. Then the word “checkpoint” in the filename refers to something different from the “checkpoint” at the end of that sentence… I think using one word for two different things in such close proximity, and with such related functionality, is awkward, especially when the one file that is closest to the standard (TF) checkpoint doesn’t contain the word at all (just the abbreviation ckpt).

Model checkpoint vs training (of the model) checkpoint seems to be the cause of my confusion.

Maybe we should call the checkpoint_model.json “frozen_model.json” instead; that might be more accurate and more intuitive :thinking:

Then I’d be confused about what “frozen_” meant :wink:

General question (I think we have touched on this before, but I can’t remember when or in how much detail):

Models evolve, typically under the selective pressure of training results :slight_smile: . It would be good to keep the model (structure & state) tightly linked to its training, so that a model update does not automatically invalidate the training data or render it inaccessible.

I almost want to suggest that the focus be on the training… an untrained model may be saved, but as soon as it is trained, a frozen copy is created and linked to the training results, and subsequent changes to the model are just ordinary saves again. From the training results one would be able to view the model that delivered them.
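A minimal sketch of that flow (all names and paths are hypothetical, just to illustrate the idea):

```python
import shutil
from pathlib import Path

def freeze_model_for_run(model_json: Path, run_dir: Path) -> Path:
    """When training starts, snapshot the current model definition next
    to that run's results, so later edits to model.json cannot orphan
    the checkpoints it produced."""
    run_dir.mkdir(parents=True, exist_ok=True)
    frozen = run_dir / "checkpoint_model.json"
    shutil.copy(model_json, frozen)
    return frozen
```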

Is that what you were in fact implying?

That’s exactly how it’s currently working :slight_smile:

Nice!!! [pad to min post len :slight_smile: ]