CheckpointModel

Checkpointing is a technique for reducing the impact of machine failures. When checkpointing is enabled, tasks take periodic snapshots of their state. If a task fails, it can be restarted from the last snapshot instead of from the beginning.

A user can define a checkpoint model using the following parameters:

| Variable | Type | Required? | Default | Description |
|----------|------|-----------|---------|-------------|
| checkpointInterval | Int64 | no | 3600000 | The time between checkpoints in ms |
| checkpointDuration | Int64 | no | 300000 | The time to create a snapshot in ms |
| checkpointIntervalScaling | Double | no | 1.0 | The scaling of the checkpointInterval after each successful checkpoint. The default of 1.0 means no scaling happens. |

Example

{
"checkpointInterval": 3600000,
"checkpointDuration": 300000,
"checkpointIntervalScaling": 1.5
}

In this example, the checkpoint interval starts at one hour (3600000 ms), and creating each snapshot takes 5 minutes (300000 ms). The checkpointIntervalScaling is set to 1.5, so after each successful checkpoint the interval to the next checkpoint is multiplied by 1.5, an increase of 50% (for example, from 1 hour to 1.5 hours).
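
To make the effect of the scaling factor concrete, the following Python sketch computes the first few checkpoint intervals for the configuration above. It is only an illustration, not the simulator's implementation: the helper name `checkpoint_intervals` is hypothetical, and the sketch assumes the scaling is applied multiplicatively after every successful checkpoint, with no rounding.

```python
def checkpoint_intervals(interval_ms, scaling, count):
    """Return the first `count` checkpoint intervals in ms.

    Assumption: the interval is multiplied by `scaling` after every
    successful checkpoint (hypothetical helper, for illustration only).
    """
    intervals = []
    current = float(interval_ms)
    for _ in range(count):
        intervals.append(current)
        current *= scaling
    return intervals


# Values taken from the example configuration.
config = {
    "checkpointInterval": 3600000,
    "checkpointDuration": 300000,
    "checkpointIntervalScaling": 1.5,
}

intervals = checkpoint_intervals(
    config["checkpointInterval"],
    config["checkpointIntervalScaling"],
    3,
)

# Convert milliseconds to hours for readability.
hours = [ms / 3_600_000 for ms in intervals]
print(hours)  # [1.0, 1.5, 2.25]
```

With the default checkpointIntervalScaling of 1.0, the same function returns a constant interval, which matches the "no scaling" behaviour described in the table.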