CheckpointModel
Checkpointing is a technique to reduce the impact of machine failure. When using Checkpointing, tasks make periodical snapshots of their state. If a task fails, it can be restarted from the last snapshot instead of starting from the beginning.
A user can define a checkpoint model using the following parameters:
Variable | Type | Required? | Default | Description |
---|---|---|---|---|
checkpointInterval | Int64 | no | 3600000 | The time between checkpoints in ms |
checkpointDuration | Int64 | no | 300000 | The time to create a snapshot in ms |
checkpointIntervalScaling | Double | no | 1.0 | The scaling of the checkpointInterval after each successful checkpoint. The default of 1.0 means no scaling happens. |
Example
{
"checkpointInterval": 3600000,
"checkpointDuration": 300000,
"checkpointIntervalScaling": 1.5
}
In this example, a snapshot is created every hour, and the snapshot creation takes 5 minutes. The checkpointIntervalScaling is set to 1.5, which means that after each successful checkpoint, the interval between checkpoints will be increased by 50% (for example from 1 to 1.5 hours).