If you need to configure high reliability for a Flink application, you can set the parameters when creating your Flink jobs.
The reliability configuration of a Flink Jar job is the same as that of a SQL job, so it is not described separately in this section.
Total number of CUs = Number of manager CUs + (Total number of concurrent operators / Number of slots of a TaskManager) x Number of TaskManager CUs
For example, with a total of 9 CUs (1 of which is a manager CU) and a maximum of 16 concurrent jobs, the number of compute-specific CUs is 8 (9 - 1).
If you do not configure TaskManager specifications, a TaskManager occupies 1 CU by default and has no slots configured. To ensure high reliability, set the number of slots of the TaskManager to 2, according to the preceding formula. In the preceding example, 2 slots per TaskManager gives 16 / 2 x 1 = 8 compute-specific CUs.
Set the maximum number of concurrent jobs to twice the number of compute-specific CUs. For example, with 8 compute-specific CUs, set it to 16.
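The CU formula can be checked with a short calculation. The following Python sketch is illustrative only: the function name total_cus and its parameter names are hypothetical and are not DLI parameters, and rounding the TaskManager count up to a whole number is an assumption, because the formula does not say how non-integer results are handled.

```python
import math


def total_cus(manager_cus, parallelism, slots_per_tm, cus_per_tm):
    """Total CUs = manager CUs + (parallelism / slots per TaskManager) x CUs per TaskManager.

    All names here are hypothetical and used for illustration only.
    """
    if slots_per_tm <= 0:
        raise ValueError("slots_per_tm must be a positive integer")
    # Assumption: a partially filled TaskManager still counts as a whole one.
    task_managers = math.ceil(parallelism / slots_per_tm)
    return manager_cus + task_managers * cus_per_tm


# Values from the example: 1 manager CU, 16 concurrent jobs,
# 2 slots per TaskManager, 1 CU per TaskManager (the default).
total = total_cus(manager_cus=1, parallelism=16, slots_per_tm=2, cus_per_tm=1)
compute_cus = total - 1
print(total, compute_cus)  # 9 8
```

With these values the result is 9 CUs in total, 8 of which are compute-specific, and the maximum of 16 concurrent jobs is twice that number of compute-specific CUs.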
DLI provides various monitoring metrics for Flink jobs. You can define alarm rules as required using different monitoring metrics for fine-grained job monitoring.