@kapaseker You're right. The problem is that once an error is introduced, the other nodes have no protection against it. For example:
1. The Bot Management module's design lacks fault tolerance. This comes down to two things: an implicit assumption about the size of the input data, and the error-handling strategy for when that assumption is violated. The current approach is to stop on failure.
2. Internal changes lack a global impact assessment. For example, optimizing and adjusting ClickHouse's permission management can affect the components coupled to it. Could this have been checked by running a real query in the test environment?
3. Monitoring and fault drills are insufficient. This internal anomaly was misdiagnosed as a DDoS attack, which delayed the response.
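
To make point 1 concrete, here is a rough sketch of the alternative to "stop on failure": validate the size assumption explicitly and, when it is violated, keep serving the last known-good data instead of crashing. This is only an illustration, not the actual module. `MAX_FEATURES`, `FeatureStore`, and `parse_features` are hypothetical names, and the limit value is made up.

```python
from dataclasses import dataclass, field

# Hypothetical size limit, for illustration only.
MAX_FEATURES = 200


class FeatureFileTooLarge(ValueError):
    """Raised when the input violates the assumed size limit."""


def parse_features(lines, max_features=MAX_FEATURES):
    """Parse feature names and check the size assumption explicitly."""
    features = [ln.strip() for ln in lines if ln.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"got {len(features)} features, limit is {max_features}"
        )
    return features


@dataclass
class FeatureStore:
    """Holds the active feature list; bad updates never replace it."""
    active: list = field(default_factory=list)
    rejected_updates: int = 0  # a counter that monitoring could alert on

    def update(self, lines):
        try:
            # Assignment only happens if parsing and validation succeed.
            self.active = parse_features(lines)
            return True
        except FeatureFileTooLarge:
            # Keep the last known-good list instead of stopping.
            self.rejected_updates += 1
            return False
```

With something like this, an oversized file only bumps `rejected_updates` and the previous feature list stays in use, so the failure is visible to monitoring without taking the node down.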