Rust just killed the internet

On November 18, 2025, Cloudflare experienced a six-hour outage affecting their core CDN services. The root cause was a Rust panic in their FL2 proxy system when a Bot Management configuration file exceeded expected limits. As a QA engineer, I see this incident as highlighting several testing gaps that could have caught the issue before it reached production.

What Happened#

At 11:05 UTC, Cloudflare deployed a ClickHouse database permissions change. This change caused a SQL query to return duplicate rows, which pushed the Bot Management feature count from ~60 to over 200. The FL2 proxy had a hard limit of 200 features with preallocated memory, and when this limit was exceeded, the code called unwrap() on an Err value, causing a panic.

The problematic query:

SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

This query lacked a database filter, assuming it would only return columns from the default database. After the permissions change, it also returned columns from the r0 database.
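Cloudflare's write-up notes that the query assumed a single source database. One way to make that contract explicit is a database filter; the snippet below illustrates the idea and is not necessarily their exact remediation:

SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
  AND database = 'default'
ORDER BY name;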

The Rust code that panicked:

// Simplified version of the error
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
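The public report shows only the panic message, so the following is a hypothetical miniature of the failure mode rather than FL2's actual code: a loader rejects files above a preallocated limit, and the caller unwraps the Err instead of handling it.

// Hypothetical sketch, not Cloudflare's code: the loader returns an Err
// for oversized files, and the caller panics by calling unwrap() on it.
const MAX_FEATURES: usize = 200;

fn load_features(names: Vec<String>) -> Result<Vec<String>, String> {
    if names.len() > MAX_FEATURES {
        return Err(format!("too many features: {} > {}", names.len(), MAX_FEATURES));
    }
    Ok(names)
}

fn main() {
    // The duplicated query results produce more than 200 feature names.
    let names: Vec<String> = (0..201).map(|i| format!("feature_{i}")).collect();

    // This is the pattern that crashed the worker thread: unwrap() on an
    // Err value panics instead of falling back to a known-good configuration.
    let _features = load_features(names).unwrap();
}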

Testing Gaps#

1. Missing Integration Tests for Database Changes#

The ClickHouse permissions change was deployed without integration tests that validated the Bot Management feature file generation. A proper test suite should have included:

use std::collections::HashSet;

#[test]
fn test_feature_file_generation_with_new_permissions() {
    // Simulate the new permission model
    let query_result = run_query_with_new_permissions();

    // Validate the feature count stays within limits
    assert!(query_result.features.len() <= 200);

    // Validate no duplicate features
    let unique_features: HashSet<_> = query_result.features.iter().collect();
    assert_eq!(unique_features.len(), query_result.features.len());
}

2. No Boundary Testing for Configuration Limits#

The system had a hard limit of 200 features but was only using ~60. This 3x buffer seemed safe, but there were no tests validating behavior at or near the limit:

#[test]
fn test_feature_limit_boundary() {
    // Test at exactly 200 features
    let features = generate_features(200);
    assert!(load_features(features).is_ok());

    // Test at 201 features (should fail gracefully, not panic)
    let features = generate_features(201);
    match load_features(features) {
        Ok(_) => panic!("Should have returned an error"),
        Err(e) => {
            // Reaching this branch means the overflow surfaced as an Err value
            // rather than a panic; the error should be descriptive enough to log.
            assert!(!e.to_string().is_empty());
        }
    }
}

3. Improper Error Handling#

The code used unwrap() instead of proper error handling. This should have been caught in code review, but automated testing can also help:

// Bad: panics on error
let features = parse_features(data).unwrap();

// Good: handles error gracefully
let features = match parse_features(data) {
    Ok(f) => f,
    Err(e) => {
        log::error!("Failed to parse features: {}", e);
        return default_features();
    }
};

Clippy's unwrap_used lint, enforced in CI as shown in the prevention section below, can flag all unwrap() calls in production code paths.

4. Lack of Chaos Engineering#

The intermittent nature of the failure (good file, then bad file every 5 minutes) suggests the system wasn’t tested under gradual rollout conditions. Chaos engineering tests could simulate:

  • Partial cluster updates
  • Inconsistent data across nodes
  • Configuration file corruption
  • Resource exhaustion scenarios
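To make the alternating good-file/bad-file scenario concrete, here is a sketch built on the same hypothetical generate_features and load_features helpers as the tests above: every other refresh delivers an oversized file, and the proxy must keep serving with its last known-good configuration instead of panicking.

#[test]
fn test_alternating_good_and_bad_config_files() {
    let mut active = load_features(generate_features(60)).expect("initial load");

    for refresh in 0..10 {
        let candidate = if refresh % 2 == 0 {
            generate_features(60) // good file
        } else {
            generate_features(250) // bad file, over the 200-feature limit
        };

        match load_features(candidate) {
            Ok(features) => active = features,
            Err(_) => { /* keep the last known-good configuration; no panic */ }
        }

        // The proxy should always hold a usable configuration.
        assert!(!active.is_empty() && active.len() <= 200);
    }
}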

5. Missing Contract Tests for SQL Queries#

The SQL query had an implicit contract: “return columns only from the default database.” This contract wasn’t documented or tested. A contract test would look like:

#[test]
fn test_feature_query_returns_only_default_database() {
    // Assumes the query is extended to also select the database column,
    // so the contract can be asserted directly.
    let result = query_feature_columns();
    for row in result {
        assert_eq!(
            row.database, "default",
            "Query should only return columns from default database"
        );
    }
}

Prevention Strategies#

1. Validate All Configuration Inputs#

Even internally generated configuration should be validated:

use std::collections::HashSet;

fn load_bot_features(config: &FeatureConfig) -> Result<Features, FeatureError> {
    // Validate before processing
    if config.features.len() > MAX_FEATURES {
        return Err(FeatureError::TooManyFeatures {
            count: config.features.len(),
            max: MAX_FEATURES,
        });
    }

    // Check for duplicates
    let unique_count = config.features.iter().collect::<HashSet<_>>().len();
    if unique_count != config.features.len() {
        return Err(FeatureError::DuplicateFeatures);
    }

    // Process features
    Ok(process_features(config))
}

2. Implement Gradual Rollout with Automated Rollback#

The configuration file was deployed globally every 5 minutes. A better approach:

  • Deploy to canary servers first
  • Monitor error rates and latency
  • Automatically roll back if thresholds are exceeded
  • Only proceed to full deployment after canary validation
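A minimal sketch of what such a canary gate could look like (the types, thresholds, and names here are illustrative, not Cloudflare's actual release tooling):

// Hypothetical canary gate: promote a new configuration to the full fleet
// only while canary error rates and latency stay under fixed thresholds.
struct CanaryReport {
    error_rate: f64,
    p99_latency_ms: f64,
}

enum RolloutDecision {
    Promote,
    Rollback,
}

fn evaluate_canary(report: &CanaryReport) -> RolloutDecision {
    // Thresholds are illustrative only.
    if report.error_rate < 0.001 && report.p99_latency_ms < 250.0 {
        RolloutDecision::Promote
    } else {
        RolloutDecision::Rollback
    }
}

fn main() {
    // A canary fleet that starts panicking shows up as a spike in error rate.
    let report = CanaryReport { error_rate: 0.27, p99_latency_ms: 900.0 };
    match evaluate_canary(&report) {
        RolloutDecision::Promote => println!("promote to full deployment"),
        RolloutDecision::Rollback => println!("roll back and halt the rollout"),
    }
}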

3. Add Property-Based Testing#

Property-based testing could have caught the duplicate row issue:

// Requires the quickcheck crate (and quickcheck_macros for the attribute);
// DatabaseState is assumed to implement Arbitrary so inputs can be generated.
#[quickcheck]
fn feature_query_returns_unique_rows(db_state: DatabaseState) -> bool {
    let result = query_feature_columns(&db_state);
    let unique_count = result.iter().collect::<HashSet<_>>().len();
    unique_count == result.len()
}

4. Enforce Error Handling Standards#

Add CI checks to prevent unwrap() in critical paths:

# In CI pipeline
cargo clippy -- -D clippy::unwrap_used -D clippy::expect_used

Or take a more targeted approach with #![deny] attributes in specific modules:

#![deny(clippy::unwrap_used)]
#![deny(clippy::expect_used)]

5. Load Testing with Realistic Data Variations#

Load tests should include:

  • Configuration files at various sizes (10%, 50%, 90%, 100%, 110% of limits)
  • Malformed configuration data
  • Unexpected data types or structures
  • Concurrent configuration updates
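A size-sweep test along these lines, again using the hypothetical generate_features and load_features helpers from earlier, exercises the loader at each fraction of the limit and asserts it never panics:

#[test]
fn test_feature_loading_across_size_sweep() {
    let limit = 200usize;
    for fraction in [0.10, 0.50, 0.90, 1.00, 1.10] {
        let count = (limit as f64 * fraction).round() as usize;
        let result = load_features(generate_features(count));
        if count <= limit {
            assert!(result.is_ok(), "{count} features should load");
        } else {
            assert!(result.is_err(), "{count} features should be rejected, not panic");
        }
    }
}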

6. Better Observability in Testing#

The system had observability in production, but tests should also validate:

  • Error logging works correctly
  • Metrics are emitted for configuration loads
  • Alerts trigger at appropriate thresholds

// Assumes load_features reports through the collector created here,
// for example via a test-scoped metrics registry.
#[test]
fn test_feature_limit_exceeded_emits_metric() {
    let metrics = MetricsCollector::new();
    let features = generate_features(201);
    let _ = load_features(features);
    assert!(metrics.has_metric("bot_features.limit_exceeded"));
}

Key Takeaways for QA Engineers#

  1. Test database schema changes: Any database migration or permissions change should have integration tests validating dependent queries.

  2. Boundary testing is critical: Don’t just test happy paths. Test at limits, beyond limits, and with malformed data.

  3. Validate error handling: Ensure errors are handled gracefully, not with panics or unwraps.

  4. Test gradual rollouts: Simulate partial deployments and inconsistent states across clusters.

  5. Automate configuration validation: Treat configuration as code and validate it with the same rigor.

  6. Document implicit contracts: If code assumes certain data shapes or limits, document and test those assumptions.

The Cloudflare incident wasn’t caused by a lack of testing infrastructure. They have sophisticated monitoring, gradual rollouts, and multiple environments. The gap was in testing the interaction between systems during changes. As QA engineers, our job is to think about these edge cases and build test suites that catch them before they reach production.

Based on Cloudflare’s incident report published on November 18, 2025.

Author: Dominic Brooks
Published: 2025-11-18
License: CC BY-NC-SA 4.0