ActiveStream Monotonic(lastReadSeqno) failure: a filtered stream may prematurely send a SeqnoAdvanced with the first SnapshotMarker of a memory snapshot

Description

A collection-filtered stream that is streaming an open checkpoint with the shape [e:N+1, cs:N+1, set_vbucket_state:N+1, ...], and that was opened with snapshot:{N, N+1}, will observe a "gap" when the checkpoint processor runs.

This is because the stream has snap_end_seqno_ = N+1, taken from the StreamRequest. processItems() then sees the set_vbucket_state at the same seqno and queues a SeqnoAdvanced to "complete" the snapshot by calling sendSnapshotAndSeqnoAdvanced().

This happens only when the marker is the first one to be sent for that stream and the gap extends up to the snap_end_seqno of the request.

Since that checkpoint is open, we can queue a mutation, which will have the same seqno:N+1 (as meta items have the same seqno as the next mutation).

Processing that mutation causes a break in the monotonicity of lastReadSeqno (it sees N+1 twice).
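
The invariant that breaks can be illustrated with a minimal Python sketch. The `Monotonic` class and the `replay` helper below are hypothetical stand-ins written for this ticket, not kv_engine's C++ implementation; the only assumption is that assigning a value that is not strictly greater than the current one is rejected.

```python
class Monotonic:
    """Hypothetical model: holds a value that may only strictly increase."""

    def __init__(self, value):
        self.value = value

    def set(self, new_value):
        # Reject anything that is not strictly greater than the current value.
        if new_value <= self.value:
            raise ValueError(
                f"Monotonic violation: next:{new_value} <= current:{self.value}"
            )
        self.value = new_value


def replay(start_seqno, seqnos_read):
    """Feed seqnos into a Monotonic and return the first violation, if any."""
    last_read_seqno = Monotonic(start_seqno)
    for seqno in seqnos_read:
        try:
            last_read_seqno.set(seqno)
        except ValueError as err:
            return str(err)
    return None


N = 16
# First the SeqnoAdvanced queued for set_vbucket_state at N+1,
# then the mutation that is later queued at the same seqno N+1.
print(replay(N, [N + 1, N + 1]))
```

With N = 16 the second update is rejected because the value 17 is presented twice, which mirrors the "sees N+1 twice" condition described above.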

Scenario 1

1. Client is on seqno 16 and requests StreamRequest{start:16, end:MAX, snapStart:16, snapEnd:16} and vBucket highSeqno is at 16.
2. Then, KV processes mutation:17 and immediately sends SnapshotMarker{start:16, end:17}. It starts at 16 because it is the first marker and in that case we start at the snapStart in the StreamRequest.
3. Then, KV crashes and fails to persist 17. Either the same vBucket comes back online as active or a replica is promoted at seqno 16.
4. The client reconnects with StreamRequest{start:16, end:-1, snapStart:16, snapEnd:17}. The snapEnd is the one seen in the SnapshotMarker. KV doesn't have the mutation at 17, but it has something else at that seqno: set_vbucket_state. So in this case, it sends the first marker followed immediately by SeqnoAdvanced:17, and updates the Monotonic lastReadSeqno of the stream to 17.
5. Then KV processes a mutation:17 and that breaks the monotonicity of lastReadSeqno (next:17 == current:17).
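
Steps 4 and 5 can be replayed with a toy model of the stream. Everything here is an illustrative assumption: `run_stream`, its parameters, and the tuple-based messages are made up for this sketch and only approximate the behaviour described above. The `advance_on_meta` flag is just a knob to contrast the two outcomes; it is not the actual fix.

```python
def run_stream(start, snap_start, snap_end, processor_runs, advance_on_meta=True):
    """Toy model of the first-marker path; returns (sent_messages, error)."""
    last_read_seqno = start
    first_marker_sent = False
    sent = []
    for items in processor_runs:  # one list per checkpoint processor run
        mutations = []
        highest_seen = last_read_seqno
        for kind, seqno in items:
            highest_seen = max(highest_seen, seqno)
            if kind != "mutation":
                continue  # meta items are never sent to the client
            if seqno <= last_read_seqno:
                return sent, f"next:{seqno} <= current:{last_read_seqno}"
            last_read_seqno = seqno
            mutations.append(seqno)
        if mutations:
            # The first marker starts at the snapStart of the StreamRequest.
            marker_start = mutations[0] if first_marker_sent else snap_start
            sent.append(("SnapshotMarker", marker_start, mutations[-1]))
            sent.extend(("Mutation", s) for s in mutations)
            first_marker_sent = True
        elif (advance_on_meta and not first_marker_sent
              and highest_seen >= snap_end > last_read_seqno):
            # Nothing visible was read, but an item reached the requested
            # snapEnd, so the snapshot is "completed" with a SeqnoAdvanced.
            sent.append(("SnapshotMarker", snap_start, snap_end))
            sent.append(("SeqnoAdvanced", snap_end))
            last_read_seqno = snap_end
            first_marker_sent = True
    return sent, None


# Scenario 1 after the reconnect: set_vbucket_state at 17, then mutation 17.
runs = [[("set_vbucket_state", 17)], [("mutation", 17)]]
print(run_stream(16, 16, 17, runs))
print(run_stream(16, 16, 17, runs, advance_on_meta=False))
```

In this model the first call stops with `next:17 <= current:17` after the marker and SeqnoAdvanced:17 have been sent, while the second call, which never advances on the meta item, delivers mutation 17 inside SnapshotMarker{start:16, end:17} without a violation.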

Scenario 2

Similar to the above, but the snapEnd is more than 1 above the vBucket highSeqno.

1. Client is on seqno 16.
2. KV processes mutations 17-27 and immediately sends SnapshotMarker{start:17, end:27}.
3. Then, KV crashes and fails to persist 17-27. Either the same vBucket comes back online as active or a replica is promoted at seqno 17.
4. The client reconnects with StreamRequest{start:17, end:-1, snapStart:17, snapEnd:27}. The snapEnd is the one seen in the SnapshotMarker. KV doesn't have the mutations 17-27. When a mutation at seqno 17 becomes available and is not matched by the ActiveStream filter, we trigger the same code path to send SeqnoAdvanced(snapEnd). So in this case, it sends the first marker followed immediately by SeqnoAdvanced:27, and updates the Monotonic lastReadSeqno of the stream to 27.
5. Then KV processes a mutation:18 and that breaks the monotonicity of lastReadSeqno (next:18 < current:27).

These are demonstrated by https://review.couchbase.org/c/kv_engine/+/213654.

Release Notes

Issue

After a hard failover or a crash with data loss of the Data Service, a DCP client (affects ElasticSearch and Kafka Connectors) which has seen a partial snapshot can trigger an edge case which causes repeated Data Service crashes.

Resolution

This edge case has been fixed.

Activity

CB Robot 
February 25, 2025 at 2:03 AM

Build couchbase-columnar-1.2.0-1009 contains kv_engine commit 26c0f65 with commit message:
MB-62984: Merge 'couchbase/trinity' into 'couchbase/cypher'

Fixed

Gerrit Reviews

[[BP] : Reset the snapEndSeqno|https://review.couchbase.org/c/kv_engine/+/218145] repo:kv_engine branch:7.2.2
[MB-62984: [BP] Reset the snapEndSeqno|https://review.couchbase.org/c/kv_engine/+/214693] repo:kv_engine branch:7.6.2

Created August 1, 2024 at 3:12 PM
Updated March 21, 2025 at 2:45 AM
Resolved August 15, 2024 at 1:34 PM