[Documentation]: Successfully acknowledged sync-write is missing from the bucket when rebalance failure is simulated via memcached kill.

Description

1. Create a 2-node cluster:

+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 10.112.180.101 | [u'kv']  | Cluster node |
| 10.112.180.102 | None     | <--- IN ---  |
+----------------+----------+--------------+
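The resulting node list can be verified over the cluster REST API. A minimal sketch using the requests library, with placeholder admin credentials:

```python
# Verify the node list of the 2-node cluster (placeholder credentials).
import requests

resp = requests.get("http://10.112.180.101:8091/pools/default",
                    auth=("Administrator", "password"))
resp.raise_for_status()
for node in resp.json()["nodes"]:
    print(node["hostname"], node.get("services"), node["clusterMembership"])
```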

2. Create bucket:

POST http://10.112.180.101:8091/pools/default/buckets with params: replicaIndex=1&maxTTL=0&flushEnabled=1&compressionMode=off&bucketType=membase&name=default&replicaNumber=1&ramQuotaMB=654&threadsNumber=3&evictionPolicy=valueOnly
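The same call expressed with requests, with all parameters taken from the request above and placeholder credentials:

```python
# Create the "default" bucket (parameters from the original request).
import requests

resp = requests.post(
    "http://10.112.180.101:8091/pools/default/buckets",
    auth=("Administrator", "password"),  # placeholder credentials
    data={
        "name": "default",
        "bucketType": "membase",
        "ramQuotaMB": 654,
        "replicaNumber": 1,
        "replicaIndex": 1,
        "maxTTL": 0,
        "flushEnabled": 1,
        "compressionMode": "off",
        "threadsNumber": 3,
        "evictionPolicy": "valueOnly",
    },
)
resp.raise_for_status()
```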

3. Load 100k docs (test_docs-0 .. test_docs-99999) with durability=majority.
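For illustration, a minimal sketch of such a durable load with the Couchbase Python SDK; the module paths follow the 4.x SDK and may differ in other versions, and the credentials and document bodies are hypothetical:

```python
# Durable load: the server acks each insert only after the mutation has
# reached a majority of the configured replicas in memory.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.durability import Durability, ServerDurability
from couchbase.options import ClusterOptions, InsertOptions

cluster = Cluster(
    "couchbase://10.112.180.101",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
collection = cluster.bucket("default").default_collection()

opts = InsertOptions(durability=ServerDurability(Durability.MAJORITY))
for i in range(100_000):
    collection.insert("test_docs-%d" % i, {"index": i}, opts)  # hypothetical body
```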

4. Change the bucket replica count to 2, add 10.112.180.103, remove 10.112.180.102, and start the rebalance. In parallel, load another 100k docs (test_docs-100000 .. test_docs-199999). A REST sketch of steps 4-6 follows this list.
5. Kill memcached on 10.112.180.101 when the rebalance reaches ~40%. The rebalance fails (intentionally). Data loading is still in progress, with the expected exceptions.
6. Restart the rebalance and wait for it to finish; it finishes properly.
7. Wait for the data load to finish; the retries of all caught exceptions succeed.
8. Validate the data.
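A sketch of steps 4-6 against the cluster REST API, again assuming the requests library and placeholder credentials; knownNodes and ejectedNodes take otpNode ids of the form ns_1@<ip>:

```python
import requests

ADMIN = ("Administrator", "password")  # placeholder credentials
BASE = "http://10.112.180.101:8091"

# Step 4: raise the bucket replica count to 2.
requests.post(f"{BASE}/pools/default/buckets/default",
              auth=ADMIN, data={"replicaNumber": 2}).raise_for_status()

# Step 4: add the incoming node, then start the swap rebalance,
# ejecting 10.112.180.102.
requests.post(f"{BASE}/controller/addNode", auth=ADMIN,
              data={"hostname": "10.112.180.103", "user": ADMIN[0],
                    "password": ADMIN[1], "services": "kv"}).raise_for_status()
known = ",".join(f"ns_1@10.112.180.{n}" for n in (101, 102, 103))
requests.post(f"{BASE}/controller/rebalance", auth=ADMIN,
              data={"knownNodes": known,
                    "ejectedNodes": "ns_1@10.112.180.102"}).raise_for_status()

# Step 5: poll progress; memcached on .101 is killed out-of-band
# (e.g. over SSH) once progress reaches ~40%, which fails the rebalance.
progress = requests.get(f"{BASE}/pools/default/rebalanceProgress",
                        auth=ADMIN).json()

# Step 6: restarting the rebalance is the same /controller/rebalance POST.
```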

Actual result:
Data validation failed: a few keys for which the sync-write had returned success are missing.
Missing keys: ['test_docs-130287', 'test_docs-130289', 'test_docs-130282', 'test_docs-130294', 'test_docs-130291']

Expected Result:
All the data should be present, since all exceptions were caught and the affected keys re-inserted.

In the attached pcap, apply the filter couchbase.opaque == 0xe3080000 and see packet number 619311, which is the insert request for key test_docs-130287. Packet number 619329 is the success response for it.

But the key is missing from the bucket.

Note: The pcap is quite big, so please apply the filter. I tried to save the filtered packets through Wireshark, but ran into an issue and couldn't do it.
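If saving the filtered packets from the Wireshark UI fails, they can also be extracted programmatically. A sketch using pyshark (a tshark wrapper, not used in the original report; the pcap file name is a placeholder):

```python
# Print the frame numbers matching the opaque filter; expect 619311
# (insert request for test_docs-130287) and 619329 (success response).
import pyshark

cap = pyshark.FileCapture(
    "capture.pcap",  # placeholder name for the attached pcap
    display_filter="couchbase.opaque == 0xe3080000",
)
for pkt in cap:
    print(pkt.number)
cap.close()
```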

QE Note:

-t rebalance_new.swaprebalancetests.SwapRebalanceFailedTests.test_failed_swap_rebalance,nodes_init=2,replicas=1,standard_buckets=1,num-swap=1,new_replica=2,percentage_progress=40,GROUP=P0;durability,durability=MAJORITY,skip_cleanup=True -p infra_log_level=debug,log_level=debug -m rest

Affects versions

Fix versions

Labels

Environment

6.5.0-4676

Link to Log File, atop/blg, CBCollectInfo, Core dump

https://cb-jira.s3.us-east-2.amazonaws.com/logs/DataMissing/collectinfo-2019-10-24T093355-ns_1%4010.112.180.101.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/DataMissing/collectinfo-2019-10-24T093355-ns_1%4010.112.180.103.zip

Release Notes Description

None

Attachments

6
  • 28 Oct 2019, 01:20 PM
  • 28 Oct 2019, 11:35 AM
  • 28 Oct 2019, 05:30 AM
  • 24 Oct 2019, 11:02 AM
  • 24 Oct 2019, 11:01 AM
  • 24 Oct 2019, 09:35 AM

Activity


Raju Suravarjjala November 12, 2019 at 6:56 PM

Bulk closing invalid, won't-fix, and duplicate bugs.

Ritam Sharma November 5, 2019 at 1:02 PM

We are seeing a different behaviour. Will log a new ticket with logs. Thank you.

Dave Rigby November 5, 2019 at 12:47 PM

I believe that for an Ephemeral bucket, any SyncWrites which returned success to the client should not be lost if the active crashes. This is because auto-reprovisioning should promote one of the replicas (and the old active if/when it comes back will become a replica).

It wasn't clear from your comment whether you are seeing this behaviour or something different; if the latter, please raise a separate MB on Ephemeral and we can investigate.

Ritesh Agarwal November 5, 2019 at 9:55 AM
Edited

What is expected in the case of an ephemeral bucket, given that auto-reprovisioning is enabled?

Scenario:
For a given key, the active has responded success to the client and is then immediately killed. The prepare on the replica still has to be processed, but I am seeing all those keys getting lost. For ephemeral buckets there is no rollback involved in this case, so there should be no data loss for acked keys.

The replica should commit all the prepares it has acknowledged.

Dave Finlay October 31, 2019 at 5:17 AM

Basically yes. Our competitors offer this kind of trade-off between performance and failure modes too. Obviously we do need to be clear with users in our docs about this trade-off.

Resolution: Not a Bug

Details

Assignee

Reporter

Is this a Regression?: Unknown

Triage: Untriaged

Priority

Created October 24, 2019 at 9:38 AM
Updated November 12, 2019 at 6:56 PM
Resolved October 28, 2019 at 5:05 PM