Multi-node graceful failover + rebalance out fails with "Rebalance exited with reason {badarg,[{io_lib,format,["Rebalancing bucket ~p with config ~p"

Description

Script to Repro

guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=95,crash_warning=True,bucket_storage=magma,enable_dp=True,retry_get_process_num=200 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=after,scrape_interval=2,rebalance_moves_per_node=64,skip_validations=False,GROUP=P0_failover_and_rebalance_out'

Steps to Repro
1. Create a 5-node cluster
2021-06-23 21:09:04,476 | test | INFO | pool-5-thread-6 | [table_view:display:72] Rebalance Overview
------------------------------------------------------------------------------
Nodes          | Services | Version               | CPU           | Status
------------------------------------------------------------------------------
172.23.98.196  | kv       | 7.1.0-1036-enterprise | 5.39450466347 | Cluster node
172.23.98.195  | None     |                       |               | <--- IN ---
172.23.121.10  | None     |                       |               | <--- IN ---
172.23.104.186 | None     |                       |               | <--- IN ---
172.23.120.201 | None     |                       |               | <--- IN ---
------------------------------------------------------------------------------

2. Create bucket/scopes/collections/data
2021-06-23 21:14:21,836 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
------------------------------------------------------------------------------------------------------------------------------
Bucket                                                                                      | Type      | Replicas | Durability | TTL | Items   | RAM Quota   | RAM Used   | Disk Used
------------------------------------------------------------------------------------------------------------------------------
4yZIQSEJJD6IzNnFF4kWkPgVRxWd0tC-1DBmRsF05LD1MVBbaPNwgg2vNjO8SCRd_syOC9QCrPVlNpP_x-15-414000 | couchbase | 2        | none       | 0   | 3000000 | 10485760000 | 2613283152 | 8185163251
------------------------------------------------------------------------------------------------------------------------------

3. Change a few settings and gracefully fail over 2 nodes (172.23.104.186 and 172.23.120.201)

2021-06-23 21:14:29,920 | test | INFO | MainThread | [collections_rebalance:setUp:59] Changing scrape interval to 2
2021-06-23 21:14:32,463 | test | INFO | MainThread | [cluster_ready_functions:set_rebalance_moves_per_nodes:129] Changed Rebalance settings: {u'rebalanceMovesPerNode': 64}
2021-06-23 21:14:32,464 | test | INFO | MainThread | [collections_rebalance:load_collections_with_rebalance:932] Doing collection data load after graceful_failover_rebalance_out
2021-06-23 21:14:32,466 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:390] Starting rebalance operation of type : graceful_failover_rebalance_out
2021-06-23 21:14:32,467 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:602] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.201 port:8091 ssh_username:root]

4. Wait for graceful failover to complete

2021-06-23 21:21:03,559 | test | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.104.186 - Node not part of cluster inactiveFailed
2021-06-23 21:21:03,561 | test | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.120.201 - Node not part of cluster inactiveFailed

5. Rebalance out the failed-over nodes
2021-06-23 21:21:59,946 | test | INFO | pool-5-thread-11 | [table_view:display:72] Rebalance Overview
-------------------------------------------------------------------------------
Nodes          | Services | Version               | CPU            | Status
-------------------------------------------------------------------------------
172.23.98.196  | kv       | 7.1.0-1036-enterprise | 10.5732484076  | Cluster node
172.23.98.195  | kv       | 7.1.0-1036-enterprise | 24.7787610619  | Cluster node
172.23.104.186 | kv       | 7.1.0-1036-enterprise | 1.00755667506  | --- OUT --->
172.23.120.201 | kv       | 7.1.0-1036-enterprise | 0.752823086575 | --- OUT --->
172.23.121.10  | kv       | 7.1.0-1036-enterprise | 12.2474747475  | Cluster node
-------------------------------------------------------------------------------

6. Rebalance fails with:

Rebalance exited with reason {badarg,[{io_lib,format,["Rebalancing bucket ~p with config ~p" ...

cbcollect_info attached.

Components

Affects versions

Fix versions

Labels

Environment

7.1.0-1036-enterprise

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

4
  • 15 Jul 2021, 01:29 AM
  • 14 Jul 2021, 12:09 PM
  • 24 Jun 2021, 05:34 AM
  • 24 Jun 2021, 05:32 AM

Activity

Balakumaran Gopal August 20, 2021 at 2:26 AM

Validated this on 7.0.2-6535 as well.

Balakumaran Gopal August 19, 2021 at 3:35 AM

Validated this on 7.0.1-6102.

CB robot August 12, 2021 at 3:59 AM

Build couchbase-server-7.1.0-1132 contains ns_server commit 2a578df with commit message:
MB-47087 Change chars_limit default from 50K to -1 (https://couchbasecloud.atlassian.net/browse/MB-47087)

CB robot August 11, 2021 at 10:34 PM

Build couchbase-server-7.0.1-6004 contains ns_server commit 2a578df with commit message:
MB-47087 Change chars_limit default from 50K to -1 (https://couchbasecloud.atlassian.net/browse/MB-47087)

Meni Hillel August 2, 2021 at 6:29 PM

In case someone is interested in assessing the blast radius: the circumstances under which this can happen are quite random. First, a log line needs to exceed 50K characters; the most likely occasion is during rebalance, where we print the vbmap, which is usually large, especially with long bucket names and/or a large number of nodes. Then, the log-line truncation needs to land right on an element of the "map" data type. I'd guesstimate that the number of customers upgrading to 7.0.0 in production with this kind of setup is not high.
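For anyone assessing the mechanism, below is a minimal Erlang sketch of the call shape involved. It is illustrative only: the module name and the fabricated BucketConfig/bucket name are assumptions, not ns_server source. The format string and the 50K chars_limit default come from this ticket, and whether the call actually raises badarg depends on the build and OTP version, so treat this as a shape illustration rather than a reliable reproducer.

-module(chars_limit_sketch).
-export([demo/0]).

%% Minimal sketch (not ns_server source). BucketConfig is a fabricated
%% stand-in for a real bucket config carrying a large vbmap.
demo() ->
    %% A large term containing map elements, so the pretty-printed output
    %% comfortably exceeds 50K characters.
    BucketConfig = [{map, [#{vbucket => N, chain => [a, b, c]}
                           || N <- lists:seq(0, 1023)]}],
    %% io_lib:format/3 with {chars_limit, 50000} truncates the rendered term.
    %% Per this ticket, a truncation landing on a map element is what raised
    %% the badarg on affected builds; the fix set the default to -1 (no limit).
    Line = io_lib:format("Rebalancing bucket ~p with config ~p",
                         ["bucket-1", BucketConfig],
                         [{chars_limit, 50000}]),
    io:format("~ts~n", [Line]).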

Fixed

Details

Assignee

Reporter

Is this a Regression?

No

Triage

Untriaged

Operating System

CentOS 64-bit

Story Points

Priority


Created June 24, 2021 at 5:33 AM
Updated August 30, 2021 at 9:28 PM
Resolved July 19, 2021 at 7:34 PM