Multi-node graceful failover + rebalance out fails with "Rebalance exited with reason {badarg,[{io_lib,format,["Rebalancing bucket ~p with config ~p"

Description

Script to Repro

guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=95,crash_warning=True,bucket_storage=magma,enable_dp=True,retry_get_process_num=200 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=after,scrape_interval=2,rebalance_moves_per_node=64,skip_validations=False,GROUP=P0_failover_and_rebalance_out'

Steps to Repro
1. Create a 5-node cluster
2021-06-23 21:09:04,476 | test | INFO | pool-5-thread-6 | [table_view:display:72] Rebalance Overview
------------------------------------------------------------------------------
Nodes          | Services | Version               | CPU           | Status
------------------------------------------------------------------------------
172.23.98.196  | kv       | 7.1.0-1036-enterprise | 5.39450466347 | Cluster node
172.23.98.195  | None     |                       |               | <--- IN ---
172.23.121.10  | None     |                       |               | <--- IN ---
172.23.104.186 | None     |                       |               | <--- IN ---
172.23.120.201 | None     |                       |               | <--- IN ---
------------------------------------------------------------------------------

2. Create bucket/scopes/collections/data
2021-06-23 21:14:21,836 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
------------------------------------------------------------------------------------------------------------------------------
Bucket                                                                                      | Type      | Replicas | Durability | TTL | Items   | RAM Quota   | RAM Used   | Disk Used
------------------------------------------------------------------------------------------------------------------------------
4yZIQSEJJD6IzNnFF4kWkPgVRxWd0tC-1DBmRsF05LD1MVBbaPNwgg2vNjO8SCRd_syOC9QCrPVlNpP_x-15-414000 | couchbase | 2        | none       | 0   | 3000000 | 10485760000 | 2613283152 | 8185163251
------------------------------------------------------------------------------------------------------------------------------

3. Change a few settings and gracefully fail over 2 nodes (172.23.104.186 and 172.23.120.201)

2021-06-23 21:14:29,920 | test | INFO | MainThread | [collections_rebalance:setUp:59] Changing scrape interval to 2
2021-06-23 21:14:32,463 | test | INFO | MainThread | [cluster_ready_functions:set_rebalance_moves_per_nodes:129] Changed Rebalance settings: {u'rebalanceMovesPerNode': 64}
2021-06-23 21:14:32,464 | test | INFO | MainThread | [collections_rebalance:load_collections_with_rebalance:932] Doing collection data load after graceful_failover_rebalance_out
2021-06-23 21:14:32,466 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:390] Starting rebalance operation of type : graceful_failover_rebalance_out
2021-06-23 21:14:32,467 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:602] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.201 port:8091 ssh_username:root]

4. Wait for graceful failover to complete

2021-06-23 21:21:03,559 | test | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.104.186 - Node not part of cluster inactiveFailed
2021-06-23 21:21:03,561 | test | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.120.201 - Node not part of cluster inactiveFailed

5. Rebalance out the failed-over nodes
2021-06-23 21:21:59,946 | test | INFO | pool-5-thread-11 | [table_view:display:72] Rebalance Overview
-------------------------------------------------------------------------------
Nodes          | Services | Version               | CPU            | Status
-------------------------------------------------------------------------------
172.23.98.196  | kv       | 7.1.0-1036-enterprise | 10.5732484076  | Cluster node
172.23.98.195  | kv       | 7.1.0-1036-enterprise | 24.7787610619  | Cluster node
172.23.104.186 | kv       | 7.1.0-1036-enterprise | 1.00755667506  | --- OUT --->
172.23.120.201 | kv       | 7.1.0-1036-enterprise | 0.752823086575 | --- OUT --->
172.23.121.10  | kv       | 7.1.0-1036-enterprise | 12.2474747475  | Cluster node
-------------------------------------------------------------------------------

6. Rebalance fails with:

Rebalance exited with reason {badarg,[{io_lib,format,["Rebalancing bucket ~p with config ~p" ...

cbcollect_info attached.

Components

Affects versions

Fix versions

Labels

Environment

7.1.0-1036-enterprise

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

4
  • 15 Jul 2021, 01:29 AM
  • 14 Jul 2021, 12:09 PM
  • 24 Jun 2021, 05:34 AM
  • 24 Jun 2021, 05:32 AM

Activity

Balakumaran Gopal August 20, 2021 at 2:26 AM

Validated this on 7.0.2-6535 as well.

Balakumaran Gopal August 19, 2021 at 3:35 AM

Validated this on 7.0.1-6102.

CB robot August 12, 2021 at 3:59 AM

Build couchbase-server-7.1.0-1132 contains ns_server commit 2a578df with commit message:
MB-47087 Change chars_limit default from 50K to -1 (https://couchbasecloud.atlassian.net/browse/MB-47087)

CB robot August 11, 2021 at 10:34 PM

Build couchbase-server-7.0.1-6004 contains ns_server commit 2a578df with commit message:
MB-47087 Change chars_limit default from 50K to -1 (https://couchbasecloud.atlassian.net/browse/MB-47087)

Meni Hillel August 2, 2021 at 6:29 PM

In case someone is interested in assessing the blast radius: the circumstances under which this can happen are quite random. First, a log line needs to exceed 50K characters; the most likely occasion is during rebalance, where we print the vbmap, which is usually large, especially with long bucket names and/or a large number of nodes. Then, the log-line truncation needs to land right on an element of the "map" data type. I'd guesstimate that the number of customers upgrading to 7.0.0 in production with this kind of setup is not high.
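For anyone assessing the mechanism, below is a minimal Erlang sketch of the call shape involved. It is illustrative only: the module name and the fabricated BucketConfig/bucket name are assumptions, not ns_server source. The format string and the 50K chars_limit default come from this ticket, and whether the call actually raises badarg depends on the build and OTP version, so treat this as a shape illustration rather than a reliable reproducer.

-module(chars_limit_sketch).
-export([demo/0]).

%% Minimal sketch (not ns_server source). BucketConfig is a fabricated
%% stand-in for a real bucket config carrying a large vbmap.
demo() ->
    %% A large term containing map elements, so the pretty-printed output
    %% comfortably exceeds 50K characters.
    BucketConfig = [{map, [#{vbucket => N, chain => [a, b, c]}
                           || N <- lists:seq(0, 1023)]}],
    %% io_lib:format/3 with {chars_limit, 50000} truncates the rendered term.
    %% Per this ticket, a truncation landing on a map element is what raised
    %% the badarg on affected builds; the fix set the default to -1 (no limit).
    Line = io_lib:format("Rebalancing bucket ~p with config ~p",
                         ["bucket-1", BucketConfig],
                         [{chars_limit, 50000}]),
    io:format("~ts~n", [Line]).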

Fixed

Details

Assignee

Reporter

Is this a Regression?

No

Triage

Untriaged

Operating System

CentOS 64-bit

Story Points

Priority


Created June 24, 2021 at 5:33 AM
Updated August 30, 2021 at 9:28 PM
Resolved July 19, 2021 at 7:34 PM