Eventing consumes a large amount of CPU with no functions.

Description

While running some tests on a 3 x r5.2xlarge AWS cluster, I noticed that a set of symmetric servers (Data, Query, Index, Eventing) with default memory quotas shows excessive CPU utilization on two of the three nodes when completely idle. I am running Enterprise Edition 7.0.2 build 6683.

Each node is a r5.2xlarge: 64 GiB of memory, 8 vCPUs, 64-bit platform

I created 20 buckets (default scope and default collection), loaded 50K small documents into each bucket, and created a primary index on each.

No Eventing Function has ever been configured on any of the nodes (nor does one exist in the Eventing UI). On two (2) of the nodes, the "eventing-producer" and "beam.smp" processes seem to interact adversely when they shouldn't. The first node (10.21.24.37) looks correct, but the next two nodes (10.21.25.181 and 10.21.26.101) appear to burn far too much CPU doing absolutely nothing: both are above 84% CPU utilization, while the first node is under 7%.

There is no issue if I drop Eventing as a service from every node and re-run the exact same test (Data, Query, Index): 20 buckets (default scope and default collection), 50K small documents loaded into each bucket, and a primary index on each. Every node then looks the same in the idle state, all measuring under 10% CPU utilization (9.3%, 7.8%, and 7.6%); see picture "compare_with_eventing_and_without_eventing.JPG".

ec2-user@ec2-15-223-36-143.ca-central-1.compute.amazonaws.com

private IP 10.21.24.37

ec2-user@ec2-3-99-49-144.ca-central-1.compute.amazonaws.com

private IP 10.21.25.181

ec2-user@ec2-15-223-36-53.ca-central-1.compute.amazonaws.com

private IP 10.21.26.101

I have attached CPU utilization pictures from both AWS and the Couchbase UI.

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

Activity

Sujay Gad October 7, 2021 at 5:17 AM
Edited

Verified the fix on 7.0.2-6700, 7.1.0-1429.

STEPS

Create a cluster having 3 nodes with kv, index, query and eventing services colocated on each node.
Create 15 buckets each having 100MB RAM quota.
Delete and recreate all 15 buckets in quick succession.
Check CPU utilisation on all 3 nodes.
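The steps above can be sketched as a small script that builds the sequence of REST calls for the create, delete, and recreate cycle. This is only an illustration under stated assumptions: it uses the Couchbase bucket endpoint `/pools/default/buckets`, the bucket names and the `build_repro_plan` helper are hypothetical, "delete and recreate" is read as "delete all 15, then recreate all 15", and the `ramQuotaMB` form field should be checked against the REST documentation for the build under test (newer releases document it as `ramQuota`). The script only prints the plan; actually sending the requests would additionally need administrator credentials.

```python
from urllib.parse import urlencode


def build_repro_plan(base_url, bucket_count=15, ram_quota_mb=100, prefix="bucket"):
    """Return the ordered REST calls: create all, delete all, recreate all.

    Each entry is (method, url, form_body). Nothing is sent over the network.
    """
    buckets_url = base_url.rstrip("/") + "/pools/default/buckets"
    # Hypothetical bucket names: bucket01 .. bucket15
    names = [f"{prefix}{i:02d}" for i in range(1, bucket_count + 1)]

    def create(name):
        # Form fields for bucket creation; ramQuotaMB is an assumption
        # to verify against the docs for the target version.
        body = urlencode(
            {"name": name, "bucketType": "couchbase", "ramQuotaMB": ram_quota_mb}
        )
        return ("POST", buckets_url, body)

    def delete(name):
        return ("DELETE", f"{buckets_url}/{name}", None)

    plan = [create(n) for n in names]   # step 2: create the buckets
    plan += [delete(n) for n in names]  # step 3a: delete them all
    plan += [create(n) for n in names]  # step 3b: recreate them
    return plan


if __name__ == "__main__":
    for method, url, body in build_repro_plan("http://127.0.0.1:8091"):
        print(method, url, body or "")
```

Issuing the calls back to back, with no pause between the delete and the recreate, corresponds to the "quick succession" condition in the steps.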

CASE A
Reproduced the issue on 7.0.2-6698.
CPU utilisation remains high on all 3 nodes after deletion and recreation of buckets.

CASE B
Verified the fix on 7.0.2-6700.
CPU utilisation was high only for a brief moment during bucket creation.

CASE C
Verified the fix on 7.1.0-1429.
CPU utilisation was high only for a brief moment during bucket creation.

CB robot October 5, 2021 at 4:36 PM

Build couchbase-server-7.0.2-6700 contains eventing commit 3c24dc9 with commit message: Fix goroutine leak due to bucket delete and recreate

Jon Strabala October 4, 2021 at 1:38 PM
Edited

Jeelan and Rita, the problem still occurs if I add a 65-second delay between the CRUD operations (I showed this in my prior tests above), and also when just adding buckets with no deletions.

So it is not dependent on “quick” CLI commands (although that does lower the threshold by a few buckets). Also, once the high beam and eventing-producer CPU issue occurs, there seems to be no way to unwind it other than removing the Eventing Service nodes and rebalancing, or stopping and restarting every node (or deleting all my buckets; I believe I had to drop them all to stop the HTTP traffic and lower the CPU).

Maybe there are other workarounds or avoidance techniques, such as creating the cluster’s KV nodes first, then adding your buckets, and finally adding the Eventing service (not sure, as I haven’t tested this).

So 6.5.1 through 7.0.1 works with 30 buckets, but if you use Eventing in 7.0.2, at 13 buckets your system goes into a busy spin no matter how careful you are. I also envision that customers with 15+ buckets who use Eventing will consistently run into this when they configure their test clusters.

CB robot October 4, 2021 at 7:59 AM

Build couchbase-server-7.1.0-1411 contains eventing commit 6fd5212 with commit message: Fix goroutine leak due to bucket delete and recreate

Jeelan Poola October 4, 2021 at 6:45 AM

Agree. Marking it for a release note in 7.0.2. Also lowering the priority, as it is not a common 80% use case, and there is an easy workaround (delete the bucket and wait a minute or so).

Fixed

Details
Assignee
Sujay Gad
Reporter
Jon Strabala(Deactivated)
Is this a Regression?
Yes
Triage
Untriaged
Story Points
1
Priority
Critical

Created September 30, 2021 at 9:20 PM

Updated October 7, 2021 at 5:18 AM

Resolved October 5, 2021 at 4:07 PM
