No successful backup on 40-shard cluster

Description

We have a 40-shard cluster (120 nodes) running MongoDB Community Edition v4.2.13.

pbm status always reports "ERROR with ERROR: lost agent, last heartbeat: 1656631197" on one or another pbm-agent.

We have never had a successful backup. This cluster has a 1 TB dataset; I have another with 40 TB.

$ pbm status
Cluster:
========
config:
  - config/db11247.or1.omniture.com:27019: pbm-agent v1.8.0 OK
  - config/db40590.or1.omniture.com:27019: pbm-agent v1.8.0 OK
  - config/db40624.or1.omniture.com:27019: pbm-agent v1.8.0 OK
sh_0:
  - sh_0/db11242.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_0/db40585.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_0/db40619.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627488
sh_1:
  - sh_1/db11243.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_1/db40586.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_1/db40620.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_2:
  - sh_2/db11244.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_2/db40587.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_2/db40621.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_3:
  - sh_3/db11245.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_3/db40588.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_3/db40622.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_4:
  - sh_4/db11246.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_4/db40589.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_4/db40623.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_5:
  - sh_5/db11247.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627484
  - sh_5/db40590.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_5/db40624.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_6:
  - sh_6/db11248.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627141
  - sh_6/db40591.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_6/db40625.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_7:
  - sh_7/db11249.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627494
  - sh_7/db40592.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_7/db40626.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_8:
  - sh_8/db11250.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_8/db40593.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_8/db40627.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626943
sh_9:
  - sh_9/db11251.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_9/db40594.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_9/db40628.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_10:
  - sh_10/db11252.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627479
  - sh_10/db40595.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_10/db40629.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_11:
  - sh_11/db11253.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_11/db40596.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_11/db40630.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_12:
  - sh_12/db11254.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_12/db40597.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_12/db40631.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_13:
  - sh_13/db11256.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_13/db40598.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_13/db40632.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_14:
  - sh_14/db11257.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_14/db40599.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_14/db40633.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_15:
  - sh_15/db11259.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_15/db40600.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_15/db40634.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_16:
  - sh_16/db11262.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_16/db40601.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_16/db40635.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_17:
  - sh_17/db11263.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_17/db40602.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_17/db40636.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_18:
  - sh_18/db31255.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_18/db40603.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_18/db40637.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_19:
  - sh_19/db31256.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_19/db40604.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_19/db40638.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_20:
  - sh_20/db31257.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626911
  - sh_20/db40605.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_20/db40639.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_21:
  - sh_21/db31258.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627461
  - sh_21/db40606.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_21/db40640.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_22:
  - sh_22/db31260.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626909
  - sh_22/db40607.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_22/db40642.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_23:
  - sh_23/db31261.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_23/db40608.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_23/db40643.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_24:
  - sh_24/db31262.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_24/db40609.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_24/db40644.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_25:
  - sh_25/db31263.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_25/db40610.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_25/db40646.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_26:
  - sh_26/db31264.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_26/db40611.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_26/db40647.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_27:
  - sh_27/db31267.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627427
  - sh_27/db40612.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_27/db40648.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_28:
  - sh_28/db31268.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_28/db40613.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_28/db40649.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_29:
  - sh_29/db31205.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_29/db40614.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_29/db40650.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_30:
  - sh_30/db31206.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_30/db40615.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_30/db40651.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_31:
  - sh_31/db31207.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_31/db40616.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_31/db40652.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626908
sh_32:
  - sh_32/db31208.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_32/db40617.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_32/db50436.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627470
sh_33:
  - sh_33/db31211.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627481
  - sh_33/db40618.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_33/db50437.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_34:
  - sh_34/db31212.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656627467
  - sh_34/db50406.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_34/db50438.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_35:
  - sh_35/db31213.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_35/db50407.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_35/db50439.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_36:
  - sh_36/db31232.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_36/db50408.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_36/db50440.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626675
sh_37:
  - sh_37/db31235.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_37/db50409.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_37/db50441.or1.omniture.com:27018: pbm-agent v1.8.0 OK
sh_38:
  - sh_38/db31237.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_38/db50410.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_38/db50442.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626878
sh_39:
  - sh_39/db31239.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_39/db50411.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_39/db50443.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626909

PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
FS  /nfs/mongo/backups/NLS5
  Snapshots:
    2022-06-30T23:05:27Z 0.00B <logical> [ERROR: check cluster for dump done: convergeCluster: backup on shard sh_36 failed with: %!s(<nil>)] [2022-06-30T23:19:41Z]
    2022-06-30T21:51:36Z 0.00B <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard sh_36, last beat ts: 1656626677] [2022-06-30T22:05:08Z]
    2022-06-30T21:27:35Z 0.00B <logical> [ERROR: check cluster for dump done: convergeCluster: backup on shard sh_34 failed with: %!s(<nil>)] [2022-06-30T21:41:08Z]
    2022-06-30T20:44:23Z 0.00B <logical> [ERROR: check cluster for dump done: convergeCluster: backup on shard sh_1 failed with: %!s(<nil>)] [2022-06-30T21:09:43Z]
    2022-06-30T20:07:51Z 0.00B <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard sh_4, last beat ts: 1656620402] [2022-06-30T20:20:33Z]
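
The heartbeat values in the output above are hard to compare with the snapshot timestamps by eye. Below is a minimal Python sketch that pulls the "lost agent" entries out of the status text and prints each heartbeat as a UTC time. It assumes the heartbeat numbers are Unix epoch seconds, which the report does not state explicitly. The names STATUS_EXCERPT, LOST_AGENT, to_utc and lost_agents are made up for this illustration and are not part of PBM.

```python
import re
from datetime import datetime, timezone

# Two entries copied from the `pbm status` output in this report.
STATUS_EXCERPT = """
  - sh_4/db40623.or1.omniture.com:27018: pbm-agent v1.8.0 OK
  - sh_36/db50440.or1.omniture.com:27018: pbm-agent  FAILED status:
      > ERROR with ERROR: lost agent, last heartbeat: 1656626675
"""

# Matches a FAILED agent entry and captures the node name and heartbeat.
# \s+ also spans newlines, so this works on wrapped or flattened output.
LOST_AGENT = re.compile(
    r"(?P<node>\S+/\S+:\d+): pbm-agent\s+FAILED status:\s+"
    r"> ERROR with ERROR: lost agent, last heartbeat: (?P<ts>\d+)"
)


def to_utc(epoch_seconds: int) -> str:
    """Format a heartbeat as UTC, ASSUMING it is Unix epoch seconds."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def lost_agents(status_text: str) -> list:
    """Return (node, last heartbeat in UTC) for every lost agent found."""
    return [
        (match["node"], to_utc(int(match["ts"])))
        for match in LOST_AGENT.finditer(status_text)
    ]


if __name__ == "__main__":
    for node, last_seen in lost_agents(STATUS_EXCERPT):
        print(node, last_seen)
```

Under that assumption, the "last beat ts: 1656626677" reported for sh_36 converts to 2022-06-30T22:04:37Z, about half a minute before that snapshot's error time of 2022-06-30T22:05:08Z. The sh_4 value 1656620402 converts to 2022-06-30T20:20:02Z, against an error time of 2022-06-30T20:20:33Z.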

Environment

test

Activity

Aaditya Dubey December 10, 2023 at 8:36 AM

Hi,

Closing the report, as there has been no activity for a long time.

Aaditya Dubey January 27, 2023 at 2:12 PM

Hi,

Thank you for the report.
Please let me know if the issue is still there.

Rama Mekala July 20, 2022 at 9:03 PM

Any help would be much appreciated.

Rama Mekala July 15, 2022 at 2:37 PM

Any help on this issue? I posted the error message above as requested.

Rama Mekala July 7, 2022 at 3:06 PM

Here is the other output:

$ pbm logs -sD -t0 -x
Error: get logs: get list from mongo: (OperationFailed) Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.
Incomplete

Created June 30, 2022 at 11:49 PM
Updated February 4, 2025 at 11:16 AM
Resolved December 10, 2023 at 8:36 AM