PBM backup is not starting for a 40-shard cluster.
Description
Environment
None
Activity
Rama Mekala 4 days ago
I collected the collStats manually.
Rama Mekala 4 days ago
# pbm status
Cluster:
========
nls_lon_rs10:
- db10266.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20345.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20432.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs21:
- db20301.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10418.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20409.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs34:
- db20116.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10431.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20423.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs_38:
- db20107.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10436.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20429.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs7:
- db10263.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20342.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10444.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs4:
- db10260.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20312.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10441.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs26:
- db20306.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10423.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20415.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs25:
- db20305.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10422.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20414.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs_39:
- db20108.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10437.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20430.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs2:
- db20310.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10439.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10653.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs13:
- db20348.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20435.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20343.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs27:
- db20307.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10424.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20416.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs6:
- db10262.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20341.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10443.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs31:
- db20113.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10428.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20420.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs28:
- db20308.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10425.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20417.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs38:
- db20120.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10435.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20428.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs37:
- db20119.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10434.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20427.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs36:
- db20118.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10433.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20426.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs30:
- db20112.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10427.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20419.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs16:
- db20171.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10413.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20404.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs35:
- db20117.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10432.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20424.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs15:
- db10270.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20350.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20437.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs23:
- db20303.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10420.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20411.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs8:
- db10264.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10445.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10189.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs3:
- db20339.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20311.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10440.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs19:
- db20174.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10416.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20407.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs12:
- db10267.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20347.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20434.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs1:
- db20109.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20309.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10438.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs33:
- db20115.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10430.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db21279.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs5:
- db10261.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20340.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10442.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs29:
- db20111.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10426.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20418.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs32:
- db20114.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10429.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20421.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs14:
- db10269.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20349.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20436.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs18:
- db20173.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10415.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20406.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs17:
- db20172.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10414.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20405.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs20:
- db20175.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db10417.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20408.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs24:
- db20304.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10421.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20412.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_cfg:
- db20339.gbr2.omniture.com:27019 [S]: pbm-agent [v2.7.0] OK
- db10441.gbr2.omniture.com:27019 [P]: pbm-agent [v2.7.0] OK
- db10442.gbr2.omniture.com:27019 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs9:
- db10265.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20344.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20431.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs22:
- db20302.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db10419.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20410.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs11:
- db20110.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
- db20346.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
- db20433.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
PITR incremental backup:
========================
Status [OFF]
Currently running:
==================
(none)
Backups:
========
S3 us-east-1 https://gbr2-fs3-3.io.adobe.net/lon-nls-backup/LON1
Snapshots:
2025-04-01T19:40:48Z 0.00B <logical> [!canceled: 2025-04-01T23:10:14Z]
2025-04-01T19:34:03Z 0.00B <logical> [!canceled: 2025-04-01T19:35:37Z]
2025-04-01T19:21:09Z 0.00B <logical> [!canceled: 2025-04-01T19:26:20Z]
2025-04-01T18:44:41Z 0.00B <logical> [!canceled: 2025-04-01T18:56:23Z]
2025-03-31T17:25:12Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: 5h33m20s: reached converge timeout] [2025-03-31T22:58:34Z]
2025-03-31T05:24:40Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: 2h46m40s: reached converge timeout] [2025-03-31T08:11:22Z]
Rama Mekala 4 days ago
Here is the config:
# pbm config
storage:
type: s3
s3:
region: us-east-1
endpointUrl: https://gbr2-fs3-3.io.adobe.net
forcePathStyle: true
bucket: lon-nls-backup
prefix: LON1
credentials:
access-key-id: '***'
secret-access-key: '***'
maxUploadParts: 10000
storageClass: STANDARD
insecureSkipTLSVerify: false
retryer:
numMaxRetries: 2
minRetryDelay: 10s
maxRetryDelay: 10m0s
pitr:
enabled: false
compression: s2
backup:
oplogSpanMin: 0
timeouts:
startingStatus: 20000
compression: s2
restore: {}
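A note on the numbers: the two converge timeouts reported in the `pbm status` output (5h33m20s and 2h46m40s) can be converted to seconds and compared with `backup.timeouts.startingStatus` from this config. The sketch below does only that arithmetic. The helper name `duration_to_seconds` is made up for illustration, and the reading that the reported timeout comes from `startingStatus` is an assumption, not something confirmed in this ticket.

```python
import re


def duration_to_seconds(text):
    """Convert a Go-style duration such as '5h33m20s' to whole seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", text)
    return sum(int(number) * units[unit] for number, unit in parts)


# Value of backup.timeouts.startingStatus in the config above (seconds).
STARTING_STATUS = 20000

# Converge timeouts copied from the two ERROR snapshots in `pbm status`.
for reported in ("5h33m20s", "2h46m40s"):
    seconds = duration_to_seconds(reported)
    print(reported, seconds, seconds == STARTING_STATUS)
```

The 2025-03-31T17:25:12Z attempt comes out at exactly 20000 seconds, the same as the configured `startingStatus`. The earlier attempt comes out at 10000 seconds, which may mean it ran with a different value; that is a guess from the numbers alone.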
Details
Assignee
Unassigned
Reporter
Rama Mekala
Needs QA
Yes
Priority
Medium
Created 4 days ago
Updated 4 days ago
Hi, we have a 40-shard cluster with 72 TB of data. PBM used to work, but backups no longer start and fail with the error below. Help needed.
2025-04-01T19:34:04Z I [nls_lon_rs15/db20437.gbr2.omniture.com:27018] got command backup [name: 2025-04-01T19:34:03Z, compression: s2 (level: default)] <ts: 1743536043>, opid: 67ec3fab154091aeb13b0b33
2025-04-01T19:34:04Z I [nls_lon_rs15/db20437.gbr2.omniture.com:27018] got epoch {1743436516 1201}
2025-04-01T19:34:04Z I [nls_lon_rs22/db10419.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs8/db10264.gbr2.omniture.com:27018] got command backup [name: 2025-04-01T19:34:03Z, compression: s2 (level: default)] <ts: 1743536043>, opid: 67ec3fab154091aeb13b0b33
2025-04-01T19:34:04Z I [nls_lon_rs11/db20346.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs8/db10264.gbr2.omniture.com:27018] got epoch {1743436516 1201}
2025-04-01T19:34:04Z I [nls_lon_rs10/db10266.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs38/db20428.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs35/db10432.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs19/db20174.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs34/db20423.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs2/db10653.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs4/db10441.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs12/db20434.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs33/db21279.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs7/db10444.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs18/db20173.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs20/db20408.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs1/db20109.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs_38/db20429.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs15/db10270.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:35:00Z I [nls_lon_rs23/db20411.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] mark RS as error `get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100019028": (CursorNotFound) unable to open cursor at URI statistics:table:collection-61111-8049180002435899248. reason: Too many open files`: <nil>
2025-04-01T19:35:00Z E [nls_lon_rs23/db20411.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100019028": (CursorNotFound) unable to open cursor at URI statistics:table:collection-61111-8049180002435899248. reason: Too many open files
2025-04-01T19:35:08Z E [nls_lon_rs23/db20411.gbr2.omniture.com:27018] [agentCheckup] check node connection: connection(localhost:27018[-45]) incomplete read of message header: read tcp 127.0.0.1:39798->127.0.0.1:27018: read: connection reset by peer
2025-04-01T19:35:09Z E [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [agentCheckup] check node connection: connection pool for localhost:27018 was cleared because another operation failed with: connection(localhost:27018[-401]) socket was unexpectedly closed: EOF: connection(localhost:27018[-401]) socket was unexpectedly closed: EOF
2025-04-01T19:35:09Z I [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] mark RS as error `get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u2902536_r1011": connection(localhost:27018[-396]) socket was unexpectedly closed: EOF`: <nil>
2025-04-01T19:35:09Z E [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u2902536_r1011": connection(localhost:27018[-396]) socket was unexpectedly closed: EOF
2025-04-01T19:35:20Z I [nls_lon_rs17/db20172.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] mark RS as error `get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100010384_r110_m12204": (CursorNotFound) unable to open cursor at URI statistics:table:collection-60816-1338904761831786530. reason: Too many open files`: <nil>
2025-04-01T19:35:20Z E [nls_lon_rs17/db20172.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100010384_r110_m12204": (CursorNotFound) unable to open cursor at URI statistics:table:collection-60816-1338904761831786530. reason: Too many open files
2025-04-01T19:35:31Z E [nls_lon_rs17/db20172.gbr2.omniture.com:27018] [agentCheckup] check node connection: connection(localhost:27018[-123]) incomplete read of message header: read tcp 127.0.0.1:39622->127.0.0.1:27018: read: connection reset by peer
2025-04-01T19:35:36Z I [nls_lon_rs2/db10653.gbr2.omniture.com:27018] got command cancelBackup <ts: 1743536136>, opid: 67ec400891f07becd0f62937
2025-04-01T19:35:36Z I [nls_lon_rs2/db10653.gbr2.omniture.com:27018] got epoch {1743436516 1201}
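With 40 shards the pasted log is hard to scan, so here is a rough sketch for grouping the `E`-level entries by replica set. It assumes each entry starts with a timestamp, a one-letter level, and `[replset/host:port]`, as in the log above, and it also works when entries are run together on one line. The names `split_entries` and `errors_by_replset` are made up for this illustration and are not part of PBM.

```python
import re
from collections import defaultdict

# Header of one log entry: timestamp, level, [replset/host:port].
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) ([A-Z]) \[([^/\]\s]+)/([^\]\s]+)\]"
)


def split_entries(text):
    """Yield (timestamp, level, replset, host, message) for each entry."""
    matches = list(ENTRY.finditer(text))
    for i, match in enumerate(matches):
        # The message runs until the next entry header (or the end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[match.end():end].split())
        yield (*match.groups(), message)


def errors_by_replset(text):
    """Map replica set name to the list of its 'E'-level messages."""
    grouped = defaultdict(list)
    for _ts, level, replset, _host, message in split_entries(text):
        if level == "E":
            grouped[replset].append(message)
    return dict(grouped)


# Three entries copied from the log above, deliberately left run together.
SAMPLE = (
    '2025-04-01T19:34:04Z I [nls_lon_rs22/db10419.gbr2.omniture.com:27018] '
    '[backup/2025-04-01T19:34:03Z] backup started '
    '2025-04-01T19:35:00Z E [nls_lon_rs23/db20411.gbr2.omniture.com:27018] '
    '[backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats '
    '"config.cache.chunks.lookupservice.lookups_u100019028": (CursorNotFound) '
    'unable to open cursor at URI '
    'statistics:table:collection-61111-8049180002435899248. '
    'reason: Too many open files '
    '2025-04-01T19:35:08Z E [nls_lon_rs23/db20411.gbr2.omniture.com:27018] '
    '[agentCheckup] check node connection: connection(localhost:27018[-45]) '
    'incomplete read of message header: read tcp '
    '127.0.0.1:39798->127.0.0.1:27018: read: connection reset by peer'
)

for replset, messages in errors_by_replset(SAMPLE).items():
    print(replset, len(messages))
    for message in messages:
        print("  ", message)
```

On the full log this would list nls_lon_rs23, nls_lon_rs8, and nls_lon_rs17 as the replica sets reporting errors, all during the `collStats` step on `config.cache.chunks.*` collections.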