PBM backup is not starting for 40 shard cluster.

Description

Hi, we have a 40-shard cluster with 72 TB of data. PBM used to work before, but backups now fail to start with the error below. Help needed.


2025-04-01T19:34:04Z I [nls_lon_rs15/db20437.gbr2.omniture.com:27018] got command backup [name: 2025-04-01T19:34:03Z, compression: s2 (level: default)] <ts: 1743536043>, opid: 67ec3fab154091aeb13b0b33
2025-04-01T19:34:04Z I [nls_lon_rs15/db20437.gbr2.omniture.com:27018] got epoch {1743436516 1201}
2025-04-01T19:34:04Z I [nls_lon_rs22/db10419.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs8/db10264.gbr2.omniture.com:27018] got command backup [name: 2025-04-01T19:34:03Z, compression: s2 (level: default)] <ts: 1743536043>, opid: 67ec3fab154091aeb13b0b33
2025-04-01T19:34:04Z I [nls_lon_rs11/db20346.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs8/db10264.gbr2.omniture.com:27018] got epoch {1743436516 1201}
2025-04-01T19:34:04Z I [nls_lon_rs10/db10266.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs38/db20428.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs35/db10432.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs19/db20174.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs34/db20423.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:04Z I [nls_lon_rs2/db10653.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs4/db10441.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs12/db20434.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs33/db21279.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs7/db10444.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs18/db20173.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs20/db20408.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs1/db20109.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs_38/db20429.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:34:05Z I [nls_lon_rs15/db10270.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup started
2025-04-01T19:35:00Z I [nls_lon_rs23/db20411.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] mark RS as error `get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100019028": (CursorNotFound) unable to open cursor at URI statistics:table:collection-61111-8049180002435899248. reason: Too many open files`: <nil>
2025-04-01T19:35:00Z E [nls_lon_rs23/db20411.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100019028": (CursorNotFound) unable to open cursor at URI statistics:table:collection-61111-8049180002435899248. reason: Too many open files
2025-04-01T19:35:08Z E [nls_lon_rs23/db20411.gbr2.omniture.com:27018] [agentCheckup] check node connection: connection(localhost:27018[-45]) incomplete read of message header: read tcp 127.0.0.1:39798->127.0.0.1:27018: read: connection reset by peer
2025-04-01T19:35:09Z E [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [agentCheckup] check node connection: connection pool for localhost:27018 was cleared because another operation failed with: connection(localhost:27018[-401]) socket was unexpectedly closed: EOF: connection(localhost:27018[-401]) socket was unexpectedly closed: EOF
2025-04-01T19:35:09Z I [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] mark RS as error `get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u2902536_r1011": connection(localhost:27018[-396]) socket was unexpectedly closed: EOF`: <nil>
2025-04-01T19:35:09Z E [nls_lon_rs8/db10189.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u2902536_r1011": connection(localhost:27018[-396]) socket was unexpectedly closed: EOF
2025-04-01T19:35:20Z I [nls_lon_rs17/db20172.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] mark RS as error `get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100010384_r110_m12204": (CursorNotFound) unable to open cursor at URI statistics:table:collection-60816-1338904761831786530. reason: Too many open files`: <nil>
2025-04-01T19:35:20Z E [nls_lon_rs17/db20172.gbr2.omniture.com:27018] [backup/2025-04-01T19:34:03Z] backup: get namespaces size: collStats "config.cache.chunks.lookupservice.lookups_u100010384_r110_m12204": (CursorNotFound) unable to open cursor at URI statistics:table:collection-60816-1338904761831786530. reason: Too many open files
2025-04-01T19:35:31Z E [nls_lon_rs17/db20172.gbr2.omniture.com:27018] [agentCheckup] check node connection: connection(localhost:27018[-123]) incomplete read of message header: read tcp 127.0.0.1:39622->127.0.0.1:27018: read: connection reset by peer
2025-04-01T19:35:36Z I [nls_lon_rs2/db10653.gbr2.omniture.com:27018] got command cancelBackup <ts: 1743536136>, opid: 67ec400891f07becd0f62937
2025-04-01T19:35:36Z I [nls_lon_rs2/db10653.gbr2.omniture.com:27018] got epoch {1743436516 1201}
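
For triage, the error entries in the log above can be pulled out and grouped by replica set. The following is a minimal sketch, not part of PBM: it assumes every entry starts with `<timestamp> <severity> [<replset>/<host:port>]` as in the pasted log, and the function names `split_entries` and `errors_by_replset` are made up for illustration. It splits on that entry prefix rather than on newlines, so it also works when the log has been pasted as one run of text.

```python
import re
from collections import defaultdict

# Assumed entry prefix, taken from the pasted log:
#   "<timestamp> <severity letter> [<replset>/<host:port>] "
ENTRY_START = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+(?P<sev>[A-Z])\s+"
    r"\[(?P<rs>[^/\]\s]+)/(?P<node>[^\]\s]+)\]\s+"
)


def split_entries(text):
    """Yield (ts, severity, replset, node, message) for each entry in text."""
    matches = list(ENTRY_START.finditer(text))
    for i, m in enumerate(matches):
        # An entry's message runs until the next entry prefix (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[m.end():end].split())
        yield m["ts"], m["sev"], m["rs"], m["node"], message


def errors_by_replset(text):
    """Group severity-E entries by replica set name."""
    grouped = defaultdict(list)
    for ts, sev, rs, node, message in split_entries(text):
        if sev == "E":
            grouped[rs].append((ts, node, message))
    return dict(grouped)


if __name__ == "__main__":
    # Two entries copied from the log above, joined into one string.
    sample = (
        "2025-04-01T19:34:05Z I [nls_lon_rs4/db10441.gbr2.omniture.com:27018] "
        "[backup/2025-04-01T19:34:03Z] backup started "
        "2025-04-01T19:35:31Z E [nls_lon_rs17/db20172.gbr2.omniture.com:27018] "
        "[agentCheckup] check node connection: connection(localhost:27018[-123]) "
        "incomplete read of message header: read tcp 127.0.0.1:39622->127.0.0.1:27018: "
        "read: connection reset by peer"
    )
    for rs, items in errors_by_replset(sample).items():
        for ts, node, message in items:
            print(rs, node, ts, message)
```

Run against the full log, this should list only the nodes on nls_lon_rs23, nls_lon_rs8 and nls_lon_rs17, which are the ones reporting "Too many open files" or dropped connections.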

Environment

None

Details

Needs QA

Yes

Created 20 hours ago
Updated 20 hours ago

Activity

Rama Mekala, 20 hours ago

I collected the collStats manually.

Rama Mekala, 20 hours ago

# pbm status
Cluster:
========
nls_lon_rs10:
  - db10266.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20345.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20432.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs21:
  - db20301.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10418.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20409.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs34:
  - db20116.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10431.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20423.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs_38:
  - db20107.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10436.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20429.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs7:
  - db10263.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20342.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10444.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs4:
  - db10260.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20312.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10441.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs26:
  - db20306.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10423.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20415.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs25:
  - db20305.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10422.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20414.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs_39:
  - db20108.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10437.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20430.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs2:
  - db20310.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10439.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10653.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs13:
  - db20348.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20435.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20343.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs27:
  - db20307.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10424.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20416.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs6:
  - db10262.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20341.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10443.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs31:
  - db20113.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10428.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20420.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs28:
  - db20308.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10425.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20417.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs38:
  - db20120.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10435.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20428.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs37:
  - db20119.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10434.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20427.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs36:
  - db20118.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10433.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20426.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs30:
  - db20112.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10427.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20419.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs16:
  - db20171.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10413.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20404.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs35:
  - db20117.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10432.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20424.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs15:
  - db10270.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20350.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20437.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs23:
  - db20303.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10420.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20411.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs8:
  - db10264.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10445.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10189.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs3:
  - db20339.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20311.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10440.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs19:
  - db20174.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10416.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20407.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs12:
  - db10267.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20347.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20434.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs1:
  - db20109.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20309.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10438.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs33:
  - db20115.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10430.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db21279.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs5:
  - db10261.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20340.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10442.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs29:
  - db20111.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10426.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20418.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs32:
  - db20114.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10429.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20421.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs14:
  - db10269.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20349.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20436.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs18:
  - db20173.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10415.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20406.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs17:
  - db20172.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10414.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20405.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs20:
  - db20175.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db10417.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20408.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs24:
  - db20304.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10421.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20412.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_cfg:
  - db20339.gbr2.omniture.com:27019 [S]: pbm-agent [v2.7.0] OK
  - db10441.gbr2.omniture.com:27019 [P]: pbm-agent [v2.7.0] OK
  - db10442.gbr2.omniture.com:27019 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs9:
  - db10265.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20344.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20431.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs22:
  - db20302.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db10419.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20410.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
nls_lon_rs11:
  - db20110.gbr2.omniture.com:27018 [P]: pbm-agent [v2.7.0] OK
  - db20346.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK
  - db20433.gbr2.omniture.com:27018 [S]: pbm-agent [v2.7.0] OK

PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 https://gbr2-fs3-3.io.adobe.net/lon-nls-backup/LON1
  Snapshots:
    2025-04-01T19:40:48Z 0.00B <logical> [!canceled: 2025-04-01T23:10:14Z]
    2025-04-01T19:34:03Z 0.00B <logical> [!canceled: 2025-04-01T19:35:37Z]
    2025-04-01T19:21:09Z 0.00B <logical> [!canceled: 2025-04-01T19:26:20Z]
    2025-04-01T18:44:41Z 0.00B <logical> [!canceled: 2025-04-01T18:56:23Z]
    2025-03-31T17:25:12Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: 5h33m20s: reached converge timeout] [2025-03-31T22:58:34Z]
    2025-03-31T05:24:40Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: 2h46m40s: reached converge timeout] [2025-03-31T08:11:22Z]

Rama Mekala, 20 hours ago

Here is the config:


# pbm config
storage:
  type: s3
  s3:
    region: us-east-1
    endpointUrl: https://gbr2-fs3-3.io.adobe.net
    forcePathStyle: true
    bucket: lon-nls-backup
    prefix: LON1
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    maxUploadParts: 10000
    storageClass: STANDARD
    insecureSkipTLSVerify: false
    retryer:
      numMaxRetries: 2
      minRetryDelay: 10s
      maxRetryDelay: 10m0s
pitr:
  enabled: false
  compression: s2
backup:
  oplogSpanMin: 0
  timeouts:
    startingStatus: 20000
  compression: s2
restore: {}
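
As a quick arithmetic check, assuming `backup.timeouts.startingStatus` is expressed in seconds (as the PBM documentation describes it), 20000 seconds is 5h33m20s. That is the same duration reported in the converge timeout error of the 2025-03-31T17:25:12Z snapshot in the status output. Whether the two are actually linked is an assumption, not something the logs confirm. The helper name `format_go_duration` below is made up for illustration.

```python
def format_go_duration(seconds):
    """Render whole seconds in the h/m/s style seen in the status output."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return "".join(parts)


if __name__ == "__main__":
    # 20000 is the startingStatus value from the config above.
    print(format_go_duration(20000))
    # 10000 does not appear in the config; it is included only because it
    # renders as the 2h46m40s seen in the older 2025-03-31T05:24:40Z snapshot.
    print(format_go_duration(10000))
```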