Fix duplicate peer in createNewRegionPeer; harden reconstruct IT #17710
Merged
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #17710 +/- ##
============================================
- Coverage 40.39% 40.38% -0.01%
+ Complexity 2575 2574 -1
============================================
Files 5179 5179
Lines 349659 349660 +1
Branches 44688 44689 +1
============================================
- Hits 141251 141225 -26
- Misses 208408 208435 +27



Description
Fix duplicate peer entry in `createNewRegionPeer`

`AddRegionPeerProcedure.CREATE_NEW_REGION_PEER` runs two steps back-to-back. `createNewRegionPeer` then reads the partition table back (which already contains `targetDataNode`) and unconditionally appends `destDataNode` again, so the `TCreatePeerReq` ships a duplicated peer, for example `[Peer{nodeId=2}, Peer{nodeId=3}, Peer{nodeId=3}]` when adding DN_3 to region 3.

The duplicate is silently swallowed downstream (`IoTConsensus.createLocalPeer` wraps the list in a `TreeSet`; `IoTConsensusV2PeerManager` uses a `Set` and also logs a misleading `DUPLICATE_PEERS_IGNORED` WARN), so it has not produced a visible failure, but it (a) pollutes V2 logs with a WARN on every add-peer, (b) makes per-region migration logs hard to read, and (c) is fragile should anyone later use `peers.size()` for an invariant check.

The fix is defensive: in the IoT-consensus branch, append `destDataNode` only when it is not already present in `regionReplicaNodes`. Behavior is identical to before for callers that do not pre-insert the target.

Affects every caller of `AddRegionPeerProcedure`: `reconstruct region`, `migrate region`, `extend region`, and procedure restore.

Tighten `IoTDBRegionReconstructForIoTV1IT.normal1C3DTest` await

The post-`reconstruct` wait condition combined two checks, and both halves were weak:

- `dataDirToBeReconstructed.exists()` is effectively always true: `deleteTsFiles` only removes files whose name ends in `.tsfile`, never the containing directory.
- `getRegionStatusWithoutRunning(session).isEmpty()` is a negative check. When reconstruct rolled back, the reconstructed peer's row vanished from `show regions` briefly while the closed peer's status had not yet been refreshed from `Running` to `Unknown`, producing a ~3-second false-positive "everything is Running" window. The test then proceeded past the await, stopped the only live replica, and failed on the subsequent query loop with a misleading `Cannot execute query within 60s`.

Replaced with a positive assertion: the selected region must contain both peers and both rows must report `Running`. Done via a new `getRegionStatusMap(Session)` helper that returns `regionId -> dataNodeId -> status`. On timeout, the test now logs the actual region status map for the selected region, which is the data you want for diagnosing why reconstruct did not finish.

Also corrected the stale comment that claimed reconstruct happens "from the leader": the code actually targets the surviving follower; the original leader is the replica that gets stopped and then restarted.
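To illustrate why the negative check could pass during a rollback while the positive assertion cannot, here is a minimal, self-contained sketch. It is not the test code from this PR: the class and both method names are hypothetical, the nested map only mirrors the `regionId -> dataNodeId -> status` shape described above, and the node IDs are made up for the example.

```java
import java.util.Map;
import java.util.Set;

public class RegionStatusCheckSketch {

  /** Negative check (old style): true when no visible row is in a non-Running state. */
  static boolean noRowNotRunning(Map<Integer, Map<Integer, String>> statusMap) {
    return statusMap.values().stream()
        .flatMap(peers -> peers.values().stream())
        .allMatch("Running"::equals);
  }

  /** Positive check (new style): every expected peer row exists and reports Running. */
  static boolean allExpectedPeersRunning(
      Map<Integer, Map<Integer, String>> statusMap,
      int regionId,
      Set<Integer> expectedDataNodeIds) {
    Map<Integer, String> peers = statusMap.get(regionId);
    if (peers == null) {
      return false;
    }
    for (Integer dataNodeId : expectedDataNodeIds) {
      // A missing row yields null here, so an absent peer fails the check.
      if (!"Running".equals(peers.get(dataNodeId))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Assumed snapshot during a rollback: one peer's row has vanished and the
    // other peer's status is still the stale "Running".
    Map<Integer, Map<Integer, String>> duringRollback = Map.of(3, Map.of(2, "Running"));
    Set<Integer> expectedPeers = Set.of(2, 3);

    System.out.println(noRowNotRunning(duringRollback)); // true (false positive)
    System.out.println(allExpectedPeersRunning(duringRollback, 3, expectedPeers)); // false
  }
}
```

The positive form fails closed: a row that is missing is treated the same as a row that is not `Running`, so a transient gap in `show regions` cannot satisfy the await.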
This PR has:
Key changed/added classes (or packages if there are too many classes) in this PR
- `org.apache.iotdb.confignode.procedure.env.RegionMaintainHandler`
- `org.apache.iotdb.confignode.it.regionmigration.IoTDBRegionOperationReliabilityITFramework`
- `org.apache.iotdb.confignode.it.regionmigration.pass.commit.IoTDBRegionReconstructForIoTV1IT`
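
For readers who want the `createNewRegionPeer` change in concrete terms, the defensive append described above might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in `RegionMaintainHandler`: the class and method names are hypothetical, and plain integer node IDs stand in for the real DataNode location objects.

```java
import java.util.ArrayList;
import java.util.List;

public class PeerListSketch {

  /**
   * Hypothetical helper: builds the peer list for a create-peer request.
   * regionReplicaNodes is what was read back from the partition table;
   * destDataNode is the node being added.
   */
  static List<Integer> buildPeerNodeIds(List<Integer> regionReplicaNodes, int destDataNode) {
    List<Integer> peers = new ArrayList<>(regionReplicaNodes);
    // Append only when the target is not already present, so a pre-inserted
    // target does not show up twice in the request.
    if (!peers.contains(destDataNode)) {
      peers.add(destDataNode);
    }
    return peers;
  }

  public static void main(String[] args) {
    // Target already in the partition table: no duplicate is added.
    System.out.println(buildPeerNodeIds(List.of(2, 3), 3)); // [2, 3]
    // Target not pre-inserted: same result as an unconditional append.
    System.out.println(buildPeerNodeIds(List.of(2), 3)); // [2, 3]
  }
}
```

Both call paths produce the same peer list, which matches the claim that behavior is unchanged for callers that do not pre-insert the target.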