Skip to content

Fix duplicate peer in createNewRegionPeer; harden reconstruct IT#17710

Merged
JackieTien97 merged 1 commit into
masterfrom
improve-region-reconstruct-iotv1-test
May 19, 2026
Merged

Fix duplicate peer in createNewRegionPeer; harden reconstruct IT#17710
JackieTien97 merged 1 commit into
masterfrom
improve-region-reconstruct-iotv1-test

Conversation

@CRZbulabula
Copy link
Copy Markdown
Contributor

Description

Fix duplicate peer entry in createNewRegionPeer

AddRegionPeerProcedure.CREATE_NEW_REGION_PEER runs two steps back-to-back:

handler.addRegionLocation(regionId, targetDataNode);   // writes target into PartitionTable
TSStatus status = handler.createNewRegionPeer(regionId, targetDataNode);

createNewRegionPeer then reads the partition table back (which already contains targetDataNode) and unconditionally appends destDataNode again, so the TCreatePeerReq ships a duplicated peer — for example [Peer{nodeId=2}, Peer{nodeId=3}, Peer{nodeId=3}] when adding DN_3 to region 3.

The duplicate is silently swallowed downstream (IoTConsensus.createLocalPeer wraps the list in a TreeSet; IoTConsensusV2PeerManager uses a Set and also logs a misleading DUPLICATE_PEERS_IGNORED WARN), so it has not produced a visible failure, but it (a) pollutes V2 logs with a WARN on every add-peer, (b) makes per-region migration logs hard to read, and (c) is fragile should anyone later use peers.size() for an invariant check.

The fix is defensive: in the IoT-consensus branch, append destDataNode only when it is not already present in regionReplicaNodes. Behavior is identical to before for callers that do not pre-insert the target.

Affects every caller of AddRegionPeerProcedure: reconstruct region, migrate region, extend region, and procedure restore.

Tighten IoTDBRegionReconstructForIoTV1IT.normal1C3DTest await

The post-reconstruct wait condition was:

() -> getRegionStatusWithoutRunning(session).isEmpty()
      && dataDirToBeReconstructed.getAbsoluteFile().exists()

Both halves were weak:

  • dataDirToBeReconstructed.exists() is effectively always true — deleteTsFiles only removes files whose name ends in .tsfile, never the containing directory.
  • getRegionStatusWithoutRunning(session).isEmpty() is a negative check. When reconstruct rolled back, the reconstructed peer's row vanished from show regions briefly while the closed peer's status had not yet been refreshed from Running to Unknown, producing a ~3-second false-positive "everything is Running" window. The test then proceeded past the await, stopped the only live replica, and failed on the subsequent query loop with a misleading Cannot execute query within 60s.

Replaced with a positive assertion: the selected region must contain both peers and both rows must report Running. Done via a new getRegionStatusMap(Session) helper that returns regionId -> dataNodeId -> status. On timeout, the test now logs the actual region status map for the selected region, which is the data you want for diagnosing why reconstruct did not finish.

Also corrected the stale comment that claimed reconstruct happens "from the leader" — the code actually targets the surviving follower; the original leader is the replica that gets stopped and then restarted.


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.

Key changed/added classes (or packages if there are too many classes) in this PR
  • org.apache.iotdb.confignode.procedure.env.RegionMaintainHandler
  • org.apache.iotdb.confignode.it.regionmigration.IoTDBRegionOperationReliabilityITFramework
  • org.apache.iotdb.confignode.it.regionmigration.pass.commit.IoTDBRegionReconstructForIoTV1IT

@sonarqubecloud
Copy link
Copy Markdown

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.38%. Comparing base (4ac0b7e) to head (3fff9dd).

Files with missing lines Patch % Lines
...onfignode/procedure/env/RegionMaintainHandler.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17710      +/-   ##
============================================
- Coverage     40.39%   40.38%   -0.01%     
+ Complexity     2575     2574       -1     
============================================
  Files          5179     5179              
  Lines        349659   349660       +1     
  Branches      44688    44689       +1     
============================================
- Hits         141251   141225      -26     
- Misses       208408   208435      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JackieTien97 JackieTien97 merged commit a37621b into master May 19, 2026
42 of 44 checks passed
@JackieTien97 JackieTien97 deleted the improve-region-reconstruct-iotv1-test branch May 19, 2026 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants