Skip to content

Adjust placement of paragraph markers#298

Open
benjaminking wants to merge 2 commits into
mainfrom
segment_boundary_adjuster
Open

Adjust placement of paragraph markers#298
benjaminking wants to merge 2 commits into
mainfrom
segment_boundary_adjuster

Conversation

@benjaminking
Copy link
Copy Markdown
Collaborator

@benjaminking benjaminking commented May 14, 2026

An SILNLP issue (sillsdev/silnlp#996) surfaced an issue with paragraph marker placement where paragraph markers would sometimes be placed in slightly, but obviously wrong positions, and for example, would lead to segments starting with commas.

SILNLP's verse segmentation code has a class that makes small adjustments to segment boundaries to try to avoid these issues. This PR ports that code to Machine.py, adds tests for it, and incorporates it into the marker placement algorithm.


This change is Reviewable

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ports SILNLP’s segment-boundary adjustment logic into Machine.py to improve the natural placement of paragraph markers during USFM marker placement, avoiding awkward segment starts (e.g., beginning with commas).

Changes:

  • Add SegmentBoundaryAdjuster (+ TokenRejoiner) for adjusting boundaries in both raw-text and tokenized scenarios.
  • Integrate boundary adjustment when placing paragraph markers in PlaceMarkersUsfmUpdateBlockHandler.
  • Add unit tests for the adjuster and an integration-style test covering paragraph-marker adjustment in USFM updates.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
machine/corpora/segment_boundary_adjuster.py New boundary-adjustment implementation for raw and tokenized segment pairs.
machine/corpora/place_markers_usfm_update_block_handler.py Applies boundary adjustment when inserting paragraph markers based on alignment.
tests/corpora/test_segment_boundary_adjuster.py New comprehensive unit tests for boundary-adjustment behavior.
tests/corpora/test_place_markers_usfm_update_block_handler.py Adds a test validating adjusted paragraph-marker placement in an end-to-end update flow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +108 to +115
accumulated_length = len(token_rejoiner.add_token_to_joined_text(token))

if accumulated_length >= target_segment_length:
# In the unlikely case that the adjusted boundary falls in the middle of a token
# select the token boundary that is closest
error_with_current_boundary = accumulated_length - target_segment_length
error_with_previous_boundary = target_segment_length - (accumulated_length - len(token))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not actually an issue, since we're working with tokenized text, which has no whitespace tokens. It doesn't matter on which side of the hypothetical space we decide to draw a boundary, because the boundary will still fall between the same two tokens either way. It's not worth adding extra logic to handle a distinction that will never result in different functionality.

Comment on lines +141 to +144
# If inserting a paragraph marker, make small adjustments to place it in a more natural location
if element.type == UsfmUpdateBlockElementType.PARAGRAPH:
adj_trg_tok = SegmentBoundaryAdjuster().adjust_tokenized_segment_pair_boundaries(adj_trg_tok, trg_toks)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminking made 2 comments and resolved 2 discussions.
Reviewable status: 0 of 4 files reviewed, all discussions resolved.

Comment on lines +141 to +144
# If inserting a paragraph marker, make small adjustments to place it in a more natural location
if element.type == UsfmUpdateBlockElementType.PARAGRAPH:
adj_trg_tok = SegmentBoundaryAdjuster().adjust_tokenized_segment_pair_boundaries(adj_trg_tok, trg_toks)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +108 to +115
accumulated_length = len(token_rejoiner.add_token_to_joined_text(token))

if accumulated_length >= target_segment_length:
# In the unlikely case that the adjusted boundary falls in the middle of a token
# select the token boundary that is closest
error_with_current_boundary = accumulated_length - target_segment_length
error_with_previous_boundary = target_segment_length - (accumulated_length - len(token))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not actually an issue, since we're working with tokenized text, which has no whitespace tokens. It doesn't matter on which side of the hypothetical space we decide to draw a boundary, because the boundary will still fall between the same two tokens either way. It's not worth adding extra logic to handle a distinction that will never result in different functionality.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 99.63100% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.60%. Comparing base (aa71f15) to head (e2452a3).

Files with missing lines Patch % Lines
machine/corpora/segment_boundary_adjuster.py 98.73% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #298      +/-   ##
==========================================
+ Coverage   91.51%   91.60%   +0.09%     
==========================================
  Files         355      357       +2     
  Lines       22934    23205     +271     
==========================================
+ Hits        20987    21257     +270     
- Misses       1947     1948       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@benjaminking benjaminking marked this pull request as ready for review May 14, 2026 18:20
@benjaminking benjaminking requested a review from Enkidu93 May 14, 2026 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants