Skip to content

Stabilizer reimplementation#1018

Open
gtystahl wants to merge 15 commits intogoogle:mainfrom
gtystahl:stabilizerReimplementation
Open

Stabilizer reimplementation#1018
gtystahl wants to merge 15 commits intogoogle:mainfrom
gtystahl:stabilizerReimplementation

Conversation

@gtystahl
Copy link
Copy Markdown
Collaborator

This is a rework of the stabilizer PR. I tried to address comments as I saw them, but I added some investigation TODOs based on ones that may still need some work (in this or future PRs).

Main bonus of this PR is testing which showcases the ability of the stabilizers much better than last time.

@gtystahl gtystahl requested a review from msuozzo December 12, 2025 15:50
@pohlarized pohlarized self-requested a review January 8, 2026 05:28
Copy link
Copy Markdown
Collaborator

@pohlarized pohlarized left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now i looked mostly at the individual stabilizers and whether the implementation seems sound to me.
For whether the idea behind the stabilizers themselves, and the guarantee that it doesn't allow for any attacks is solid enough to be implemented, I'm having a difficult time evaluating it.
On the one hand, I guess it's sort of subjective how confident you want to be in the safety to accept them.
On the other hand, without an explicit rationale why stabilizers exist in the first place, and why they're believed to be safe, it's hard to come up with an evaluation, since i would essentially have to come up with an attack that i can imagine of the top of my head that can abuse this stabilization behaviour?

But i hope this helps already!

Comment thread pkg/stabilize/pypi.go Outdated
Comment thread pkg/stabilize/pypi.go Outdated
Comment thread pkg/stabilize/pypi.go Outdated
Comment thread pkg/stabilize/pypi.go
keys := make([]string, 0, len(manifest.Header))
for key := range manifest.Header {
if key != "MessageBodyDescription" {
keys = append(keys, key)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is MessageBodyDescription in particular left out here?

Comment thread pkg/stabilize/pypi.go Outdated
Comment on lines +396 to +399
if mayContainGeneratedDocstring(string(originalContent)) {
println("File likely contains auto-generated docstrings, applying stabilizer:", zf.Name)

cleanedContent, err := RemovePythonComments(originalContent)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i understand this, if we find at least a single docstring that we tag as "likely autogenerated", we remove all docstrings from the file?

  1. I'm not super familiar with how stabilizers are applied in oss-rebuild. If we apply them to both, the built artifact and the remote artifact, i believe it is possible that this condition triggers only on one of the files? So the use case is that the remote artifact has autogenerated docstrings, but we don't build them when building via oss-rebuild, right? Then this condition would trigger on the file for the remote artifact, stripping all docstrings, but the "hand written" docstrings would remain in the original artifact? Which may cause problems of the two not matching up again.
  2. I'm not sure how confident we want to be with guaranteeing that the stabilizers have no security impact. But if we're really confident that stripping docstrings is fine, then can't we just unconditionally remove them? And if not, can we really confidently strip all doc comments from a file that contains a single autogenerated one?

Copy link
Copy Markdown
Collaborator

@pohlarized pohlarized Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked about this with William and i think where we ended up was

  • we would probably want some numbers to estimate the impact of this stabilizer
  • ideally, we would only remove the autogenerated docstring instead of removing all docstrings just because a single docstring was likely autogenerated

The second requirement becomes even stronger if stabilizers are applied to both artifacts independently (which i'm not sure of) because otherwise we just strip the remote artifact (which presumably contains the autogenerated docstrings) of all comments, while our own artifact (which presumably does not contain the autogenerated docstrings?) keeps its own docstrings, again causing irreproducibility.

edit: I looked at the stabilizer documentation again and it seems like the stabilizers are indeed independently applied to both packages.

Comment thread pkg/stabilize/pypi.go Outdated
// RemovePythonComments takes a byte slice containing Python code,
// removes all single-line comments (starting with #) and multi-line comments (docstrings),
// and returns the modified code as a new byte slice, preserving original indentation.
func RemovePythonComments(pythonCode []byte) ([]byte, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good effort with this hand-written parser 😄
I have to admit, so far i didn't check it in detail.

However, just in principle, hand-parsing programming languages always feels super brittle to me, and kind of prone to small mistakes or missed edge-cases.
Did you consider something like tree-sitter-go to actually parse the file, and use its output to identify comments and docstrings?
I believe that should be possible to do.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this point as well, and apparently there has been a discussion about using tree-sitter before, but it was not done because it would pull in too large of a dependency, since it's not pure go?
I don't think i'm in a role to decide on whether or not we want new dependencies, so if that's the reason to not use an existing parser then that's how it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants