Stabilizer reimplementation#1018
Conversation
pohlarized
left a comment
There was a problem hiding this comment.
For now i looked mostly at the individual stabilizers and whether the implementation seems sound to me.
For whether the idea behind the stabilizers themselves, and the guarantee that it doesn't allow for any attacks is solid enough to be implemented, I'm having a difficult time evaluating it.
On the one hand, I guess it's sort of subjective how confident you want to be in the safety to accept them.
On the other hand, without an explicit rationale why stabilizers exist in the first place, and why they're believed to be safe, it's hard to come up with an evaluation, since i would essentially have to come up with an attack that i can imagine of the top of my head that can abuse this stabilization behaviour?
But i hope this helps already!
| keys := make([]string, 0, len(manifest.Header)) | ||
| for key := range manifest.Header { | ||
| if key != "MessageBodyDescription" { | ||
| keys = append(keys, key) |
There was a problem hiding this comment.
Why is MessageBodyDescription in particular left out here?
| if mayContainGeneratedDocstring(string(originalContent)) { | ||
| println("File likely contains auto-generated docstrings, applying stabilizer:", zf.Name) | ||
|
|
||
| cleanedContent, err := RemovePythonComments(originalContent) |
There was a problem hiding this comment.
As i understand this, if we find at least a single docstring that we tag as "likely autogenerated", we remove all docstrings from the file?
- I'm not super familiar with how stabilizers are applied in oss-rebuild. If we apply them to both, the built artifact and the remote artifact, i believe it is possible that this condition triggers only on one of the files? So the use case is that the remote artifact has autogenerated docstrings, but we don't build them when building via oss-rebuild, right? Then this condition would trigger on the file for the remote artifact, stripping all docstrings, but the "hand written" docstrings would remain in the original artifact? Which may cause problems of the two not matching up again.
- I'm not sure how confident we want to be with guaranteeing that the stabilizers have no security impact. But if we're really confident that stripping docstrings is fine, then can't we just unconditionally remove them? And if not, can we really confidently strip all doc comments from a file that contains a single autogenerated one?
There was a problem hiding this comment.
I talked about this with William and i think where we ended up was
- we would probably want some numbers to estimate the impact of this stabilizer
- ideally, we would only remove the autogenerated docstring instead of removing all docstrings just because a single docstring was likely autogenerated
The second requirement becomes even stronger if stabilizers are applied to both artifacts independently (which i'm not sure of) because otherwise we just strip the remote artifact (which presumably contains the autogenerated docstrings) of all comments, while our own artifact (which presumably does not contain the autogenerated docstrings?) keeps its own docstrings, again causing irreproducibility.
edit: I looked at the stabilizer documentation again and it seems like the stabilizers are indeed independently applied to both packages.
| // RemovePythonComments takes a byte slice containing Python code, | ||
| // removes all single-line comments (starting with #) and multi-line comments (docstrings), | ||
| // and returns the modified code as a new byte slice, preserving original indentation. | ||
| func RemovePythonComments(pythonCode []byte) ([]byte, error) { |
There was a problem hiding this comment.
Good effort with this hand-written parser 😄
I have to admit, so far i didn't check it in detail.
However, just in principle, hand-parsing programming languages always feels super brittle to me, and kind of prone to small mistakes or missed edge-cases.
Did you consider something like tree-sitter-go to actually parse the file, and use its output to identify comments and docstrings?
I believe that should be possible to do.
There was a problem hiding this comment.
We talked about this point as well, and apparently there has been a discussion about using tree-sitter before, but it was not done because it would pull in too large of a dependency, since it's not pure go?
I don't think i'm in a role to decide on whether or not we want new dependencies, so if that's the reason to not use an existing parser then that's how it is.
PyPI Stabilizers: improve file name checks
Updating the stable stabilizer branch with current main changes
Catching back up from main
Signed-off-by: gtystahl <43938328+gtystahl@users.noreply.github.com>
This is a rework of the stabilizer PR. I tried to address comments as I saw them, but I added some investigation TODOs based on ones that may still need some work (in this or future PRs).
Main bonus of this PR is testing which showcases the ability of the stabilizers much better than last time.