| qms_version | 2.2.0 |
|---|---|
| sop_version | 2.0.1 |
| Document ID | CSC PR.024 |
| Document Version | 2.0.1 |
| Author | |
| Approval | |
| QMS Version | 2.2.0 |
| Regulatory References | |
The purpose of this SOP is to establish a systematic approach to handling sensitive data disclosures on GitHub repositories, with a focus on text data. The procedure addresses various scenarios based on the repository's privacy settings and the type of data being disclosed.
This SOP applies to all individuals and teams who handle sensitive data and might disclose it on remote repositories. It covers all types of text and tabular sensitive data, in both private and public GitHub repositories.
| Acronym | Definition |
|---|---|
| CSC | Clinical Scientific Computing |
| QMS | Quality Management System |
| CAPA | Corrective and Preventive Action |
| DATIX | Web-based reporting and risk management software for health and social care institutions |
| Role | Description | Responsibility |
|---|---|---|
| Clinical Safety Officer | Clinician | Provides expertise and leadership in all activities associated with the evaluation of clinical risk management for the product. |
| Product owner | Head of Department | Takes ultimate responsibility for the deployed product. |
A data leak on a remote repository occurs when sensitive information is unintentionally committed and pushed to GitHub.
Data leaks can include but are not limited to:
- Personal data: Direct patient identifiers such as names, phone numbers, addresses, patient reports/results.
- Security credentials: This includes usernames, passwords, API keys, tokens, or other information that can be used to gain unauthorized access to systems/platforms/data.
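As an illustration of the two leak categories above, the following is a minimal Python sketch that sorts a piece of text into "personal_data" and/or "security_credentials". The patterns are assumptions chosen for the example (a GitHub-style token prefix, a password assignment, a UK-style mobile number, and two hypothetical column names); they are not an approved or exhaustive detection rule set.

```python
import re

# Illustrative patterns only (assumptions, not an approved rule set).
CATEGORY_PATTERNS = {
    "security_credentials": [
        # GitHub-style token prefixes (ghp_, gho_, ghu_, ghs_, ghr_)
        re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
        # Assignments such as "password: ..." or "API_KEY = ..."
        re.compile(r"(?i)\b(?:password|passwd|api[_-]?key)\s*[:=]\s*\S+"),
    ],
    "personal_data": [
        # UK-style mobile number, e.g. "07123 456789" (assumed format)
        re.compile(r"(?<!\d)07\d{3}\s?\d{6}(?!\d)"),
        # Hypothetical column names that suggest patient identifiers
        re.compile(r"(?i)\b(?:patient[_ ]name|date[_ ]of[_ ]birth)\b"),
    ],
}


def classify(text):
    """Return the sorted leak categories whose patterns match the text."""
    return sorted(
        category
        for category, patterns in CATEGORY_PATTERNS.items()
        if any(pattern.search(text) for pattern in patterns)
    )
```

A match only flags a candidate for manual review; free-text personal data such as names will generally not be caught by patterns like these.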
Even if you have deleted sensitive data from your GitHub repository, rebased, or rewritten the history, the information might still be present in GitHub's cache. GitHub does not clear its caches immediately after these actions, so cached or 'dangling' commits may remain accessible after they have been removed from the repository's visible history.
- Identify the leaked sensitive data and determine the extent of the disclosure. Keep a record of the leaked data for traceability.
- Investigate and log the source of the leak: how did the data end up in the repository?
- Evaluate the potential impact of the leak on affected individuals.
- Investigate the repository's commit history and branch activity to identify when and how the sensitive data was pushed. Log the specific commits, branches, and any pull requests (PRs) involved in the data leak.
- Assess the duration of the exposure and the number of individuals who may have accessed the data during that time. This includes anyone who might have forked or cloned the repository in the meantime.
- Raise a CAPA and inform the team before taking any of the next actions.
- Keep a record of the leaked data for traceability, as the pushed file(s) will be removed.
- Use the BFG Repo-Cleaner tool to remove sensitive data from the repository's history, including commits, branches, and tags. A step-by-step tutorial can be found at https://rtyley.github.io/bfg-repo-cleaner/.
- Review the commit history: inspect the repository's commit history to confirm that the sensitive data has been completely removed and that no residual traces remain. Use the git log command.
- Determine with the Clinical Safety Officer (CSO) what the next steps are (DATIX).
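The investigation steps in the list above (which commits and branches were involved, and for how long the data was exposed) could be supported by a small script. The following is a minimal Python sketch, not an approved CSC tool: the function names and record layout are hypothetical, and it assumes a local clone with git available on the PATH. Git history records commit dates, not push times, so the computed window is only an approximation of the real exposure.

```python
import subprocess
from datetime import datetime, timedelta, timezone

FIELD_SEP = "\x1f"  # unit separator; unlikely to occur in commit metadata


def git(repo, *args):
    """Run a git command inside repo and return its standard output."""
    result = subprocess.run(
        ["git", "-C", repo, *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_log(output):
    """Parse 'sha<sep>author<sep>ISO date<sep>subject' lines into dicts."""
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, author, date, subject = line.split(FIELD_SEP, 3)
        records.append({"sha": sha, "author": author, "date": date, "subject": subject})
    return records


def commits_touching(repo, path):
    """List commits on any branch or tag that touched the given path."""
    fmt = "--format=%H%x1f%an%x1f%cI%x1f%s"  # %cI = committer date, strict ISO 8601
    return parse_log(git(repo, "log", "--all", fmt, "--", path))


def branches_containing(repo, sha):
    """List local and remote-tracking branches that contain the commit."""
    out = git(repo, "branch", "--all", "--contains", sha, "--format=%(refname:short)")
    return [line.strip() for line in out.splitlines() if line.strip()]


def exposure_window(records, removed_at=None) -> timedelta:
    """Approximate exposure: earliest involved commit until removal time."""
    removed_at = removed_at or datetime.now(timezone.utc)
    first = min(datetime.fromisoformat(r["date"]) for r in records)
    return removed_at - first
```

For example, `commits_touching(".", "data/results.csv")` (a hypothetical path) returns the records to copy into the incident log, and each `sha` can then be passed to `branches_containing`.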
Do NOT ask GitHub to clear the cache.
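The BFG step can be sketched as a sequence of commands that follows the usage section of the BFG documentation linked above: clone a mirror, run BFG, expire the reflog and garbage-collect, then push. The Python helper below only builds and optionally runs that sequence. Its function names, the repository URL, and the file names are hypothetical, and it assumes that Java, git, and a downloaded BFG jar are available. According to the BFG documentation, the tool does not modify the latest commit by default, so the sensitive file should first be removed from the current branch tip with an ordinary commit.

```python
import subprocess


def bfg_cleanup_plan(repo_url, mirror_dir, bfg_jar, delete_files=None, replace_text_file=None):
    """Build the cleanup steps as (working directory, argv) pairs.

    Exactly one of delete_files (a file name or glob) or replace_text_file
    (a file listing the strings to redact) must be given.
    """
    if bool(delete_files) == bool(replace_text_file):
        raise ValueError("give exactly one of delete_files or replace_text_file")
    bfg = ["java", "-jar", bfg_jar]
    if delete_files:
        bfg += ["--delete-files", delete_files]
    else:
        bfg += ["--replace-text", replace_text_file]
    bfg.append(mirror_dir)
    return [
        (None, ["git", "clone", "--mirror", repo_url, mirror_dir]),
        (None, bfg),
        # Drop the old objects locally so they are not pushed back
        (mirror_dir, ["git", "reflog", "expire", "--expire=now", "--all"]),
        (mirror_dir, ["git", "gc", "--prune=now", "--aggressive"]),
        (mirror_dir, ["git", "push"]),
    ]


def run_plan(plan, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cwd, argv in plan:
        print(f"[{cwd or '.'}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)
```

Running `run_plan(plan)` with the default `dry_run=True` only prints the commands, which allows the team to review them before history is rewritten.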
- Use BFG Repo-Cleaner: As in the process for public repositories, use the BFG Repo-Cleaner tool to remove sensitive data from the repository's history.
- Verify data removal: Thoroughly inspect the commit history and repository files to confirm that all sensitive data has been removed. Use the git log command.
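The verification step can be made repeatable with two git log queries: one that searches all reachable history for a leaked string (the `-S` "pickaxe" option) and one that lists any commit that still touches the leaked file path. The sketch below is a hypothetical helper, assuming it is run against a fresh clone made after the cleanup. It only covers commits reachable from branches and tags, so it cannot detect dangling commits cached on GitHub.

```python
import subprocess


def verification_commands(needle, path):
    """Build the two git log queries used to look for residual traces."""
    return {
        # Commits whose diff adds or removes the leaked string
        "content": ["git", "log", "--all", "--oneline", f"-S{needle}"],
        # Commits that touch the leaked file path
        "path": ["git", "log", "--all", "--oneline", "--", path],
    }


def remaining_hits(repo, needle, path):
    """Run both queries in repo and return the matching commit lines."""
    hits = {}
    for name, argv in verification_commands(needle, path).items():
        out = subprocess.run(
            argv, cwd=repo, capture_output=True, text=True, check=True
        ).stdout
        hits[name] = out.splitlines()
    return hits


def is_clean(hits):
    """True when neither query returned any commit."""
    return all(not lines for lines in hits.values())
```

Because the search string is itself sensitive, it should be read from a secure location at run time and never written into the script or its logs.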
Even after these steps, residual cached data or dangling commits may persist in the repository. Confirm with the CSO that all the steps were followed correctly, then proceed to the next steps.
- Read GitHub's current guidance on sensitive data, as their policy might have been updated.
- Open a ticket with GitHub Support to remove the cached data associated with the leaked sensitive information. GitHub will ask for the URLs of the dangling commits.
- Determine why the existing preventive measures failed to stop the data leak. Currently, these measures are manual checking, the inclusion of csv, txt, and ipynb file types in .gitignore files, and the use of pre-commit hooks.
- Take action: strengthen the existing pre-commit hooks.
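One possible way to strengthen the hooks is a check that rejects staged files of the blocked types and scans staged content for credential-like assignments. The following is a minimal Python sketch under stated assumptions: the blocked extensions mirror the .gitignore entries named above, while the credential pattern and all function names are hypothetical and would need tuning to the team's data.

```python
import re
import subprocess
import sys
from pathlib import PurePosixPath

BLOCKED_SUFFIXES = {".csv", ".txt", ".ipynb"}  # mirrors the .gitignore entries
# Hypothetical pattern for credential-like assignments
SECRET_PATTERN = re.compile(r"(?i)\b(password|passwd|api[_-]?key|token)\s*[:=]\s*\S+")


def blocked_files(paths):
    """Return the paths whose extension is on the blocked list."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in BLOCKED_SUFFIXES]


def lines_with_secrets(text):
    """Return 1-based line numbers that look like credential assignments."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


def staged_paths():
    """List added, copied, or modified files in the git index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "-z"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.split("\0") if p]


def staged_content(path):
    """Return the staged version of a file ('git show :path')."""
    return subprocess.run(
        ["git", "show", f":{path}"],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout


def main():
    """Return 1 (which blocks the commit) if any problem is found, else 0."""
    paths = staged_paths()
    problems = [f"{p}: blocked file type" for p in blocked_files(paths)]
    for p in paths:
        for number in lines_with_secrets(staged_content(p)):
            problems.append(f"{p}:{number}: possible credential")
    for message in problems:
        print(message, file=sys.stderr)
    return 1 if problems else 0
```

To use it as a hook, the script would end with `sys.exit(main())` and be installed as an executable `.git/hooks/pre-commit` file or registered as a local hook in the team's hook manager. A hook can be bypassed with `git commit --no-verify`, so it complements manual checking and does not replace it.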