Commit 85ba069

feat: repository analytics & repo populated & repo health score & health score refactor
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
1 parent d6a76c9 commit 85ba069

42 files changed

Lines changed: 1342 additions & 333 deletions

services/libs/tinybird/datasources/project_insights_copy_ds.datasource

Lines changed: 6 additions & 2 deletions
@@ -2,7 +2,9 @@ DESCRIPTION >
     - `project_insights_copy_ds` contains materialized project insights data.
     - Populated by `project_insights_copy.pipe` copy pipe.
     - Includes project metadata, health score, first commit, and activity metrics for last 365 days and previous 365 days.
-    - `id` column is the primary key identifier for the project.
+    - `id` column is the primary key identifier for the project or repository.
+    - `type` column indicates the record type: 'project' for project insights or 'repo' for repository insights.
+    - `repoUrl` column is the full repository URL for repo type records (empty string for project type).
     - `name` column is the human-readable project name.
     - `slug` column is the URL-friendly identifier used in routing and filtering.
     - `logoUrl` column is the URL to the project's logo image.
@@ -35,6 +37,8 @@ TAGS "Project insights", "Metrics"
 
 SCHEMA >
     `id` String,
+    `type` String,
+    `repoUrl` String,
     `name` String,
     `slug` String,
     `logoUrl` String,
@@ -64,4 +68,4 @@ SCHEMA >
     `activeOrganizationsPrevious365Days` UInt64
 
 ENGINE MergeTree
-ENGINE_SORTING_KEY id
+ENGINE_SORTING_KEY type, id
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+DESCRIPTION >
+    - `repo_health_score_copy_ds` contains comprehensive health score metrics and benchmarks per repository.
+    - Created via copy pipe with computed health metrics for repository-level analytics.
+    - Aggregates multiple health dimensions including contributors, popularity, development activity, and security.
+    - `channel` is the repository URL used as the primary key.
+    - `activeContributors` is the unique contributor count for the previous quarter.
+    - `activeContributorsBenchmark` is the benchmark score (0-5) for active contributors.
+    - `contributorDependencyCount` measures contributor concentration risk (bus factor).
+    - `contributorDependencyPercentage` is the combined contribution percentage of dependent contributors.
+    - `contributorDependencyBenchmark` is the benchmark score (0-5) for contributor dependency.
+    - `organizationDependencyCount` measures organizational concentration risk.
+    - `organizationDependencyPercentage` is the combined contribution percentage of dependent organizations.
+    - `organizationDependencyBenchmark` is the benchmark score (0-5) for organization dependency.
+    - `retentionRate` is the quarter-over-quarter contributor retention percentage.
+    - `retentionBenchmark` is the benchmark score (0-5) for retention.
+    - `stars` is the total star count for the repository.
+    - `starsBenchmark` is the benchmark score (0-5) for stars.
+    - `forks` is the total fork count for the repository.
+    - `forksBenchmark` is the benchmark score (0-5) for forks.
+    - `issueResolution` is the average days to close issues (nullable for repos without issues).
+    - `issueResolutionBenchmark` is the benchmark score (0-5) for issue resolution.
+    - `pullRequests` is the PR count in the last 365 days.
+    - `pullRequestsBenchmark` is the benchmark score (0-5) for pull requests.
+    - `mergeLeadTime` is the average days to merge PRs (nullable for repos without PRs).
+    - `mergeLeadTimeBenchmark` is the benchmark score (0-5) for merge lead time.
+    - `activeDaysCount` is the count of distinct active days in the last 365 days.
+    - `activeDaysBenchmark` is the benchmark score (0-5) for active days.
+    - `contributionsOutsideWorkHours` is the percentage of contributions outside work hours.
+    - `contributionsOutsideWorkHoursBenchmark` is the benchmark score (0-5) for outside work hours.
+    - `securityPercentage` is the health score percentage for the security category (0-100).
+    - `contributorPercentage` is the health score percentage for the contributors category (0-100).
+    - `popularityPercentage` is the health score percentage for the popularity category (0-100).
+    - `developmentPercentage` is the health score percentage for the development category (0-100).
+    - `overallScore` is the computed overall health score combining all dimensions.
+
+TAGS "Repository health", "Metrics"
+
+SCHEMA >
+    `channel` String,
+    `activeContributors` UInt64,
+    `activeContributorsBenchmark` UInt64,
+    `contributorDependencyCount` UInt64,
+    `contributorDependencyPercentage` Float64,
+    `contributorDependencyBenchmark` UInt64,
+    `organizationDependencyCount` UInt64,
+    `organizationDependencyPercentage` Float64,
+    `organizationDependencyBenchmark` UInt64,
+    `retentionRate` Float64,
+    `retentionBenchmark` UInt64,
+    `stars` UInt64,
+    `starsBenchmark` UInt64,
+    `forks` UInt64,
+    `forksBenchmark` UInt64,
+    `issueResolution` Nullable(Float64),
+    `issueResolutionBenchmark` UInt64,
+    `pullRequests` UInt64,
+    `pullRequestsBenchmark` UInt64,
+    `mergeLeadTime` Nullable(Float64),
+    `mergeLeadTimeBenchmark` UInt64,
+    `activeDaysCount` UInt64,
+    `activeDaysBenchmark` UInt64,
+    `contributionsOutsideWorkHours` Float64,
+    `contributionsOutsideWorkHoursBenchmark` UInt64,
+    `securityPercentage` Float64,
+    `contributorPercentage` Float64,
+    `popularityPercentage` Float64,
+    `developmentPercentage` Float64,
+    `overallScore` Float64
+
+ENGINE MergeTree
+ENGINE_SORTING_KEY channel
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+DESCRIPTION >
+    - `repositories_populated_ds` contains enriched repository data with computed metrics.
+    - Populated by `repositories_populated_copy.pipe` copy pipe.
+    - Extends base repository data with contributor counts, software valuation, and first commit timestamp.
+    - `id` is the primary key identifier for the repository record.
+    - `url` is the full repository URL.
+    - `segmentId` links to the segment this repository belongs to.
+    - `insightsProjectId` links to the insights project this repository is associated with.
+    - `contributorCount` is the total number of unique contributors for the repository.
+    - `organizationCount` is the total number of unique organizations for the repository.
+    - `softwareValue` is the estimated economic value of the repository software.
+    - `firstCommit` is the timestamp of the first commit in the repository (nullable).
+
+TAGS "Repository metadata", "Analytics enrichment"
+
+SCHEMA >
+    `id` String,
+    `url` String,
+    `segmentId` String,
+    `insightsProjectId` String,
+    `contributorCount` UInt64,
+    `organizationCount` UInt64,
+    `softwareValue` UInt64,
+    `firstCommit` Nullable(DateTime64(3))
+
+ENGINE MergeTree
+ENGINE_SORTING_KEY id, url
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+NODE health_score_active_contributors_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        activeContributors,
+        CASE
+            WHEN activeContributors BETWEEN 0 AND 1 THEN 0
+            WHEN activeContributors BETWEEN 2 AND 3 THEN 1
+            WHEN activeContributors BETWEEN 4 AND 6 THEN 2
+            WHEN activeContributors BETWEEN 7 AND 10 THEN 3
+            WHEN activeContributors BETWEEN 11 AND 20 THEN 4
+            WHEN activeContributors > 20 THEN 5
+            ELSE 0
+        END AS activeContributorsBenchmark
+    FROM $SOURCE_NODE
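
Reading aid for the `health_score_active_contributors_benchmark` node above: the CASE expression buckets an integer contributor count into a 0-5 score. A minimal Python sketch of the same thresholds follows. The function name and the threshold table are illustrative only and are not part of the commit; the sketch assumes integer input, which matches the `UInt64` column type.

```python
def active_contributors_benchmark(active_contributors: int) -> int:
    """Map an integer contributor count to a 0-5 score (hypothetical helper)."""
    # (exclusive lower bound, score), checked from the highest bucket down.
    # For integers this is equivalent to the BETWEEN ranges in the SQL node.
    thresholds = [(20, 5), (10, 4), (6, 3), (3, 2), (1, 1)]
    for lower_exclusive, score in thresholds:
        if active_contributors > lower_exclusive:
            return score
    return 0  # counts of 0 or 1, same as the first WHEN branch


if __name__ == "__main__":
    for count in (0, 2, 6, 7, 20, 21):
        print(count, active_contributors_benchmark(count))
```

The other single-metric benchmark nodes in this commit (active days, forks) follow the same pattern with different cut-offs.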
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+NODE health_score_active_days_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        activeDaysCount,
+        CASE
+            WHEN activeDaysCount BETWEEN 0 AND 5 THEN 0
+            WHEN activeDaysCount BETWEEN 6 AND 10 THEN 1
+            WHEN activeDaysCount BETWEEN 11 AND 15 THEN 2
+            WHEN activeDaysCount BETWEEN 16 AND 20 THEN 3
+            WHEN activeDaysCount BETWEEN 21 AND 26 THEN 4
+            WHEN activeDaysCount > 26 THEN 5
+            ELSE 0
+        END AS activeDaysBenchmark
+    FROM $SOURCE_NODE
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+NODE health_score_contributions_outside_work_hours_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        contributionsOutsideWorkHours,
+        CASE
+            WHEN contributionsOutsideWorkHours >= 75 THEN 0
+            WHEN contributionsOutsideWorkHours BETWEEN 50 AND 74 THEN 1
+            WHEN contributionsOutsideWorkHours BETWEEN 40 AND 49 THEN 2
+            WHEN contributionsOutsideWorkHours BETWEEN 30 AND 39 THEN 3
+            WHEN contributionsOutsideWorkHours BETWEEN 20 AND 29 THEN 4
+            WHEN contributionsOutsideWorkHours BETWEEN 0 AND 19 THEN 5
+            ELSE 0
+        END AS contributionsOutsideWorkHoursBenchmark
+    FROM $SOURCE_NODE
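
Review note on the node above: `contributionsOutsideWorkHours` is declared `Float64` in `repo_health_score_copy_ds`, while the BETWEEN ranges use integer bounds. If the upstream node does not round to whole numbers (the diff does not show whether it does), a value such as 49.5 matches no branch and falls through to `ELSE 0`. The Python sketch below mirrors the branches literally to show this. The second function is a hypothetical gap-free variant, not what the commit does; which bucket should absorb the fractional values is an assumption.

```python
def outside_work_hours_benchmark(pct: float) -> int:
    """Literal mirror of the CASE branches (BETWEEN bounds are inclusive)."""
    if pct >= 75:
        return 0
    if 50 <= pct <= 74:
        return 1
    if 40 <= pct <= 49:
        return 2
    if 30 <= pct <= 39:
        return 3
    if 20 <= pct <= 29:
        return 4
    if 0 <= pct <= 19:
        return 5
    return 0  # ELSE 0: also reached by fractional values between buckets


def outside_work_hours_benchmark_contiguous(pct: float) -> int:
    """Hypothetical variant using lower bounds only, so no value is skipped."""
    if pct < 0:
        return 0
    for lower_bound, score in [(75, 0), (50, 1), (40, 2), (30, 3), (20, 4), (0, 5)]:
        if pct >= lower_bound:
            return score
    return 0


if __name__ == "__main__":
    for value in (45, 49.5, 19.5):
        print(value,
              outside_work_hours_benchmark(value),
              outside_work_hours_benchmark_contiguous(value))
```

For whole-number percentages the two functions agree, so the difference only matters for fractional inputs. The same consideration applies to `issueResolution` and `mergeLeadTime`, which are averages stored as `Nullable(Float64)`.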
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+NODE health_score_contributor_dependency_pct
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        memberId,
+        contributionCount,
+        ROUND(contributionCount * 100.0 / SUM(contributionCount) OVER (PARTITION BY $GROUP_COL), 2) AS contributionPercentage
+    FROM $SOURCE_NODE
+    ORDER BY contributionPercentage DESC
+
+NODE health_score_contributor_dependency_running
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        memberId,
+        contributionPercentage,
+        SUM(contributionPercentage) OVER (
+            PARTITION BY $GROUP_COL ORDER BY contributionPercentage DESC, memberId
+        ) AS contributionPercentageRunningTotal
+    FROM health_score_contributor_dependency_pct
+
+NODE health_score_contributor_dependency_score
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        count() AS contributorDependencyCount,
+        round(sum(contributionPercentage)) AS contributorDependencyPercentage
+    FROM health_score_contributor_dependency_running
+    WHERE
+        contributionPercentageRunningTotal < 51
+        OR (contributionPercentageRunningTotal - contributionPercentage < 51)
+    GROUP BY $GROUP_COL
+
+NODE health_score_contributor_dependency_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        contributorDependencyCount,
+        contributorDependencyPercentage,
+        CASE
+            WHEN contributorDependencyCount BETWEEN 0 AND 1 THEN 0
+            WHEN contributorDependencyCount = 2 THEN 1
+            WHEN contributorDependencyCount BETWEEN 3 AND 4 THEN 2
+            WHEN contributorDependencyCount BETWEEN 5 AND 6 THEN 3
+            WHEN contributorDependencyCount BETWEEN 7 AND 9 THEN 4
+            WHEN contributorDependencyCount > 9 THEN 5
+            ELSE 0
+        END AS contributorDependencyBenchmark
+    FROM health_score_contributor_dependency_score
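
Reading aid for the three dependency nodes above (`_pct`, `_running`, `_score`): together they count how many top contributors are needed before their cumulative share reaches 51%, including the contributor whose share crosses that line. A minimal Python sketch of that logic for a single group follows. The function name and the dictionary input are illustrative assumptions, and Python's `round` may differ from the database's rounding on exact ties.

```python
from typing import Dict, Tuple


def contributor_dependency(contribution_counts: Dict[str, int]) -> Tuple[int, int]:
    """Return (dependency count, combined percentage) for one group (hypothetical helper)."""
    total = sum(contribution_counts.values())
    if total == 0:
        return 0, 0

    # Step 1 (_pct node): each member's share, rounded to 2 decimals.
    # Sorted by share descending, then member id, as in the window ORDER BY.
    shares = sorted(
        ((round(count * 100.0 / total, 2), member)
         for member, count in contribution_counts.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )

    running_total = 0.0
    dependency_count = 0
    dependency_pct = 0.0
    for share, _member in shares:
        previous_total = running_total
        running_total += share  # Step 2 (_running node): cumulative share
        # Step 3 (_score node): keep rows still under 51%, plus the row that crosses it.
        if running_total < 51 or previous_total < 51:
            dependency_count += 1
            dependency_pct += share
    return dependency_count, round(dependency_pct)


if __name__ == "__main__":
    print(contributor_dependency({"a": 60, "b": 30, "c": 10}))
    print(contributor_dependency({"a": 30, "b": 30, "c": 20, "d": 20}))
```

Because shares are non-negative, the first half of the condition implies the second, so the filter reduces to "the running total before this row was under 51". The sketch keeps both halves to stay close to the SQL.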
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+NODE health_score_forks_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        forks,
+        CASE
+            WHEN forks BETWEEN 0 AND 4 THEN 0
+            WHEN forks BETWEEN 5 AND 9 THEN 1
+            WHEN forks BETWEEN 10 AND 19 THEN 2
+            WHEN forks BETWEEN 20 AND 39 THEN 3
+            WHEN forks BETWEEN 40 AND 79 THEN 4
+            WHEN forks >= 80 THEN 5
+            ELSE 0
+        END AS forksBenchmark
+    FROM $SOURCE_NODE
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+NODE health_score_issues_resolution_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        issueResolution,
+        CASE
+            WHEN issueResolution >= 61 THEN 0
+            WHEN issueResolution BETWEEN 51 AND 60 THEN 1
+            WHEN issueResolution BETWEEN 36 AND 50 THEN 2
+            WHEN issueResolution BETWEEN 22 AND 35 THEN 3
+            WHEN issueResolution BETWEEN 8 AND 21 THEN 4
+            WHEN issueResolution BETWEEN 0 AND 7 THEN 5
+            ELSE 0
+        END AS issueResolutionBenchmark
+    FROM $SOURCE_NODE
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+NODE health_score_merge_lead_time_benchmark
+SQL >
+    %
+    SELECT
+        $GROUP_COL,
+        mergeLeadTime,
+        CASE
+            WHEN mergeLeadTime >= 30 THEN 0
+            WHEN mergeLeadTime BETWEEN 21 AND 30 THEN 1
+            WHEN mergeLeadTime BETWEEN 15 AND 20 THEN 2
+            WHEN mergeLeadTime BETWEEN 7 AND 14 THEN 3
+            WHEN mergeLeadTime BETWEEN 3 AND 6 THEN 4
+            WHEN mergeLeadTime BETWEEN 0 AND 2 THEN 5
+            ELSE 0
+        END AS mergeLeadTimeBenchmark
+    FROM $SOURCE_NODE
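
Review note on the node above: the first two branches both cover the value 30 (`>= 30` and `BETWEEN 21 AND 30`). CASE branches are evaluated in order, so a lead time of exactly 30 days scores 0 and the upper bound of the second branch is effectively 29. The Python sketch below mirrors that ordering. It also assumes that a NULL `mergeLeadTime` (a repository without PRs) satisfies no comparison and therefore takes the `ELSE 0` branch; that NULL behavior is an assumption about the engine, not something shown in the diff. The function name is illustrative.

```python
from typing import Optional


def merge_lead_time_benchmark(days: Optional[float]) -> int:
    """Mirror of the CASE branches, evaluated top to bottom (hypothetical helper)."""
    if days is None:
        return 0  # assumption: NULL matches no WHEN branch, so ELSE 0 applies
    if days >= 30:
        return 0  # checked first, so exactly 30 never reaches the next branch
    if 21 <= days <= 30:
        return 1
    if 15 <= days <= 20:
        return 2
    if 7 <= days <= 14:
        return 3
    if 3 <= days <= 6:
        return 4
    if 0 <= days <= 2:
        return 5
    return 0  # ELSE 0


if __name__ == "__main__":
    for value in (None, 30, 29, 14, 2):
        print(value, merge_lead_time_benchmark(value))
```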
