Commit 8f581c1

Merge branch 'main' into feat/repository-analytics
2 parents: afd6bd6 + 76d870a

58 files changed

Lines changed: 4906 additions & 788 deletions


.claude/settings.json

Lines changed: 6 additions & 0 deletions
```diff
@@ -0,0 +1,6 @@
+{
+  "enabledPlugins": {
+    "typescript-lsp@claude-plugins-official": true,
+    "pyright-lsp@claude-plugins-official": true
+  }
+}
```

.claude/skills/scaffold-snowflake-connector/SKILL.md

Lines changed: 9 additions & 1 deletion
```diff
@@ -196,6 +196,8 @@ After Step 1 and/or Step 2, build a column registry per table:
 
 **Store this as the canonical column reference. Every column name used in generated code must appear in this registry. Never assume or invent a column name.**
 
+**Flag non-VARCHAR column types** (e.g., `DATE`, `TIME`, `TIMESTAMP_TZ`, `BOOLEAN`, `NUMBER`) — these arrive as native JS types from the Parquet reader, not strings (see touch point 9 rules).
+
 For each JOIN table, check whether any existing transformer in `services/apps/snowflake_connectors/src/integrations/` queries from the same table. If yes, inherit its column mappings; if no, treat every column as unknown and derive it from sample data in the Pre-Analysis step below.
 
 ### Step 3 — Sample data
@@ -342,7 +344,10 @@ After all identity fields are confirmed, summarize how `buildMemberIdentities()`
 
 ### 3b. Organization Mapping
 
-If Pre-Analysis determined there is no org data (no org-related columns found in any table), confirm: "I don't see any organization columns in the schema. Does this source have org/company data?" — if yes, proceed; if no, skip to 3c.
+If Pre-Analysis determined there is no org data (no org-related columns found in any table): before asking the user, first read existing transformers in `services/apps/snowflake_connectors/src/integrations/` to check whether any of them join an org table using a key that also exists in the user's tables. If a match is found, prompt the user:
+> "I don't see org columns in the tables you provided, but [EXISTING_PLATFORM] sources org data from `{ORG_TABLE}` via `{join_key}` — which also appears in your table. Did you mean to include this? (Recommended)"
+
+If no existing pattern is joinable, ask: "I don't see any org columns. Does this source have org/company data?" — if yes, ask for the table; if no, skip to 3c.
 
 If Pre-Analysis identified org columns:
@@ -565,6 +570,7 @@ File: `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/b
 **Rules (enforced — do not deviate):**
 - Use explicit column names only. Do not use `table.*` or `table.* EXCLUDE (...)` in new implementations — existing sources (TNC, CVENT) use these patterns but new sources should list columns explicitly to avoid parquet encoding/decoding issues
 - If any TIMESTAMP_TZ columns exist in the schema, exclude and re-cast them as TIMESTAMP_NTZ (see CVENT pattern)
+- Do not concatenate or transform date/time columns in SQL — keep them as separate columns and let the transformer handle type coercion (see touch point 9 rules)
 - Follow the CTE structure:
   1. `org_accounts` CTE (if org data present)
   2. `CDP_MATCHED_SEGMENTS` CTE (always)
@@ -585,6 +591,8 @@ Show the full generated file and ask for confirmation before writing.
 File: `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/transformer.ts`
 
 **Rules (enforced — do not deviate):**
+
+- **Parquet type coercion — never blindly cast `row.COLUMN as string`.** Snowflake types may arrive as native JS types after Parquet decoding (e.g., `DATE` → `Date` object, `TIME` → `number` in ms, `BOOLEAN` → `boolean`). Always check the Snowflake column type from the schema registry and handle the actual JS type the Parquet reader delivers — do not assume every column is a string.
 - All string comparisons must be case-insensitive: use `.toLowerCase()` on both sides of the comparison only; preserve the original value in the output
 - No broad `else` statements — every branch must have an explicit condition
 - All column names referenced in code must exactly match the schema registry — never assumed
```
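The Parquet type-coercion rule for the transformer lends itself to a small helper. The sketch below is illustrative only — the registry shape, column names, and `coerceValue` function are hypothetical, not the actual CDP code; it assumes a Parquet reader that delivers `DATE` as a JS `Date`, `TIME` as milliseconds since midnight, and `BOOLEAN` as a native boolean, as described above.

```typescript
// Hypothetical sketch — not the real CDP transformer. Assumes the Step 2
// column registry maps each column name to its Snowflake type.
type SnowflakeType = 'VARCHAR' | 'NUMBER' | 'BOOLEAN' | 'DATE' | 'TIME'

const registry: Record<string, SnowflakeType> = {
  MEMBER_EMAIL: 'VARCHAR', // illustrative columns, not a real schema
  JOINED_AT: 'DATE',
  IS_ACTIVE: 'BOOLEAN',
  EVENT_TIME: 'TIME',
}

// Coerce a raw Parquet value based on the registered Snowflake type,
// instead of blindly casting `row.COLUMN as string`.
function coerceValue(column: string, value: unknown): string | undefined {
  if (value === null || value === undefined) return undefined
  switch (registry[column]) {
    case 'DATE':
      // DATE arrives as a native JS Date object, not a string
      return value instanceof Date ? value.toISOString().slice(0, 10) : String(value)
    case 'TIME':
      // TIME arrives as a number of milliseconds since midnight
      return typeof value === 'number'
        ? new Date(value).toISOString().slice(11, 19)
        : String(value)
    case 'BOOLEAN':
      // BOOLEAN arrives as a native boolean; String(true) === 'true'
      return String(value)
    case 'VARCHAR':
    case 'NUMBER':
      return String(value)
    default:
      // Unknown column: fail loudly rather than guessing — every column
      // must appear in the schema registry per the rules above
      throw new Error(`Column ${column} is not in the schema registry`)
  }
}
```

Keeping the registry lookup inside one helper also satisfies the "no broad `else`" rule: every branch is an explicit type case, and unregistered columns throw instead of falling through.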

.gitignore

Lines changed: 2 additions & 2 deletions
```diff
@@ -196,13 +196,13 @@ services/libs/tinybird/.diff_tmp
 # custom cursor rules
 .cursor/rules/*.local.mdc
 
-# claude code
+# claude
+.claude/settings.local.json
 .claude/cache/
 .claude/tmp/
 .claude/logs/
 .claude/sessions/
 .claude/todos/
-.claude/settings.local.json
 
 # git integration test repositories & output
 services/apps/git_integration/src/test/repos/
```

CLAUDE.md

Lines changed: 71 additions & 0 deletions
New file (71 additions):

# CDP — Community Data Platform

CDP is a community data platform by the Linux Foundation. It ingests millions of activities and events daily from platforms like GitHub, GitLab, and many others (not just code hosting). Open-source projects get onboarded by connecting integrations, and data flows continuously at scale.

The ingested data is often messy. A big part of what CDP does is improve data quality: deduplicating member and organization profiles through merge and unmerge operations, enriching data via third-party providers, and resolving identities across sources. The cleaned data powers analytics and insights for LFX products.

The codebase started as crowd.dev, an open-source startup later acquired by the Linux Foundation. Speed was prioritized over standards, but the platform is now stable. The focus has shifted to maintainable patterns, scalability, and good developer experience. Performance matters at this scale: even small inefficiencies compound across millions of data points.

## Tech stack

TypeScript, Node.js, Express, PostgreSQL (pg-promise), Temporal, Kafka, Redis, OpenSearch, Zod, Bunyan, AWS S3.

Vue 3, Vite, Tailwind CSS, Element Plus, Pinia, TanStack Vue Query, Axios.

Package manager is **pnpm**. Monorepo managed via pnpm workspaces.

## Codebase structure

```
backend/       -> APIs (public endpoints for LFX products + internal for CDP UI)
frontend/      -> CDP Platform UI
services/apps/ -> Microservices — Temporal workers, Node.js workers, webhook APIs
services/libs/ -> Shared libraries used across services
```

`services/libs/common` holds shared utilities, error classes, and helpers. If a piece of logic is reusable (not business logic), it belongs there.

`services/libs/data-access-layer` holds all database query functions. Check here before writing new ones — duplicates are already a problem.

## Patterns in transition

Old and new patterns coexist. Always use the new pattern.

- **Sequelize -> pg-promise**: Sequelize is legacy (backend only). Use `queryExecutor` from `@crowd/data-access-layer` for all new database code.
- **Classes -> functions**: Class-based services and repos are legacy. Write plain functions — composable, modular, easy to test.
- **Multi-tenancy -> single tenant**: Multi-tenancy is being phased out. The tenant table still exists. Code uses `DEFAULT_TENANT_ID` from `@crowd/common`. Don't add new multi-tenant logic.
- **Legacy auth -> Auth0**: Auth0 is the current auth system. Ignore old JWT patterns.
- **Zod for validation**: Public API endpoints use Zod schemas with `validateOrThrow`. Follow this pattern for all new endpoints.

## Working with the database

Millions of rows. Every query matters.

- Look up the table schema and indexes before writing any query. Don't select or touch columns blindly.
- Check existing functions in `data-access-layer` before writing new ones. Weigh the blast radius of modifying a shared function — sometimes a new function is safer.
- Write queries with performance in mind. Think about what indexes exist, what the query plan looks like, and whether you're scanning more rows than needed.
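The plain-function style for database code can be sketched as below. The real `queryExecutor` API from `@crowd/data-access-layer` is not reproduced here — the `QueryExecutor` interface, table, and column names are assumptions; only the pg-promise named-parameter syntax (`$(name)`) is standard.

```typescript
// Assumed minimal executor interface — the real one in
// `@crowd/data-access-layer` differs; this is a sketch of the style.
interface QueryExecutor {
  select<T>(sql: string, params: Record<string, unknown>): Promise<T[]>
}

interface MemberRow {
  id: string
  displayName: string
}

// A plain function, not a class method: the executor comes in as an
// argument, so the function is composable and trivially testable.
async function findMembersBySegment(
  qx: QueryExecutor,
  segmentId: string,
  limit: number,
): Promise<MemberRow[]> {
  // Explicit column list and parametrized filter — no SELECT *
  return qx.select<MemberRow>(
    `select id, "displayName" from members
      where "segmentId" = $(segmentId)
      limit $(limit)`,
    { segmentId, limit },
  )
}

// A stub executor exercises the function without a database
const stub: QueryExecutor = {
  select: async <T>(_sql: string, _params: Record<string, unknown>) =>
    [{ id: 'm-1', displayName: 'Ada' }] as unknown as T[],
}
```

Because the executor is injected, the same function runs against a real connection in production and a stub in tests — no class wiring required.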
## Code quality

- Functional and modular. Code should be easy to plug in, pull out, and test independently.
- Think about performance at scale, even for small changes.
- Define types properly — extend and reuse existing types. Don't sprinkle `any`.
- Don't touch working code outside the scope of the current task.
- Prefer doing less over introducing risk. Weigh trade-offs before acting.
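The "extend and reuse existing types" point, as a tiny illustration (the type names are hypothetical, not from the CDP codebase):

```typescript
// An existing domain type (hypothetical)
interface Member {
  id: string
  displayName: string
}

// Extend the existing type for an enriched variant rather than
// redefining its fields or reaching for `any`
interface EnrichedMember extends Member {
  organizationId?: string
}

function enrich(member: Member, organizationId: string): EnrichedMember {
  return { ...member, organizationId }
}
```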
