State-of-the-art LLMs are rarely trained on a single corpus. Instead, they sample from several heterogeneous data sources (code, web, academic papers, forums…). The relative proportion of each source can strongly affect downstream performance. Recent open-source models such as Llama 2 introduced a **temperature-based sampling scheme** where the probability of drawing a document from corpus *i* becomes

```
p(i) = \frac{w_i^{\alpha}}{\sum_j w_j^{\alpha}}
```

* *w<sub>i</sub>* – raw token percentage of corpus *i*
* *α* ("temperature") – a value in (0, 1]; α < 1 flattens the distribution, giving more weight to smaller high-quality corpora.

Llama 2 used α = 0.7 and showed that decreasing α boosted evaluation scores on knowledge-heavy tasks while keeping the training mix stable. The same trick has reportedly been adopted by later models such as Mistral (2023) and Claude 3.
```python
import random
from collections import Counter

def temperature_sample(corpus_ids, alpha=0.7):
    counts = Counter(corpus_ids)        # number of tokens seen per corpus
    probs = {c: n ** alpha for c, n in counts.items()}
    Z = sum(probs.values())             # normalization constant
    probs = {c: p / Z for c, p in probs.items()}
    # Draw a corpus according to the tempered distribution to fill each batch.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```
### 2. Sequence Packing / Dynamic Batching
GPU memory is wasted when every sequence in a batch is padded to the longest example. "Packing" concatenates multiple shorter sequences until the **exact** `max_length` is reached and builds a parallel `attention_mask` so that tokens do not attend across segment boundaries. Packing can improve throughput by 20–40 % with no change to per-token gradients and is supported out-of-the-box in several open-source training libraries.
Dynamic batching frameworks (e.g. FlashAttention 2, vLLM 2024) combine sequence packing with just-in-time kernel selection, enabling thousand-token-context training at 400+ K tokens/s on an A100-80G.
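
Below is a minimal packing sketch; the helper names (`pack_sequences`, `segment_attention_mask`) and the segment-id convention are illustrative assumptions, not the API of any particular framework. It greedily fills blocks to the exact `max_length` and records per-token segment ids from which a causal, segment-local attention mask can be built.

```python
from typing import List, Tuple

def pack_sequences(docs: List[List[int]], max_length: int,
                   pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Greedily concatenate tokenized docs into fixed-length blocks,
    keeping a per-token segment id so the mask can forbid
    cross-document attention."""
    blocks, segments = [], []
    cur_toks, cur_segs = [], []
    for seg_id, doc in enumerate(docs):
        for tok in doc:
            cur_toks.append(tok)
            cur_segs.append(seg_id)
            if len(cur_toks) == max_length:
                blocks.append(cur_toks)
                segments.append(cur_segs)
                cur_toks, cur_segs = [], []
    if cur_toks:                     # pad the final, partially filled block
        pad = max_length - len(cur_toks)
        blocks.append(cur_toks + [pad_id] * pad)
        segments.append(cur_segs + [-1] * pad)   # -1 marks padding
    return blocks, segments

def segment_attention_mask(segs: List[int]) -> List[List[int]]:
    """Causal mask that additionally blocks attention across segments."""
    n = len(segs)
    return [[1 if (segs[q] == segs[k] and segs[q] != -1 and k <= q) else 0
             for k in range(n)] for q in range(n)]
```

Production systems avoid materializing the O(n²) mask in Python, passing segment boundaries directly to fused attention kernels instead.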
### 3. Deduplication & Quality Filtering
Repeated passages cause memorization and provide an easy channel for data-poisoning. Modern pipelines therefore:
1. Run MinHash/FAISS near-duplicate detection at the **document** and **128-gram** level (a MinHash sketch follows this list).
2. Filter documents whose perplexity under a small reference model exceeds µ + 3σ (typically noisy OCR or garbled HTML).
3. Block-list documents that contain PII or CWE keywords using regex & spaCy NER.
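
A hedged sketch of step 1 using the open-source `datasketch` library; the word-8-gram shingles and the 0.8 Jaccard threshold are illustrative choices, not the settings of any named pipeline.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, n=8, num_perm=128):
    """MinHash signature over word 8-gram shingles of a document."""
    tokens = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - n + 1, 1)):
        m.update(" ".join(tokens[i:i + n]).encode("utf-8"))
    return m

def dedup(corpus, threshold=0.8):
    """Keep the first copy of each near-duplicate cluster (greedy pass)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = {}
    for key, text in corpus.items():
        mh = minhash_of(text)
        if lsh.query(mh):        # near-duplicate of an already-kept doc
            continue
        lsh.insert(key, mh)
        kept[key] = text
    return kept
```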
The Llama 2 team deduplicated with 8-gram MinHash and removed ~15 % of CommonCrawl before sampling. OpenAI’s 2024 "Deduplicate Everything" paper demonstrates that a duplicate ratio ≤ 0.04 reduces over-fitting and speeds convergence.
## Security & Privacy Considerations During Sampling
### Data-Poisoning / Backdoor Attacks
Researchers have shown that inserting fewer than 1 % backdoored sentences can make a model obey a hidden trigger ("PoisonGPT", 2023). Recommended mitigations:
* **Shuffled mixing** – make sure adjacent training examples originate from different sources; this dilutes gradient alignment of malicious spans.
* **Gradient similarity scoring** – compute the cosine similarity of each example's gradient to the batch average; outliers are candidates for removal (see the sketch after this list).
* **Dataset versioning & hashes** – freeze immutable tarballs and verify SHA-256 before each training run.
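
A sketch of the gradient-similarity score, assuming per-example gradients have already been flattened into a matrix (e.g. via micro-batching); the z-score cutoff of 3 is an illustrative assumption.

```python
import numpy as np

def gradient_outliers(grads: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """grads: (batch, dim) matrix of flattened per-example gradients.
    Returns indices whose cosine similarity to the mean gradient is
    unusually low (z-score below -z_thresh): poisoning candidates."""
    mean = grads.mean(axis=0)
    mean /= np.linalg.norm(mean) + 1e-12          # unit-norm batch direction
    cos = grads @ mean / (np.linalg.norm(grads, axis=1) + 1e-12)
    z = (cos - cos.mean()) / (cos.std() + 1e-12)
    return np.where(z < -z_thresh)[0]
```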
### Membership-Inference & Memorization
Long overlap between sliding-window samples increases the chance that rare strings (telephone numbers, secret keys) are memorized. OpenAI’s 2024 study on ChatGPT memorization reports that raising stride from 1 × `max_length` to 4 × reduces verbatim leakage by ≈50 % with negligible loss in perplexity.
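
A minimal sliding-window sketch (a hypothetical helper, following the usual next-token convention where targets are inputs shifted by one); setting `stride >= max_length` yields non-overlapping windows.

```python
def sliding_windows(token_ids, max_length, stride):
    """Yield (input, target) windows over a token stream; the target is
    the input shifted by one token. stride >= max_length means no
    overlap, reducing the verbatim-memorization risk discussed above."""
    for start in range(0, len(token_ids) - max_length, stride):
        chunk = token_ids[start : start + max_length + 1]
        yield chunk[:-1], chunk[1:]
```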
Practical recommendations:
* Use **stride ≥ max_length**, except for models under 1 B parameters where data volume is scarce.
297
+
* Add random masking of 1-3 tokens per window during training (sketched below); this lowers memorization while preserving utility.
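
A sketch of that masking trick; `mask_id` and the 1-3 range come from the bullet above, the rest is an illustrative assumption.

```python
import random

def mask_window(window, mask_id):
    """Replace 1-3 random positions in a training window with mask_id."""
    window = list(window)               # copy; keep the original intact
    k = min(random.randint(1, 3), len(window))
    for pos in random.sample(range(len(window)), k):
        window[pos] = mask_id
    return window
```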
---
## References
- [Build a Large Language Model from Scratch (Manning, 2024)](https://www.manning.com/books/build-a-large-language-model-from-scratch)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
- [PoisonGPT: Assessing Backdoor Vulnerabilities in Large Language Models (BlackHat EU 2023)](https://arxiv.org/abs/2308.12364)