LFM2.5-1.2B Training Data Preparation Summary
Source Data
- Input: qmd_expansion_v3.jsonl (964,544 bytes, 1,498 entries)
- Provenance: Generated from cleaned QMD dataset v3
Conversion Process
- Script: convert_to_chatml.py
- Format: Converted to ChatML format for LFM2.5
- Split: 90% train / 10% validation
- Shuffle: Applied with seed=42 for reproducibility
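The shuffle and split described above can be sketched as follows. This is a minimal illustration, not the contents of convert_to_chatml.py: the function name shuffle_and_split, the use of Python's random module, and the shuffle-then-split order are all assumptions. Only the 90/10 ratio, seed=42, and the entry counts come from this summary.

```python
import random

def shuffle_and_split(entries, train_fraction=0.9, seed=42):
    """Shuffle a copy of the entries deterministically, then split train/val.

    Hypothetical helper: the real script may order or round differently.
    """
    shuffled = list(entries)
    # A dedicated Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # floor of 90%
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 1,498 source entries.
train, val = shuffle_and_split(range(1498))
print(len(train), len(val))  # prints: 1348 150
```

Flooring 90% of 1,498 gives 1,348 training entries and leaves 150 for validation, which matches the output file counts reported below.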
Output Files
- Train set: train.jsonl (913K, 1,348 entries)
- Validation set: val.jsonl (101K, 150 entries)
Data Quality Verification
- Conversion success rate: 100% (all 1,498 entries converted; no format issues detected)
- ChatML format: All entries properly formatted
- Required components: All entries contain lex, vec, and hyde expansions
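A check for the required components could look like the sketch below. It assumes the assistant response is available as a plain string with one expansion per line. The helper name missing_components is hypothetical, and the actual verification may apply stricter rules.

```python
REQUIRED_PREFIXES = ("lex:", "vec:", "hyde:")

def missing_components(response):
    """Return the required prefixes that no line of the response starts with.

    Hypothetical helper: an empty list means lex, vec, and hyde are all present.
    """
    lines = [line.strip() for line in response.splitlines()]
    return [
        prefix
        for prefix in REQUIRED_PREFIXES
        if not any(line.startswith(prefix) for line in lines)
    ]

complete = "lex: solar panels\nvec: how do solar panels work\nhyde: Solar panels convert light."
partial = "lex: solar panels\nvec: how do solar panels work"
print(missing_components(complete))  # prints: []
print(missing_components(partial))   # prints: ['hyde:']
```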
Data Statistics
Training Set (1,348 entries)
- Query length: 6-65 chars (avg: 29.3)
- Response length: 307-777 chars (avg: 539.5)
Validation Set (150 entries)
- Query length: 2-56 chars (avg: 28.5)
- Response length: 342-762 chars (avg: 536.4)
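Length statistics of this kind (minimum, maximum, and average character count) can be computed with a few lines of Python. The sketch below uses a hypothetical helper, length_stats, and toy strings rather than the real dataset. It assumes lengths are measured in characters with len() and averages are rounded to one decimal place, as the figures above suggest.

```python
def length_stats(texts):
    """Return (min, max, mean) character lengths, mean rounded to one decimal.

    Hypothetical helper; expects a non-empty list of strings.
    """
    lengths = [len(text) for text in texts]
    return min(lengths), max(lengths), round(sum(lengths) / len(lengths), 1)

# Toy queries for illustration only.
queries = ["ai", "rust async", "best hiking boots"]
print(length_stats(queries))
```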
ChatML Format Structure
<|startoftext|><|im_start|>user
Expand this search query: {original_query}<|im_end|>
<|im_start|>assistant
lex: {lexical_expansion_1}
lex: {lexical_expansion_2}
...
vec: {vector_expansion_1}
vec: {vector_expansion_2}
...
hyde: {hypothetical_document}
<|im_end|>
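To make the template concrete, the sketch below renders one entry into the structure shown above. The function name render_chatml and its parameters are hypothetical. How convert_to_chatml.py actually assembles the string, and the field names in the source JSONL, are not stated in this summary.

```python
def render_chatml(query, lex, vec, hyde):
    """Render one query and its expansions into the ChatML layout above.

    Hypothetical helper: lex and vec are lists of strings, hyde is one string.
    """
    body = [f"lex: {item}" for item in lex]
    body += [f"vec: {item}" for item in vec]
    body.append(f"hyde: {hyde}")
    return (
        "<|startoftext|><|im_start|>user\n"
        f"Expand this search query: {query}<|im_end|>\n"
        "<|im_start|>assistant\n"
        + "\n".join(body)
        + "\n<|im_end|>"
    )

example = render_chatml(
    "solar panels",
    lex=["photovoltaic panels", "solar cell efficiency"],
    vec=["how do solar panels generate electricity"],
    hyde="Solar panels convert sunlight into electricity using photovoltaic cells.",
)
print(example)
```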
Verification
- Format validation: ✅ PASSED
- Content completeness: ✅ PASSED
- File integrity: ✅ PASSED
- Ready for LFM2.5 training: ✅ YES
Generated: $(date)
Conversion time: ~2 seconds
Data ready for fine-tuning