LFM2.5-1.2B Training Data Preparation Summary

Source Data

  • Input: qmd_expansion_v3.jsonl (964,544 bytes, 1,498 entries)
  • Origin: Generated from cleaned QMD dataset v3

Conversion Process

  • Script: convert_to_chatml.py
  • Format: ChatML, as expected by LFM2.5
  • Split: 90% train / 10% validation
  • Shuffle: Applied with seed=42 for reproducibility
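
The shuffle and split described above can be sketched as follows. This is a minimal illustration, not the contents of convert_to_chatml.py, which are not shown here: the function name shuffle_and_split and the choice to round the training count down are assumptions. With 1,498 entries, rounding down reproduces the 1,348 / 150 split reported in this summary.

```python
import random


def shuffle_and_split(entries, train_fraction=0.9, seed=42):
    """Shuffle a copy of `entries` with a fixed seed, then split it.

    Hypothetical helper: the real script may shuffle or round differently.
    """
    shuffled = list(entries)
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in data: integers in place of the 1,498 JSONL entries.
train, val = shuffle_and_split(range(1498))
print(len(train), len(val))  # 1348 150
```

Using a fixed seed means rerunning the conversion on the same input yields the same train and validation membership.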

Output Files

  • Train set: train.jsonl (913K, 1,348 entries)
  • Validation set: val.jsonl (101K, 150 entries)

Data Quality Verification

  • Success rate: 100% (1,498 of 1,498 entries converted; no format issues detected)
  • ChatML format: All entries properly formatted
  • Required components: All entries contain lex, vec, and hyde expansions
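
A check for the required components might look like the sketch below. It assumes each assistant response is plain text in which every expansion sits on its own line with a lex:, vec:, or hyde: prefix; the helper name missing_components is hypothetical and is not taken from the conversion script.

```python
REQUIRED_PREFIXES = ("lex:", "vec:", "hyde:")


def missing_components(response):
    """Return the required prefixes that no line of `response` starts with."""
    lines = [line.strip() for line in response.splitlines()]
    return [
        prefix
        for prefix in REQUIRED_PREFIXES
        if not any(line.startswith(prefix) for line in lines)
    ]


complete = "lex: solar panel cost\nvec: price of home solar\nhyde: Solar panels cost..."
partial = "lex: solar panel cost"
print(missing_components(complete))
print(missing_components(partial))
```

An entry passes when the returned list is empty; a non-empty list names exactly which expansion types are absent.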

Data Statistics

Training Set (1,348 entries)

  • Query length: 6-65 chars (avg: 29.3)
  • Response length: 307-777 chars (avg: 539.5)

Validation Set (150 entries)

  • Query length: 2-56 chars (avg: 28.5)
  • Response length: 342-762 chars (avg: 536.4)
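
Length figures like the ones above can be computed with a few lines of Python. The sketch assumes lengths are counted in characters with len() and that the average is rounded to one decimal place; length_stats is a hypothetical name, and the sample strings are placeholders rather than entries from the dataset.

```python
def length_stats(texts):
    """Return (min, max, mean rounded to 1 decimal) of character lengths."""
    lengths = [len(text) for text in texts]
    return min(lengths), max(lengths), round(sum(lengths) / len(lengths), 1)


# Placeholder strings of length 2, 4, and 6.
sample = ["ab", "abcd", "abcdef"]
print(length_stats(sample))  # (2, 6, 4.0)
```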

ChatML Format Structure

<|startoftext|><|im_start|>user
Expand this search query: {original_query}<|im_end|>
<|im_start|>assistant
lex: {lexical_expansion_1}
lex: {lexical_expansion_2}
...
vec: {vector_expansion_1}
vec: {vector_expansion_2}
...
hyde: {hypothetical_document}
<|im_end|>
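
As an illustration, one entry could be rendered into the structure above like this. The function to_chatml and its parameters are hypothetical, and the example values are invented placeholders; the sketch only mirrors the template shown in this section.

```python
def to_chatml(query, lex, vec, hyde):
    """Render one training example following the template in this summary."""
    body = [f"lex: {item}" for item in lex]
    body += [f"vec: {item}" for item in vec]
    body.append(f"hyde: {hyde}")
    return (
        "<|startoftext|><|im_start|>user\n"
        f"Expand this search query: {query}<|im_end|>\n"
        "<|im_start|>assistant\n"
        + "\n".join(body)
        + "\n<|im_end|>"
    )


example = to_chatml(
    query="solar cost",
    lex=["solar panel price", "photovoltaic installation cost"],
    vec=["how much does it cost to install solar panels"],
    hyde="Installing residential solar panels typically involves...",
)
print(example)
```

Keeping the lex, vec, and hyde lines in a fixed order matches the layout of the template, which makes a line-prefix check on the output straightforward.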

Verification

  • Format validation: ✅ PASSED
  • Content completeness: ✅ PASSED
  • File integrity: ✅ PASSED
  • Ready for LFM2.5 training: ✅ YES

  • Generated: $(date)
  • Conversion time: ~2 seconds
  • Status: Data ready for fine-tuning