What's Changed
- refactor by @SeanLee97 in #112
Detailed changes:
- Use uv to manage dependencies
- Simplify the implementation
- Remove all imports of AngleDataTokenizer
- Remove all imports of DatasetFormats
- Remove all .map(AngleDataTokenizer(...)) calls
- Update dataset field names (text → query for formats B and C), or use --column_rename_mapping instead
- Add is_llm=True to LLM model initialization
- Replace --prompt_template with --text_prompt, --query_prompt, or --doc_prompt
- Update training scripts to use accelerate launch
- Update evaluation code if it uses the return value
- Support input data as a list of strings. New data formats:
  - A: {"text1": str | List[str], "text2": str | List[str], "label": float}
  - B: {"query": str | List[str], "positive": str | List[str]}
  - C: {"query": str | List[str], "positive": str | List[str], "negative": str | List[str]}
- Support fsdp training
- Update docs
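
To make the new data formats and the text → query rename listed above more concrete, here is a minimal sketch in plain Python. It does not call the library: the sample values are made up, and `detect_format` and `rename_text_to_query` are hypothetical helpers written for illustration only, not part of the package.

```python
from typing import Dict, List, Union

# A text field may be a single string or a list of strings.
Text = Union[str, List[str]]
Record = Dict[str, Union[Text, float]]

# Illustrative records for formats A, B, and C (values are invented).
format_a: Record = {
    "text1": "A man is playing a guitar.",
    "text2": ["A person plays music.", "Someone plays an instrument."],
    "label": 0.8,
}
format_b: Record = {
    "query": "what is uv?",
    "positive": "uv is a Python package manager.",
}
format_c: Record = {
    "query": "what is uv?",
    "positive": ["uv is a Python package manager."],
    "negative": "UV light has a shorter wavelength than visible light.",
}


def detect_format(record: Record) -> str:
    """Hypothetical helper: infer the format letter from the record's keys."""
    keys = set(record)
    if keys == {"text1", "text2", "label"}:
        return "A"
    if keys == {"query", "positive"}:
        return "B"
    if keys == {"query", "positive", "negative"}:
        return "C"
    raise ValueError(f"unrecognized record keys: {sorted(keys)}")


def rename_text_to_query(record: Record) -> Record:
    """Hypothetical helper: return a copy with a 'text' field renamed to 'query'."""
    if "text" in record and "query" not in record:
        migrated = dict(record)  # copy so the caller's record is left untouched
        migrated["query"] = migrated.pop("text")
        return migrated
    return record


old_style = {"text": "what is uv?", "positive": "uv is a Python package manager."}
new_style = rename_text_to_query(old_style)
```

Renaming the fields in the data once, as sketched here, is an alternative to passing --column_rename_mapping on each run; which is more convenient likely depends on whether the dataset is shared with other tools.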
Migration guide: https://github.com/SeanLee97/AnglE/blob/main/MIGRATION_GUIDE.md
Full Changelog: v0.5.6...v0.6.0