Clustering Generation Transformer
Clustering-based DeepSpeed implementation for reward.
- Input
  - 6959-dim embedding
- Encoder
  - 6 x Transformer layers with 28 heads each
- Output
  - BLEU projection
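
As a quick sanity check on the dimensions listed above, the following minimal sketch records them in a small config object and computes how the embedding width would split across attention heads. `ModelConfig` and `head_split` are hypothetical names introduced here for illustration; they are not part of the original implementation, and the sketch assumes the 28 heads operate directly on the 6959-dim embedding.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the values given in the architecture list."""
    embed_dim: int = 6959   # input embedding width (from the list)
    num_layers: int = 6     # number of encoder Transformer layers (from the list)
    num_heads: int = 28     # attention heads per layer (from the list)


def head_split(cfg: ModelConfig) -> Tuple[int, int]:
    """Return (per-head width, leftover dims) for an even split across heads."""
    return divmod(cfg.embed_dim, cfg.num_heads)


cfg = ModelConfig()
head_dim, leftover = head_split(cfg)
print(f"per-head width: {head_dim}, leftover dims: {leftover}")
# prints "per-head width: 248, leftover dims: 15"
```

The nonzero leftover means 6959 is not evenly divisible by 28. Attention layers that require an even split (PyTorch's `nn.MultiheadAttention`, for example) would reject this pair of values, so the model presumably projects to a different width before attention or splits heads in some custom way; the summary does not say which.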
Training config
optimizer=Adam, lr=0.809, scheduler=cyclic, warmup=424
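
One possible reading of the schedule in the config line is sketched below: a linear warmup to the configured learning rate, followed by a triangular cycle. This is an illustration under stated assumptions, not the implementation's actual scheduler. It assumes `warmup=424` counts optimizer steps and that `lr=0.809` is the peak value; `cycle_len` and `base_lr` are hypothetical parameters that the config does not specify.

```python
def lr_at(step: int,
          peak_lr: float = 0.809,    # lr from the config line, assumed to be the peak
          warmup: int = 424,         # warmup from the config line, assumed to be steps
          cycle_len: int = 2000,     # hypothetical: cycle length is not given
          base_lr: float = 0.0) -> float:  # hypothetical: lower bound is not given
    """Linear warmup to peak_lr, then a triangular cycle between peak_lr and base_lr."""
    if step < warmup:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup
    # Position within the current cycle, in [0, 1).
    pos = ((step - warmup) % cycle_len) / cycle_len
    # Triangle wave: 1 at the cycle start, 0 at the midpoint, rising back toward 1.
    tri = abs(2.0 * pos - 1.0)
    return base_lr + (peak_lr - base_lr) * tri


for s in (0, 423, 424, 1424, 2424):
    print(s, round(lr_at(s), 6))
```

With these assumed defaults the rate reaches 0.809 at the end of warmup (step 423), falls to `base_lr` halfway through each cycle (step 1424), and returns to the peak at the start of the next cycle (step 2424).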