@theevildays What were you hoping to use 32k for?
Realize that these transformer/attention architectures are quadratic in time and memory with the sequence length (prompt plus generated tokens), since attention compares every token against every other one. So it’s more of a technology bottleneck, and these servers don’t grow on trees!
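If it helps, here’s a minimal sketch of where the quadratic term comes from (plain NumPy, toy single-head scaled dot-product attention, made-up dimensions, no batching or masking): the score matrix has one entry per pair of tokens, so it’s n × n.

```python
import numpy as np

# Toy single-head scaled dot-product attention. The (n, n) score
# matrix is the quadratic part: one entry per pair of tokens.
def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # shape (n, n) -> O(n^2)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)        # softmax over each row
    return w @ v                         # shape (n, d)

n, d = 512, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
out = attention(q, k, v)                 # fine at n = 512

# Scale n up and the score matrix alone gets ugly, per head, per layer:
for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens -> {n*n:>13,} score entries "
          f"(~{n*n*4/2**30:.1f} GiB in fp32)")
```

At 32k tokens that’s roughly a billion score entries, about 4 GiB in fp32, for a single head in a single layer, which is why long contexts are so expensive to serve.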