Shout out to TabbyAPI - it's by far the best ExLlamaV2 server I've tried
I usually use GGUFs with Ollama or occasionally Text Generation webUI, but recently I've been working with some larger context sizes and found that the lack of a quantised KV cache in Ollama was getting in my way.
I tried vLLM and mistral-rs (and failed to even get TensorRT-LLM going - that thing is a nightmare), but ended up spending more time trying to configure the model serving than I did actually using the models.
Then I stumbled across TabbyAPI (https://github.com/theroyallab/tabbyAPI) and its model loader (https://github.com/theroyallab/ST-tabbyAPI-loader), and I'm really impressed. It "just worked": no messing around with bespoke model config files, no downloading and requantising raw models, and no crashing or erroring out with less-than-helpful error messages.
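For anyone wondering what "just worked" looks like in practice: TabbyAPI serves an OpenAI-compatible API, so the stock openai Python client talks to it directly. Here's a minimal sketch, assuming the default localhost:5000 endpoint and the API key TabbyAPI generates in api_tokens.yml - adjust both for your setup:

```python
# Minimal sketch: chatting with TabbyAPI through its OpenAI-compatible API.
# Assumptions: server on localhost:5000 (the default on my setup) and the key
# from the api_tokens.yml TabbyAPI generates - substitute your own values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # TabbyAPI's OpenAI-compatible endpoint
    api_key="YOUR_TABBY_API_KEY",         # from api_tokens.yml
)

response = client.chat.completions.create(
    model="whatever-is-loaded",  # TabbyAPI answers with whichever model is currently loaded
    messages=[{"role": "user", "content": "Summarise this long document for me..."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

That's the whole integration - anything that already speaks the OpenAI API (SillyTavern, scripts, etc.) points at it with a base URL change.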
Downloading and using exl2 models is just as easy as dealing with GGUFs, and inference performance is quite impressive. My 1x RTX 3090 + 2x A4000 setup (although only one A4000 was in use) runs Command-R-Plus 2.25bpw with a 32K context and a 64K KV cache (Q4_0) at a very respectable 55 tk/s (and a whopping 508 tk/s for the prompt) while only using 28GB of vRAM - still a lot for most folks, but impressive for the size and features of the model:
INFO: Metrics: 798 tokens generated in 14.65 seconds (Queue: 0.0 s, Process: 0 cached tokens and 67 new tokens at 507.95 T/s, Generate: 54.97 T/s, Context: 67 tokens)
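For reference, the context and cache settings above can also be set when loading a model over TabbyAPI's HTTP API rather than in config.yml. A rough sketch follows - the endpoint path and field names (max_seq_len, cache_size, cache_mode) are what I remember from the docs and config.yml, so treat them as assumptions and double-check the TabbyAPI wiki for your version:

```python
# Rough sketch: loading a model with a quantised KV cache via TabbyAPI's admin API,
# mirroring the Command-R-Plus settings above. Endpoint path and field names are
# assumptions from memory / config.yml - verify against the TabbyAPI docs.
import requests

ADMIN_KEY = "YOUR_TABBY_ADMIN_KEY"  # from api_tokens.yml

payload = {
    "name": "Command-R-Plus-2.25bpw-exl2",  # hypothetical model folder name
    "max_seq_len": 32768,   # 32K context
    "cache_size": 65536,    # 64K cache
    "cache_mode": "Q4",     # quantised KV cache
}

r = requests.post(
    "http://localhost:5000/v1/model/load",
    headers={"x-admin-key": ADMIN_KEY},
    json=payload,
    timeout=600,  # big models take a while to load
)
r.raise_for_status()
print(r.text)
```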
My only gripe thus far is that when you unload a model it doesn't fully free up the vRAM until you load another one, so if you want all your precious vRAM back for, say, Ollama, you need to restart the container. It's not a big deal though, as the container is very quick to start/stop.
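Since unloading alone doesn't hand the memory back, I just bounce the container. A quick sketch of the workaround, assuming Docker and a container named "tabbyapi" (the name is whatever you used in your compose file):

```python
# Workaround sketch: fully reclaim vRAM by restarting the TabbyAPI container.
# Assumes Docker and a container named "tabbyapi" - substitute your own
# container or compose service name.
import subprocess

def reclaim_vram(container: str = "tabbyapi") -> None:
    # A plain restart drops the loaded model and returns the vRAM to the host,
    # ready for Ollama (or whatever else) to grab it.
    subprocess.run(["docker", "restart", container], check=True)

if __name__ == "__main__":
    reclaim_vram()
```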
If you haven't checked out TabbyAPI I highly recommend it.