A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger that is unable to perform almost any real-world task because it’s been fed so heavily with almost exclusively synthetic data targeted towards improving benchmark performance.
Fun that's not my experience of Phi-2. I use it for non-creative context, but function calling, and I find as reliable as much bigger models (no fine-tuning just constraining JSON + CoT). Phi-2 unquantized vs Mixtral Q8, Mixtral is not definitely better but much slower and RAM-hungry.
What prompts/settings do you use for Phi-2? I found it completely unusable for my cases. It fails to follow basic instructions (I tried several instruction-following finetunes as well, in addition to the base model), and it's been mostly like a random garbage generator for me. With Llama.cpp, constrained to JSON, it also often hangs because it fails to find continuations which satisfy the JSON grammar.
I'm building a system which has many different passes (~15 so far). Almost every pass is a LLM invocation, which takes time. My original idea was to use a smaller model, such as Phi-2, as a gateway in front of all those passes: I'd describe which pass does what, and then ask Phi-2 to list the passes which are relevant for the user query (I called it "pass masking"). That would save a lot of time and collapse 15 steps to 2-3 steps on average. In fact, my Solar 10.7B model does it pretty well, but it takes 7 seconds for the masking pass to work on my GPU. Phi-2 would finish in ~1 second. However, I'm really struggling with Phi-2: it fails to reason (what's relevant and what's not), unlike Solar, and it also refuses to follow the output format (so that I could parse the output programmatically and disable the irrelevant passes). Again, my proof of concept works with Solar, and fails spectacularly with Phi-2.
> You are a helpful assistant to 'User'. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'. 'System' will give you data. Do not respond as 'System'. Allow yourself inner thoughts as 'Thoughts'.
and then I constrain its answers to Thoughts: [^\n]* and Assistant: <JSON schema>, and I have two shots included in the prompt.
I haven't been able to get anything useful out of Phi-2 in llama.cpp (but I only tried quantized models). I use python/huggingface's transformers lib instead.
An update on my endeavour: so, model switching is very costly under llama.cpp (I have to switch between Llama and Phi2 because my GPU has low amounts of VRAM). And this switch (reloading the weights into VRAM) defeats the whole purpose of the optimization. Having only Llama on GPU without reloading takes less time than if I'd use Llama+Phi2. And Phi2 alone is pretty bad as a general purpose LLM. So I'm quite disappointed.
I recently upgraded to AM5 and as I have an AMD GPU I'm using llama.cpp on CPU only and I was positively surprised by how fast it generate stuff. I don't have the case of massive workloads so YMMV.
Hear hear! I don't understand why it has persistent mindshare, it's not even trained for chat. Meanwhile StableLM 3B runs RAG in my browser, on my iPhone, on my Pixel ..
Idea is usage-based charging for non-local and a $5/month sub for syncing.
keep an eye on @jpohhhh on Twitter if you're interested
now that I got it on web, I'm hoping to at least get a PoC up soon. I've open-sourced the consitutent parts as FONNX and FLLAMA, Flutter libraries that work on all platforms. FONNX has embeddings, FLLAMA has llama.
[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...