
Came here to post the same thing for Phi-2:

  +-------------+----------+-------------+
  | Benchmark   | Gemma 2B | Phi-2 2.7B  |
  +-------------+----------+-------------+
  | MMLU        |   42.3   |     56.7    |
  | MBPP        |   29.2   |     59.1    |
  | BoolQ       |   69.4   |     83.3    |
  +-------------+----------+-------------+

[0] https://www.kaggle.com/models/google/gemma

[1] https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...



A caveat: my impression of Phi-2, based on my own use and others’ experiences online, is that these benchmarks do not remotely resemble reality. The model is a paper tiger, unable to perform almost any real-world task, because it has been trained so heavily on almost exclusively synthetic data targeted at improving benchmark performance.


Funny, that's not my experience with Phi-2. I use it not for creative contexts but for function calling, and I find it as reliable as much bigger models (no fine-tuning, just constrained JSON + CoT). Comparing Phi-2 unquantized against Mixtral Q8, Mixtral is not definitively better, but it is much slower and more RAM-hungry.


What prompts/settings do you use for Phi-2? I found it completely unusable for my use cases. It fails to follow basic instructions (I tried several instruction-following finetunes as well, in addition to the base model), and it's been mostly a random garbage generator for me. With llama.cpp, constrained to JSON, it also often hangs because it fails to find continuations that satisfy the JSON grammar.

I'm building a system with many different passes (~15 so far). Almost every pass is an LLM invocation, which takes time. My original idea was to use a smaller model, such as Phi-2, as a gateway in front of all those passes: I'd describe what each pass does, and then ask Phi-2 to list the passes relevant to the user query (I call it "pass masking"). That would save a lot of time and collapse 15 steps into 2-3 steps on average.

In fact, my Solar 10.7B model does this pretty well, but the masking pass takes 7 seconds on my GPU; Phi-2 would finish in ~1 second. However, I'm really struggling with Phi-2: unlike Solar, it fails to reason about what's relevant and what's not, and it also refuses to follow the output format (which I need so I can parse the output programmatically and disable the irrelevant passes). Again, my proof of concept works with Solar and fails spectacularly with Phi-2.
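The gateway step above can be sketched in a few lines of Python. This is my illustration, not the commenter's code: the pass names and descriptions are made up, and the gateway model's reply is shown as a plain string standing in for an actual LLM call.

```python
# Sketch of "pass masking": a small gateway model is shown the pass
# descriptions plus the user query and replies with the names of the
# relevant passes; only those passes are then run.
# Pass names/descriptions below are hypothetical examples.
PASSES = {
    "spellcheck": "fix spelling in the user query",
    "code_gen": "generate code when the user asks for a program",
    "summarize": "summarize long documents",
}

def mask_passes(gateway_reply: str) -> list[str]:
    # Expect the gateway model to answer with a comma-separated list of
    # pass names; keep only names we actually know, in pipeline order.
    wanted = {name.strip() for name in gateway_reply.split(",")}
    return [name for name in PASSES if name in wanted]

# e.g. the gateway model replied "code_gen, summarize" for some query:
enabled = mask_passes("code_gen, summarize")
# enabled == ["code_gen", "summarize"]
```

Anything the model hallucinates that isn't a known pass name is silently dropped, so a slightly sloppy reply still yields a usable mask.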


My non-domain-specific prompt is:

> You are a helpful assistant to 'User'. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'. 'System' will give you data. Do not respond as 'System'. Allow yourself inner thoughts as 'Thoughts'.

and then I constrain its answers to Thoughts: [^\n]* followed by Assistant: <JSON schema>, with two shots included in the prompt.
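The constrained two-line shape can also be validated and parsed programmatically. A minimal sketch, assuming that shape; the regex, function name, and sample reply are mine, not the commenter's actual grammar or schema:

```python
import json
import re

# The shape the grammar enforces: a free-text "Thoughts:" line, then an
# "Assistant:" line whose payload must parse as JSON.
PATTERN = re.compile(
    r"^Thoughts: (?P<thoughts>[^\n]*)\nAssistant: (?P<payload>.+)$",
    re.DOTALL,  # allow the JSON payload itself to span multiple lines
)

def parse_constrained(reply: str):
    m = PATTERN.match(reply)
    if m is None:
        raise ValueError("reply does not match the Thoughts/Assistant shape")
    return m["thoughts"], json.loads(m["payload"])

thoughts, data = parse_constrained(
    'Thoughts: user wants the weather tool\n'
    'Assistant: {"tool": "weather", "city": "Oslo"}'
)
# thoughts == "user wants the weather tool"; data["tool"] == "weather"
```

In a real setup the constraint would be enforced at generation time (e.g. via a llama.cpp grammar), with a parser like this as a final sanity check.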

I haven't been able to get anything useful out of Phi-2 in llama.cpp (though I only tried quantized models). I use Python and Hugging Face's transformers library instead.


Interesting. I've had no success at all using any of the Phi-2 models.


An update on my endeavour: model switching is very costly under llama.cpp (I have to switch between Llama and Phi-2 because my GPU has little VRAM), and that switch (reloading the weights into VRAM) defeats the whole purpose of the optimization. Keeping only Llama on the GPU, with no reloading, takes less time than using Llama + Phi-2. And Phi-2 alone is pretty bad as a general-purpose LLM. So I'm quite disappointed.


I recently upgraded to AM5, and since I have an AMD GPU I'm using llama.cpp on CPU only. I was positively surprised by how fast it generates output. I don't have massive workloads, so YMMV.


Hear, hear! I don't understand why it has persistent mindshare; it's not even trained for chat. Meanwhile, StableLM 3B runs RAG in my browser, on my iPhone, on my Pixel...


How have you been using RAG in your browser/on your phones?


To be released, someday [sobs in engineer]

The idea is usage-based charging for non-local inference and a $5/month subscription for syncing.

Keep an eye on @jpohhhh on Twitter if you're interested.

Now that I have it working on the web, I'm hoping to at least get a PoC up soon. I've open-sourced the constituent parts as FONNX and FLLAMA, Flutter libraries that work on all platforms. FONNX has embeddings, FLLAMA has llama.

https://github.com/Telosnex/fonnx

https://github.com/Telosnex/fllama


I tested it for an offline autocompletion tool and it was hilariously bad.


Really looking forward to the day someone puts out an open model which outperforms Flan-T5 on BoolQ.



