Hacker News

We already got an LLM-generated meta-review that was very clearly just a summarization of the reviews. There were some pretty egregious cases of borderline hallucinated remarks. This was ACL Rolling Review, so basically the most prestigious NLP venue, and the editors told us to suck it up. Very disappointing, and I genuinely worry about the state of science and how this will affect people who rely on scientometric criteria.


This is a problem in general, but the unmitigated disaster that is ARR (ACL Rolling Review) doesn't help.

On the one hand, if you submit to a conference, you are forced to "volunteer" for that cycle. Which is a good idea from a "justice" point of view, but it's also a sure way of generating unmotivated reviewers. Not only because a person might be unmotivated in general, but because the (rather short) reviewing period may coincide with your vacation (this happened to many people with EMNLP, whose reviewing period was in the summer) and you're not given any alternative but to "volunteer" and deal with it.

On the other hand, even regular reviewers aren't treated too well. Lately they implemented a minimum max load of 4 (which can push people towards choosing uncomfortable loads; in fact, that seems to be the purpose), and loads aren't even respected (IIRC there have been emails to the tune of "some people set a max load but we got a lot of submissions, so you may get more submissions than your load, lololol").

While I don't condone using LLMs for reviewing and I would never do such a thing, I am not too surprised that these things happen given that ARR makes the already often thankless job of reviewing even more annoying.

To be honest, lately I have gotten better-quality reviews from the supposedly second-tier conferences that haven't joined ARR (e.g. this year's LREC-COLING) than from ARR itself. Although the sample size is very small, of course.


Most conferences have been flooded with submissions, and ACL is no exception.

A consequence of that is that there aren't enough qualified reviewers available to review all these manuscripts.

Conference organizers might be keen to accept most of those who offer to volunteer, but clearly there is now a large pool of people who have never done this before and were never taught how to do it. Add some time pressure, and people will try out some tool, just because it exists.

GPT-generated documents have a particular tone that you can detect if you've played a bit with ChatGPT and have a feel for language. Such reviews should be kicked out. I would be interested to see this review (anonymized if you like, by taking out bits that reveal too narrowly what it's about).

The "rolling" model of ARR is a pain, though, because instead of slaving for a month you feel like slaving (conducting scientific peer review free of charge = slave labor) all year round. Last month, I got contacted by a book editor to review a scientific book for $100. I told her I'm not going to read 350 pages to write two pages' worth of book review; to do this properly one would need two days, and I quoted my consulting day rate. On top of that, this email came in the vacation month of August. Of course, said person was never heard from again.


We had what we strongly suspect was an LLM-written review for NeurIPS. It was kind of subtle if you weren't looking carefully, and I can see how an AC might miss it. The suggestions for improvement weren't _wrong_, but the GPT response picked up on some extremely specific things in the paper that were mostly irrelevant (the other reviewers actually pointed out the odd typo and suggested small corrections or improvements to statements we'd made).

Pretty hard to combat. We just rebutted as if it were a real review - maybe it was - and hope that the chairs see it. Speaking to other folks, opinions are split over whether this sort of review should be flagged. I know some people who tried to query a review and it didn't help.

There were other small cues - the English was perfect, while the other reviewers made small slips indicative of non-native speakers. Another was the discrepancy between the tone of the review (generally very positive) and the middle-of-the-road rating and confidence. The structure of the review was very "The authors do X, Y, Z. This is important because A, B, C." and the reviewer didn't bother to fill out any of the other review sections (they just wrote single-word answers to all of them).

The kicker was actually putting our paper into 4o, asking it to write a review, and seeing the same keywords pop up.
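The keyword check described above could be sketched roughly like this. This is a heuristic, not a validated detector: generate your own LLM review of the paper, then measure how much of its distinctive vocabulary shows up in the suspect review. The stopword list, tokenization, and keyword count are all illustrative choices.

```python
# Rough sketch of a keyword-overlap check between two reviews.
# Purely heuristic; thresholds and tokenization are illustrative.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "are", "this", "that", "paper", "authors", "with", "for"}

def keywords(text: str, n: int = 20) -> set[str]:
    """Top-n most frequent non-stopword tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens
                     if t not in STOPWORDS and len(t) > 3)
    return {word for word, _ in counts.most_common(n)}

def keyword_overlap(review_a: str, review_b: str, n: int = 20) -> float:
    """Jaccard similarity of the two reviews' keyword sets (0.0 to 1.0)."""
    ka, kb = keywords(review_a, n), keywords(review_b, n)
    return len(ka & kb) / len(ka | kb) if (ka | kb) else 0.0
```

A high overlap between a suspect review and a freshly generated GPT review of the same paper is weak evidence on its own (both describe the same paper, after all), so it's at most one cue among the others mentioned in the thread.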


so basically the most prestigious NLP venue

I see "dogfooding" has now been taken to its natural conclusion.


> people who rely on scientometric criteria

Not defending LLM papers at all, but these people can go to hell. If "scientometrics" was ever a good idea, it certainly isn't anymore now that the measure has become the target. A long, carefully written, comprehensive paper is rated worse than many short, incremental, hastily written ones.


Well, given that the only part of reviewing that matters for tenure reviews is the "service" line, i.e. roughly a list of conferences at which the applicant reviewed or performed some other sort of service, this is hardly a surprise.

Right now there is no incentive to do a high-quality review unless the reviewer is intrinsically motivated.


With NeurIPS 2024 reviews going on right now, I'm sure that a whole lot of these kinds of reviews are being generated daily.


With the ICLR paper deadline coming up, I guess it's worth wargaming how GPT-4 would review my submission.


See my other post - we had exactly this for NeurIPS. It is definitely worth seeing what GPT says about your paper, if only because it's a free review. The criticisms it gave us weren't wrong per se; they were just weakly backed up, and it would still be up to a reviewer to judge how relevant they are. Every paper has downsides, but you need domain knowledge to judge whether something is a small issue or a killer. Amusingly, our LLM reviewer gave a much lower score than we got when we asked GPT to provide a rating (and also significantly lower than the other reviewers).

One example was that GPT took an explicit geographic location from a figure caption and used it as a reference point when suggesting improvements (along the lines of "location X is under-represented on this map"), I assume because it assigns a high degree of relevance to figures and the abstract when summarizing papers. I think you might be able to combat this by writing defensively - in our case we might have avoided it by saying "more information about geographic diversity may be found in X and the supplementary information".
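Wargaming your own submission this way is easy to script. A minimal sketch, assuming the `openai` Python package and an `OPENAI_API_KEY` in the environment; the model name and the prompt wording are illustrative, not a recommended review template.

```python
# Minimal sketch: ask a chat model to "review" your own draft before
# submission. Assumes the `openai` package and an API key; the prompt
# and model name below are illustrative placeholders.
REVIEW_PROMPT = (
    "You are a peer reviewer for an NLP conference. Review the paper "
    "below: summarize it, then list strengths, weaknesses, and "
    "questions for the authors, and give a score from 1 to 10.\n\n"
    "{paper}"
)

def build_review_prompt(paper_text: str) -> str:
    """Fill the review template with the paper's plain text."""
    return REVIEW_PROMPT.format(paper=paper_text)

def request_review(paper_text: str, model: str = "gpt-4o") -> str:
    """Send the paper to the API and return the generated review."""
    from openai import OpenAI  # imported lazily; needs a real API key
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_review_prompt(paper_text)}],
    )
    return resp.choices[0].message.content

# Usage (not run here; requires network access and a key):
#   review = request_review(open("paper.txt").read())
```

As the comment above notes, treat the output as a free but weakly grounded review: useful for spotting attack surface a lazy reviewer might hit, not as a real assessment.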


Better yet, generate some adversarial perturbations to the text (or an invisible prompt) to cause it to give you a perfect review!


Could you share it publicly or would you face adverse consequences?

If you can, please publish it, and maybe post it here on HN or on Reddit.



