[ my public key: https://keybase.io/henrycee; my proof: https://keybase.io/henrycee/sigs/EBGzBLcJMuCMvo8jU00sPExZtR-bAMKF4hxnHO_wQDA ]
It seems to me that these comments stem from the DeepMind results from last summer[0] and February this year[1]. As I understand it, the models they're using for these tasks are very specialised to the task and also only accept formal language as input (i.e. not a textual or visual representation that a large multi-modal model could use).
I was reading through the Proof or Bluff paper[2] this morning and, while I don't think it's been reproduced yet, the authors found that none of the tested SOTA LLMs made meaningful progress on the questions in their test set (none scored over 5%). That matches my limited experience using LLMs for similar tasks. Needless to say, I've not heard a peep about this paper from my AI safety friends.
My question is: how should I interpret the above? Maybe it's too cynical, but my current thesis is that the DeepMind results are convenient headline-grabbers for the AI safety crowd, who are conflating the performance of a task-specific model with that of more general LLMs in order to make an unsubstantiated claim about progress in generalisable AI. Is that reasonable? What am I missing?
If the authors of Proof or Bluff are in here, I'd also like to say thanks for doing the work on this. I can imagine work like this isn't the sexiest, but it is so refreshing to see people take the time and care to generate some hard data about how good these models actually are. As someone considering a career switch at the moment, data like this is really useful context when trying to evaluate what the next few decades might look like.
[0] https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level
[1] https://techcrunch.com/2025/02/07/deepmind-claims-its-ai-performs-better-than-international-mathematical-olympiad-gold-medalists
[2] https://arxiv.org/abs/2503.21934v1
I'm the CTO of an early-stage SaaS company in the UK and we're starting operations in Nigeria. We are totally focused on building out the team there with a dedicated hiring function (the first thing we learned about doing business in Nigeria is that whatever you're doing, you _have_ to know someone to get it done right), but we'd like to get moving fast and start hiring some Nigerian tech talent.
My question is: outside of Andela/Toptal/etc. (which we're trying to avoid), could anyone recommend any platforms/job boards/other approaches for tech hiring that might be good places to start? LinkedIn and Indeed look fairly decent, but another thing we've learned about Nigeria is you don't know unless you know.
Thanks, and I'm happy to chat to anyone else doing or looking to do business in Nigeria; would love to hear from you.
Henry (henry@nanumo.com)
For whatever reason, retries often aren't available or practical, and even when they are, I've seen (and, er, also written) bugs where the webhook was consumed incorrectly but we returned a 204 anyway; so even with retries available, the sender would never have issued one.
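To make the failure mode above concrete: the bug is acking (returning 2xx) before the event is durably captured, so a parsing or handling bug silently loses the event. A minimal sketch of one common mitigation, persist-then-ack with deferred processing (the handler names and the in-memory queue standing in for a durable store are my own illustration, not any particular product's API):

```python
import json
import queue

# Stand-in for a durable store (a DB table, SQS, Kafka, etc.).
inbox = queue.Queue()

def receive_webhook(raw_body: bytes) -> int:
    """Persist the raw payload first, ack second.

    No parsing happens here, so a consumption bug can no longer
    be masked by a premature 204.
    """
    try:
        inbox.put(raw_body)  # the durable write must succeed before we ack
    except Exception:
        return 500           # gives the sender's retry logic a chance to fire
    return 204

def process_next() -> dict:
    """Worker side: parse and handle later.

    If this crashes, the raw event is still in the store and can be
    replayed, independent of the sender's retry behaviour.
    """
    raw = inbox.get()
    return json.loads(raw)   # any consumption bug now surfaces out-of-band
```

The point of the split is that the 204 only ever attests to "I stored your bytes", not "I handled them correctly", which is the property the buggy handlers above were falsely claiming.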
I've seen a few products for sending webhooks (e.g. Diahook, Gowebhooks) getting discussed over the past few weeks, but nothing on the receiving side. Building an HA webhook consumption system from scratch seems like reinventing the wheel for my current project, and I was wondering if anyone has any strategies/products they could recommend?