lxgr 5 days ago

As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between

- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)

- o3-mini (web search, CoT, canvas, but no image generation)

- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)

- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)

- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)

- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)

Why do I have to figure all of this out myself?

throwup238 5 days ago

> - Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)

Same here, which is a real shame. I've switched to Deep Research with Gemini 2.5 Pro over the last few days, where paid users have a 20/day limit instead of 10/month, and it's been great, especially since Gemini now seems to browse 10x more pages than OpenAI's Deep Research (on the order of 200-400 pages versus 20-40).

The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.

qingcharles 5 days ago

Wow, I wondered what the limit was. I never checked; I've just been using it hesitantly, since I burn through OpenAI's limit as soon as it resets. Thanks for the clarity.

I'm all-in on Deep Research. It can conduct research, in minutes, on niche historical topics that have no central articles; those used to take me days or weeks to delve into.

namaria 5 days ago

I like Deep Research, but as a historian I have to tell you: I've used it on history themes to calibrate my expectations, and it's a nice tool, but... it can easily brush over nuanced discussions and just return folk wisdom from blogs.

What I love most about history is that it has lots of irreducible complexity, and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.

fullofbees 4 days ago

I read Being and Time recently, and it has a load of concepts that are defined iteratively. There's a lot wrong with how it's written, but it's an unfinished book written 100 years ago, so I can't complain too much.

Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful. But, to be fair, I can't really fault it for being a bit useless with a very difficult-to-comprehend text, one with several competing styles of reading whose proponents are each convinced they are correct.

But I started to notice a pattern of it pulling answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a university's server that uses concepts from the book to ground qualitative research. That's fine, and practical explications are often useful ways into a dense concept, but it's a really weird place to be the first academic source. It'll also draw on Reddit a weird amount, or somehow pull a page of definitions from a handout for some university tutorial. And it won't default to the peer-reviewed, free philosophy encyclopedias that are online and well known.

It's just weird. I was just using it to try to reinforce my actual reading of the text, but I came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information with learning about something.

*it's just what I have access to.

laggyluke 4 days ago

If you're asking an LLM about a particular text, even if it's a well-known text, you might get significantly better results if you provide said text as part of your prompt (context) instead of asking a model to "recall it from memory".

So something like this: "Here's a PDF file containing Being and Time. Please explain the significance of anxiety (Angst) in the uncovering of Being."
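To make the idea concrete, here's a minimal sketch of the same approach via an API, where the source text travels inside the prompt rather than being recalled from the model's weights. The helper, excerpt, and system instruction are all hypothetical; in the ChatGPT UI you'd simply attach the file.

```python
# Sketch: ground the question in the actual text rather than the model's memory.
# The excerpt placeholder would be the relevant chapter (or the whole PDF,
# attached via the UI); the helper name is made up for illustration.

def build_grounded_prompt(source_text: str, question: str) -> list[dict]:
    """Build a chat message list that carries the source text in context."""
    return [
        {
            "role": "system",
            "content": "Answer only from the provided text. "
                       "Quote the passages that support your answer.",
        },
        {
            "role": "user",
            "content": f"Source text:\n{source_text}\n\nQuestion: {question}",
        },
    ]

messages = build_grounded_prompt(
    source_text="<relevant pages of Being and Time here>",
    question="Explain the significance of anxiety (Angst) in the uncovering of Being.",
)

# With e.g. the OpenAI Python SDK, this would then be sent as:
# client.chat.completions.create(model="gpt-4o", messages=messages)
```

The difference is that the model now quotes and paraphrases text it can literally see, instead of reconstructing it from training data.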

tekacs 5 days ago

When I've wanted it to not do things like this, I've had good luck directing it to... not look at those sources.

For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)

namaria 5 days ago

Deep Research is pretty good for current news stories. I've had it analyze some recent legal developments in a European country, and it gave me a great overview.

iamacyborg 4 days ago

That use case seems pretty self-defeating, since a good news source will usually at least try to validate first-party materials, which an LLM cannot do.

taurath 5 days ago

LLMs seem fantastic at generalizing broad thought and not great at outliers. They sort of smooth over the knowledge curve confidently. It's a bit like in psychology, where only CBT is widely accepted even though many other methodologies are much more effective for individuals, just not at the population level.

antman 5 days ago

Interesting use case. My problem is that for niche subjects the crawled pages probably haven't captured the information, and the response becomes irrelevant. Perhaps Gemini will produce better results simply because it takes many more pages into account.

chrisshroba 5 days ago

I also like Perplexity's 3/day limit! If I use them up (which I almost never do), I can just refresh the next day.

behnamoh 5 days ago

I've only ever had to use DeepResearch for academic literature review. What do you guys use it for which hits your quotas so quickly?

jml78 4 days ago

I use it for mundane shit that I don’t want to spend hours doing.

My son and I go to a lot of concerts and collect patches. Unfortunately we started collecting long after we started going to concerts.

I had a list of about 30 bands I wanted patches for.

I was able to give precise instructions on what I wanted. Deep research came back with direct links for every patch I wanted.

It took me two minutes to write up the prompt and it did all the heavy lifting.

sunnybeetroot 4 days ago

Write a comparison between X and Y

resters 5 days ago

I use them as follows:

o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.

deepseek R1: anything where I want high-quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 because of their fast responses and reasoning. I think R1 is the most creative, yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.

4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.

o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.

claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.

gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.

Perplexity: discontinued subscription once the search functionality in other models improved.

I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.

rushingcreek 5 days ago

Phind was fine-tuned specifically to produce inline Mermaid diagrams for technical questions (I'm the founder).

underlines 4 days ago

I really loved Phind and always think of it as the OG perplexity / RAG search engine.

Sadly, I stopped my subscription when you removed the ability to weight my own domains...

Otherwise, the fine-tuned output format for technical questions is great, with the options, the pros/cons, and the Mermaid diagrams. Just way better for technical searches than what the generic services can provide.

bsenftner 4 days ago

Have you been interviewed anywhere? Curious to read your story.

shortcord 5 days ago

Gemini 2.5 Pro is quite good at code.

Has become my go-to for use in Cursor. Claude 3.7 needs to be restrained too much.

artdigital 5 days ago

Same here, 2.5 Pro is very good at coding. But it's also cocky and blames everything but itself when something isn't working. E.g. "the linter must be wrong, you should reinstall it", "looks to be a problem with the Go compiler", "this function HAS to exist, that's weird that we're getting an error".

And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.

But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7

valenterry 5 days ago

Weird. For me, Sonnet 3.7 is much more focused and, in particular, works much better at finding the places that need changing and at using other tooling. I guess the integration in Cursor is just much better and more mature.

behnamoh 5 days ago

This. Sonnet 3.7 is a wild horse. Gemini 2.5 Pro is like a 33-year-old expert. o1 feels like a mature, senior colleague.

benhurmarcel 4 days ago

I find that Gemini 2.5 Pro tends to produce working but over-complicated code more often than Claude 3.7.

torginus 4 days ago

Which might be a side-effect of the reasoning.

In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.

In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.

motoboi 5 days ago

You probably know this but it can already generate accurate diagrams. Just ask for the output in a diagram language like mermaid or graphviz
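For instance, a response in Graphviz's DOT language is plain text, so it can be sanity-checked and then rendered deterministically with the standard `dot` tool. The DOT below is just the sort of thing a model might return for "draw my ingest pipeline"; the node names are hypothetical.

```python
# A hypothetical model response in DOT (Graphviz) form. Because it's plain
# text, it can be rendered locally with e.g.:  dot -Tpng out.dot -o out.png
dot_source = """
digraph pipeline {
    rankdir=LR;
    ingest -> transform -> store;
    store -> api [label="serves"];
}
"""

# Light sanity checks before handing the text to the renderer: braces must
# balance and the graph header must be present.
assert dot_source.count("{") == dot_source.count("}")
assert "digraph" in dot_source
```

The layout itself is then computed by Graphviz, not hallucinated by the model, which is why asking for a diagram language tends to beat asking for a picture.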

bangaladore 5 days ago

My experience is that it often produces terrible diagrams: things clearly overlap, lines make no sense. I'm not surprised, as if you told me to lay out a diagram in XML/YAML, there would be obvious mistakes and layout issues too.

I'm not really certain a text output model can ever do well here.

resters 5 days ago

FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.

bangaladore 5 days ago

Agreed. I'm sure a service like this already exists (or could easily be built), where the workflow is something like:

1. User provides information

2. LLM generates structured output for whatever modeling language

3. The same or another multimodal LLM reviews the generated graph for styling/positioning issues and ensures it matches the user's request.

4. LLM generates structured output based on the feedback.

5. etc...

But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.

behnamoh 5 days ago

I had a latex tikz diagram problem which sonnet 3.7 couldn't handle even after 10 attempts. Gemini 2.5 Pro solved it on the second try.

gunalx 5 days ago

Had the same experience: o3-mini failing miserably, Claude 3.7 as well, but Gemini 2.5 Pro solved it perfectly. (Image of a diagram, without source, to a TikZ diagram.)

resters 5 days ago

I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.

antman 5 days ago

PlantUML (activity) diagrams are my go-to.

wavewrangler 5 days ago

You probably know this and are looking for consistency, but a little trick I use is to feed it the original data of what I need as a diagram and have it re-imagine it as an image "ready for print". Not native, but still a time saver, and it handles unstructured data surprisingly well. Naive, yes; native, not yet. Be sure to double check, triple check as always; give it the ol' OCD treatment.

barrkel 4 days ago

Gemini 2.5 is very good. Since you have to wait for reasoning tokens, it takes longer to come back, but the responses are high quality IME.

czk 5 days ago

re: "grok-3 is r1 with mods" -- do you mean you believe they distilled DeepSeek R1? That was my assumption as well; though I thought it more jokingly at first, it would make a lot of sense. I actually enjoy Grok 3 quite a lot; it has some of the most entertaining thinking traces.

StephenAshmore 5 days ago

> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers

Ha! That's the funniest and best description of 4.5 I've seen.

cafeinux 5 days ago

> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)

Is that an LLM hallucination?

cheschire 5 days ago

It’s a tongue in cheek reference to how audiophiles claim to hear differences in audio quality.

SadTrombone 5 days ago

Pretty dark times on HN, when a silly (and obvious) joke gets someone labeled as AI.

netdevphoenix 4 days ago

Obvious to you, perhaps, but not to everyone. Self-awareness goes a long way.

lxgr 5 days ago

Possibly, but it's running on 100% wetware, I promise!

divan 4 days ago

Looks like NDA violation )

SweetSoftPillow 5 days ago

Switch to Gemini 2.5 Pro, and be happy. It's better in every aspect.

exadeci 3 days ago

It somehow isn't; I've been asking it the same questions as ChatGPT, and the answers feel off.

miroljub 5 days ago

Warning to potential users: it's Google.

tomalbrc 5 days ago

Not sure how or why OpenAI would be any better?

miroljub 4 days ago

It's not. It's closed source. But Google is still the worst when it comes to privacy.

I prefer to use only open source models that don't have the possibility to share my data with a third party.

jrk 4 days ago

The notion that Google is worse at carefully managing PII than a Wild West place like OpenAI (or Meta, or almost any major alternative) is…not an accurate characterization, in my experience. Ad tech companies (and AI companies) obsessively capture data, but Google internally has always been equally obsessive about isolating and protecting that data. Almost no one can touch it; access is highly restricted and carefully managed; anything that even smells adjacent to ML on personal data has gotten high-level employees fired.

Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.

cr4zy 5 days ago

For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.

rockwotj 5 days ago

My thought is that this model release is driven by this year's agentic app push. To my knowledge, all the big agentic apps (Cursor, Bolt, Shortwave) use Claude 3.7 because it's so much better at instruction following and tool calling than GPT-4o, so this model feels like GPT-4o (or a distilled 4.5?) with some post-training focused on what these agentic workloads need most.

anshumankmr 5 days ago

Hey, also try out Monday; it did something pretty cool. It's a version of 4o that switches between reasoning and plain token generation on the fly. My guess is that's what GPT V will be.

lucaskd 4 days ago

I'm also very curious about the limit for each model. I never thought about limits before upgrading my plan.

youssefabdelm 4 days ago

Disagree. It's really not complicated at all to me. Not sure why people make a big fuss over this. I don't want an AI automating which AI it chooses for me. I already know through lots of testing intuitively which one I want.

If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.

yousif_123123 4 days ago

I do like the vinyl and analog amplifiers. I certainly hear the warmth in this case.

xnx 4 days ago

This sounds like whole lot of mental overhead to avoid using Gemini.

guillaume8375 4 days ago

What do you mean when you say that 4o doesn’t have chain-of-thought?

fragmede 5 days ago

What's hilarious to me is that I asked ChatGPT about the model names and approaches, and it did a better job than they have.

chrisandchris 5 days ago

Just ask the first AI that comes to mind which one you could ask.

konart 5 days ago

Must be weird to not have an "AI router" in this case.