Hello Andrew,
I understand your concerns about invalid responses and would like to address a few points:
For the sake of this reply, I'll point out that I personally only adjust temperature, never topP, so I'll be referring to temperature throughout. The principle is the same either way, since both settings trade randomness against deterministic output.
First of all, the models have significantly improved over the past year.
I'm assuming you initially tested everything with GPT-3.5, but its coding quality was nowhere near that of the newer models we are using today. I believe your concerns about code quality no longer apply.
Increasing randomness, however, will never lead to more accurate responses; worse, it will always reduce accuracy even further.
As a prompt engineer, I’ve found that at a temperature of 0, incorrect code is usually due to my own prompt. GPT interprets prompts literally; if the code is wrong, it’s often because the prompt was unclear or incorrect, and correcting the prompt usually resolves the issue.
I’ve seen many cases of users getting bad code as a response, simply because GPT was excellent at following their poorly written instructions that were full of ambiguous language.
In other words, it’s not your job to cover for a user’s bad prompting; that’s the end user’s responsibility. You cannot, and should not, try to compensate for a (potentially) badly worded question by increasing randomness.
Adjusting temperature won’t fix prompt mistakes. Increased randomness may occasionally produce the desired code by chance, but that would be a rare exception.
A real example:
I recently encountered an issue where GPT slightly altered collection names, causing errors. This would not have happened at a temperature of 0, but it did happen here because of the higher randomness. Not ideal.
As OpenAI states:
“For most factual use cases, such as data extraction and truthful Q&A, a temperature of 0 is best.”
And that’s what we want, right?
We need truthful results, especially for tasks like writing aggregation pipelines.
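To make the difference concrete, here is a minimal sketch of the kind of call I mean, assuming the current OpenAI Python SDK; the model name and the example question are placeholders of my own, not anything Studio 3T actually uses:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, temperature: float) -> str:
    """One chat completion call; only the temperature varies."""
    response = client.chat.completions.create(
        model="gpt-4",            # placeholder; use whichever model is configured
        temperature=temperature,  # 0 = deterministic, higher = more random
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

question = "Write an aggregation pipeline that groups orders by customerId and sums total."

# At temperature 0, repeated calls return (virtually) identical pipelines.
print(ask(question, temperature=0))

# At higher temperatures, the same question can come back with different field
# names, stages, or even collection names from run to run.
print(ask(question, temperature=1))
```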
Actually, you could sidestep the whole “what would be the best setting” investigation entirely by letting us set our own custom overrides for settings like temperature and topP in the “AI Helper” configuration box. That would be ideal, and it would be an easy fix.
Additionally, while you’re at it, you could even let us choose the model of our preference, but that’s more of a nice-to-have for me personally. Customizable temperature and topP fields would suffice.
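To make the request concrete, something along these lines is all I’m asking for. The field names below are purely illustrative; no such setting exists in Studio 3T today:

```python
# Hypothetical per-connection overrides for the "AI Helper" configuration box.
# None of these fields exist today; this is just what I would like to be able to set.
ai_helper_overrides = {
    "model": "gpt-4",  # nice to have: let us pick the model
    "temperature": 0,  # deterministic, factual output
    "top_p": 1,        # leave nucleus sampling untouched
}
```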
PS:
I’d like to share one final note on prompt engineering:
Adding a starting instruction like “Answer from the perspective of a MongoDB expert” can significantly improve accuracy; this is part of prompt engineering and yields much better results. However, that advantage is negated when randomness is increased. Increasing randomness is essentially telling GPT to “add some randomness to whatever you think is a good response,” which is the opposite of what we want and need here.
Great for writing a fiction story full of plot twists… but not so great for an aggregation pipeline… full of plot twists.
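Here is a minimal sketch of that kind of setup, again assuming the OpenAI Python SDK; the model name and the example request are illustrative, not the exact ones I use:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The starting instruction goes in as a system message; randomness stays off.
SYSTEM_PROMPT = "Answer from the perspective of a MongoDB expert."

def generate_pipeline(request: str) -> str:
    """Return an aggregation pipeline for the given natural-language request."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        temperature=0,  # deterministic output for factual tasks
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": request},
        ],
    )
    return response.choices[0].message.content

print(generate_pipeline(
    "Group the orders collection by customerId and sum the total field."
))
```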
As it stands, I’ve been forced to stop using your AI implementation; instead, I’m using my own GPT setup (with temperature set to 0) to generate aggregation pipelines and copy-pasting them into 3T. Their quality is far beyond what I can produce through your interface.
Note that I’m using the exact same API and the same model*, just with different settings and a slightly different default prompt.
I assure you that GPT is excellent at writing MongoDB code, except when I’m using the Studio 3T AI implementation.
I hope this additional information is helpful to you.
I’m editing my post here with another recent finding:
* Whoops, I just noticed Studio 3T is still hooked up to GPT-3.5. That’s another major issue.
Please be aware that GPT-3.5 was not properly trained to follow instructions, especially tool and function calls.
And its reasoning capabilities are not sufficient compared to GPT-4 and beyond.