Hello Andrew,
I understand your concerns about invalid responses and would like to address a few points:
For the sake of this reply, I'll point out that I personally only adjust temperature, never topP, so I'll be referring to temperature throughout. The principle is the same either way, since both settings trade randomness against deterministic output.
First of all, the models have significantly improved over the past year.
I'm assuming you initially tested everything with GPT-3.5, but its coding quality was nowhere near that of the newer models we are using today. I believe your concerns about code quality no longer apply.
Increasing randomness, however, will never lead to more accurate responses; worse, it will always reduce accuracy even further.
As a prompt engineer, I’ve found that at a temperature of 0, incorrect code is usually due to my own prompt. GPT interprets prompts literally; if the code is wrong, it’s often because the prompt was unclear or incorrect, and correcting the prompt usually resolves the issue.
I’ve seen many cases of users getting bad code as a response, simply because GPT was excellent at following their poorly written instructions that were full of ambiguous language.
In other words, it’s not your job to cover for a user’s bad prompting; that’s the end user’s responsibility. You cannot, and should not, try to compensate for a (potentially) badly worded question by increasing randomness.
Adjusting temperature won’t fix prompt mistakes. Increased randomness may occasionally produce the desired code by chance, but that would be a rare exception.
A real example:
I recently encountered an issue where GPT slightly altered collection names, causing errors. This would not have happened at a temperature of 0, but it did happen here because of the higher randomness. Not ideal.
As OpenAI states:
“For most factual use cases, such as data extraction and truthful Q&A, a temperature of 0 is best.”
And that’s what we want, right?
We need truthful results, especially for tasks like writing aggregation pipelines.
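To make the difference concrete, here is a minimal sketch of the kind of call I mean, assuming the current OpenAI Python SDK; the model name and the example question are placeholders of my own, not anything Studio 3T actually uses:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, temperature: float) -> str:
    """One chat completion call; only the temperature varies."""
    response = client.chat.completions.create(
        model="gpt-4",            # placeholder; use whichever model is configured
        temperature=temperature,  # 0 = deterministic, higher = more random
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

question = "Write an aggregation pipeline that groups orders by customerId and sums total."

# At temperature 0, repeated calls return (virtually) identical pipelines.
print(ask(question, temperature=0))

# At higher temperatures, the same question can come back with different field
# names, stages, or even collection names from run to run.
print(ask(question, temperature=1))
```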
Actually, you could sidestep the whole “what would be the best setting” investigation entirely by letting us set our own custom overrides for settings like temperature and topP in the “AI Helper” configuration box. That would be ideal, and it would be an easy fix.
Additionally, while you’re at it, you could even let us choose the model of our preference, but that’s more of a nice-to-have for me personally. Customizable temperature and topP fields would suffice.
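To make the request concrete, something along these lines is all I’m asking for. The field names below are purely illustrative; no such setting exists in Studio 3T today:

```python
# Hypothetical per-connection overrides for the "AI Helper" configuration box.
# None of these fields exist today; this is just what I would like to be able to set.
ai_helper_overrides = {
    "model": "gpt-4",  # nice to have: let us pick the model
    "temperature": 0,  # deterministic, factual output
    "top_p": 1,        # leave nucleus sampling untouched
}
```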
PS:
I’d like to share one final note on prompt engineering:
Adding a starting instruction like “Answer from the perspective of a MongoDB expert” can significantly improve accuracy; this is part of prompt engineering and yields much better results. However, that advantage is negated when randomness is increased. Increasing randomness is essentially telling GPT to “add some randomness to whatever you think is a good response,” which is the opposite of what we want and need here.
Great for writing a fiction story full of plot twists… but not so great for an aggregation pipeline… full of plot twists.
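Here is a minimal sketch of that kind of setup, again assuming the OpenAI Python SDK; the model name and the example request are illustrative, not the exact ones I use:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The starting instruction goes in as a system message; randomness stays off.
SYSTEM_PROMPT = "Answer from the perspective of a MongoDB expert."

def generate_pipeline(request: str) -> str:
    """Return an aggregation pipeline for the given natural-language request."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        temperature=0,  # deterministic output for factual tasks
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": request},
        ],
    )
    return response.choices[0].message.content

print(generate_pipeline(
    "Group the orders collection by customerId and sum the total field."
))
```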
As it stands, I’ve been forced to stop using your AI implementation; instead, I’m using my own GPT setup (with temperature set to 0) to generate aggregation pipelines and copy-pasting them into 3T. Their quality is far beyond what I can produce through your interface.
Note that I’m using the exact same API and the same model*, just with different settings and a slightly different default prompt.
I assure you that GPT is excellent at writing MongoDB code, except when I’m using the Studio 3T AI implementation.
I hope this additional information is helpful to you.
I’m editing my post here with another recent finding:
* Whoops, I just noticed Studio 3T is still hooked up to GPT-3.5. That’s another major issue.
Please be aware that GPT-3.5 was not properly trained to follow instructions, especially tool and function calls.
And its reasoning capabilities are not sufficient compared to GPT-4 and beyond.