Like on ChatGPT, the LLM's code output can be run in an interpreter and the output observed. If there are errors, the model can try to rectify them based on the error message. This could be done via OpenAI's implementation, via a sandboxed JavaScript instance in the browser (no file read/write, no DOM access, etc.), or via something using Wolfram.
LLMs are not good at mathematical questions; they could delegate those computations to the interpreter instead.
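The generate-run-repair loop above can be sketched roughly as follows. This is a minimal illustration, not a real sandbox: it only isolates the code in a subprocess with a timeout, and `generate` is a hypothetical stand-in for the LLM call, not an actual API.

```python
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run code in a separate, isolated Python process with a timeout.
    (A real sandbox would also restrict filesystem and network access.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        return False, result.stderr  # the traceback to feed back to the model
    return True, result.stdout

def solve_with_retries(prompt: str, generate, max_attempts: int = 3) -> str:
    """Hypothetical loop: `generate(prompt, error)` stands in for an LLM call
    that returns code, optionally conditioned on the previous error message."""
    error = None
    for _ in range(max_attempts):
        code = generate(prompt, error)
        ok, output = run_sandboxed(code)
        if ok:
            return output
        error = output  # retry, showing the model what went wrong
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")
```

The same loop covers the math case: instead of answering an arithmetic question directly, the model emits code that computes the answer, and the interpreter's output is returned.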