Working with LLM APIs: first calls, tokens, and structured output

Your first call to a language-model API is simpler than it looks. Once you know a few basics — roles, tokens, temperature, and structured output — the rest falls into place.

In practice, using a language model through an API means sending a message and getting a response. A handful of basics are enough to make that first call productive. Let’s go through them.

The anatomy of a call

Every call has three parts: the endpoint (where to send it), the API key (authentication), and the message payload (what you want). The response has a status code (for example 200 for success, 401 for an auth error, 429 for a rate limit) and a body containing the model’s reply.
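As a rough sketch of those three parts, the snippet below assembles a request with Python's standard library without sending it, and maps the status codes mentioned above to a next action. The endpoint URL, the Bearer header scheme, the payload shape, and the helper names are assumptions made for illustration; your provider's documentation has the real values.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute your provider's real URL.
ENDPOINT = "https://api.example.com/v1/chat/completions"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Assemble endpoint, API key, and payload into one HTTP POST request."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Bearer-token auth is a common scheme, assumed here.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def next_step(status: int) -> str:
    """Map the status codes discussed above to a coarse action."""
    if status == 200:
        return "parse the body"
    if status == 401:
        return "check the API key"
    if status == 429:
        return "wait and retry"
    return "inspect the error body"


# Nothing is sent over the network here; the request is only constructed.
request = build_request("sk-placeholder", {"model": "...", "messages": []})
print(request.get_method(), request.full_url)
print(next_step(429))
```

In real code you would usually reach for the provider's SDK instead, which wraps these same pieces.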

Roles: system, user, assistant

The most important point is that the model keeps no memory between calls. Every request has to carry the whole history the model needs. That history is expressed as a list of messages with three roles:

system: the overall instruction that sets the personality and rules once, at the start.
user: the user’s message.
assistant: the model’s previous replies, which you send back to preserve conversational continuity.

messages = [
    {"role": "system", "content": "You are a math teacher. Be patient."},
    {"role": "user", "content": "What is 2 plus 2?"},
    {"role": "assistant", "content": "2 plus 2 is 4."},
    {"role": "user", "content": "Now multiply that by 3."},
]

Because the assistant message is in the history, the model understands that “that” means 4.
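One way to keep that history in step is a small helper that records both sides of every turn. This is a minimal sketch: `ask` and `fake_send` are hypothetical names, and `fake_send` stands in for a real API call so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


def ask(history: list[Message], user_text: str,
        send: Callable[[list[Message]], str]) -> str:
    """Append the user turn, send the whole history, then store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)  # the full list goes out on every call
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_send(messages: list[Message]) -> str:
    """Stand-in for a real API call: reports how many messages it received."""
    return f"(reply after seeing {len(messages)} messages)"


history: list[Message] = [
    {"role": "system", "content": "You are a math teacher. Be patient."},
]
ask(history, "What is 2 plus 2?", fake_send)
ask(history, "Now multiply that by 3.", fake_send)
print([m["role"] for m in history])
```

Swapping `fake_send` for a function that calls your provider leaves the bookkeeping unchanged.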

Tokens and cost

The model breaks text into units called tokens. A rule of thumb: in English each token is about three-quarters of a word; Persian uses more tokens (roughly one token per two characters). Exact counts depend on the model's tokenizer. Cost is calculated per token, and input and output are priced separately. The context window is the maximum number of tokens the model can see at once; for modern models this ranges from about a hundred thousand to over a million tokens.
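To make the arithmetic concrete, here is a back-of-the-envelope estimator built on the English rule of thumb above. The function names are hypothetical and the prices are placeholders, not any provider's real rates; for exact counts you would use the provider's own tokenizer.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough English-only estimate: about 0.75 words per token."""
    words = len(text.split())
    return round(words / words_per_token)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Input and output tokens are billed at separate rates."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


prompt = "Explain in two sentences what a context window is."
prompt_tokens = estimate_tokens(prompt)  # 9 words / 0.75 = 12 tokens
# Placeholder prices per million tokens, chosen only for the example.
cost = estimate_cost(prompt_tokens, 200,
                     input_price_per_million=1.0,
                     output_price_per_million=4.0)
print(prompt_tokens, cost)
```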

For long conversations that exceed the context window, you have two options: a sliding window (keep only the last few messages) or summarisation (compress the old messages into a summary).
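The sliding-window option can be sketched in a few lines. Keeping the system message while trimming is an assumption added here (the instruction usually still applies), and counting messages instead of tokens is a simplification; `sliding_window` is a hypothetical helper name.

```python
def sliding_window(messages: list[dict], keep_last: int) -> list[dict]:
    """Keep every system message plus the most recent `keep_last` others."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard against keep_last == 0, because rest[-0:] would return everything.
    recent = rest[-keep_last:] if keep_last > 0 else []
    return system + recent


conversation = [
    {"role": "system", "content": "You are a math teacher. Be patient."},
    {"role": "user", "content": "What is 2 plus 2?"},
    {"role": "assistant", "content": "2 plus 2 is 4."},
    {"role": "user", "content": "Now multiply that by 3."},
    {"role": "assistant", "content": "4 times 3 is 12."},
    {"role": "user", "content": "And divide by 6?"},
]
trimmed = sliding_window(conversation, keep_last=3)
print([m["content"] for m in trimmed])
```

The trade-off is visible in the result: the earliest turns are gone, so anything the model needs from them must be restated or carried in a summary.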

Temperature and a few key parameters

Temperature controls how random the output is, usually on a scale from 0 to 2 (the exact range varies by provider). A value of 0 makes the model pick the most likely token at each step, which gives near-deterministic output (good for coding and data extraction); 1 is balanced; and higher values suit creative writing. One rule: change either temperature or top_p, not both at once.
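Temperature is commonly described as dividing the model's raw scores (logits) by the temperature before turning them into probabilities. The sketch below follows that textbook description with made-up scores; how a given provider implements sampling internally is not something this article confirms.

```python
import math


def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    """Turn raw scores into probabilities; temperature 0 is treated as greedy."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top candidate; higher ones spread it out, which is why the output gets more varied.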

Another important parameter is the maximum number of output tokens. This caps the length of the reply, and if the model reaches the cap, the reply is cut off mid-sentence rather than summarised. So set it to match the task.
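Because a cut-off reply can look like a normal one at first glance, it helps to check why generation stopped. The sketch below assumes the convention of the Chat Completions style of API, where a finish reason of "length" signals that the cap was hit; other providers use different field names and values. `Reply` and `check_complete` are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    """Hypothetical container for the two fields this check needs."""
    text: str
    finish_reason: str


def check_complete(reply: Reply) -> str:
    """Return the text, or raise if the reply hit the output-token cap."""
    if reply.finish_reason == "length":  # assumed value; varies by provider
        raise ValueError(
            "reply was cut off; raise the token cap or shorten the task"
        )
    return reply.text


print(check_complete(Reply("2 plus 2 is 4.", "stop")))
```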

Structured output

Often you want to process the reply as data, not free text. There are two ways to do this.

First, JSON mode: you ask the model to return output as JSON only. The catch is that you must mention the word JSON in the instruction.

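import json

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

# Assumption: the API key is provided via the OPENAI_API_KEY environment variable.
client = OpenAI()

response = client.chat.completions.create(
    model="...",
    messages=[
        {"role": "system", "content": "Return output as JSON only."},
        {"role": "user", "content": "Extract: iPhone 15, price $999"},
    ],
    response_format={"type": "json_object"},
)
# The reply arrives as a JSON string; parse it into a Python dict.
data = json.loads(response.choices[0].message.content)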

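This gives you syntactically valid JSON, but it does not guarantee that the structure is correct (for example, that all fields are present), and a reply cut off by the output-token cap can still be incomplete.

Second, function calling: you define a function signature as a schema, and the model returns arguments shaped to match it. This is more precise and more reliable for structured data, though it is still worth validating the result unless your provider offers strict schema enforcement.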
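Whichever route you take, a quick structural check on the parsed reply is cheap insurance. The sketch below defines a function signature in the JSON Schema style that function calling typically uses, then lists any required fields missing from a reply. The tool name, its fields, and the `missing_fields` helper are all made up for illustration, and the exact wrapper format differs between providers.

```python
import json

# Hypothetical tool definition; the name and fields are invented here.
EXTRACT_PRODUCT_TOOL = {
    "type": "function",
    "function": {
        "name": "extract_product",
        "description": "Extract a product name and price from text.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price_usd": {"type": "number"},
            },
            "required": ["name", "price_usd"],
        },
    },
}


def missing_fields(raw_json: str, schema: dict) -> list[str]:
    """Parse a JSON reply and list the required fields that are absent."""
    data = json.loads(raw_json)
    return [key for key in schema["required"] if key not in data]


schema = EXTRACT_PRODUCT_TOOL["function"]["parameters"]
print(missing_fields('{"name": "iPhone 15", "price_usd": 999}', schema))
print(missing_fields('{"name": "iPhone 15"}', schema))
```

This only checks that fields are present; a fuller validator would also check types, which is what dedicated schema-validation libraries are for.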

Putting it together

A first API call is just these few pieces: messages with roles, an awareness of tokens and cost, a few parameters to control behaviour, and a way to get structured output. Once you have these basics in hand, the rest of the capabilities are just add-ons.