Typewriter effect with Svelte, FastAPI and IBM watsonx.ai
I have often encountered situations where the IBM watsonx.ai Generative AI API takes more than 20 seconds to fully generate a response. The length and duration of the response depend on the number of generated tokens, and the number of tokens depends on the instruction or prompt given to the Generative AI model. In short, the more tokens the model is expected to generate, the longer it takes to produce the complete output.
User experience is an important element of Generative AI engagement. In the age of information overload, waiting sucks. A loading animation makes the wait slightly more bearable, but it may not suit a Generative AI application. With the popularity of the ChatGPT interface, the typewriter style, which streams generated tokens in real time, has become widely accepted. In this post, I share an example of how to use Svelte, FastAPI and IBM watsonx.ai to achieve the same effect.
IBM watsonx.ai Streaming Interface
Similar to the OpenAI completion API, IBM watsonx.ai provides a streaming API interface. This interface can be accessed via the IBM Watson Machine Learning Python SDK. For this example I have a simple system prompt for Llama 2 as below:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
{question} [/INST]
In Python, the model receives the message/question from the UI, inserts it into the system prompt, and posts it to IBM watsonx.ai. The response is returned as streamed text.
model.generate_text_stream(prompt=system_prompt.replace("{question}", payload["message"]))
More information is available at: https://ibm.github.io/watson-machine-learning-sdk/model.html#ibm_watson_machine_learning.foundation_models.Model.generate_text_stream
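For context, here is a minimal sketch of how the Model object used above might be initialized with the SDK. The credentials, project ID, model ID and generation parameters below are placeholder assumptions; substitute your own values.

from ibm_watson_machine_learning.foundation_models import Model

# Placeholder credentials and project ID -- replace with your own IBM Cloud values
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "YOUR_API_KEY",
}

# Generation parameters; max_new_tokens caps how many tokens will be streamed back
params = {
    "decoding_method": "greedy",
    "max_new_tokens": 500,
}

model = Model(
    model_id="meta-llama/llama-2-70b-chat",  # assumed Llama 2 model ID on watsonx.ai
    params=params,
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# generate_text_stream returns a generator that yields text chunks as they arrive
# (system_prompt holds the Llama 2 template shown above)
for chunk in model.generate_text_stream(prompt=system_prompt.replace("{question}", "What is watsonx.ai?")):
    print(chunk, end="", flush=True)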
FastAPI
In FastAPI, the default behavior of an endpoint is to return the full payload at once, whether HTML, text, a file and so on. FastAPI supports streaming responses via StreamingResponse. In this case, we will wrap generate_text_stream with StreamingResponse. In the example below, I re-used the same Carbon Portal app from my previous post.
@app.post("/chat")
async def api(request: Request, Verification = Depends(verification)):
    # If the verification function successfully authenticates the user, run the code below
    if Verification:
        payload = await request.json()
        print("message = " + payload["message"])
        return StreamingResponse(
            model.generate_text_stream(prompt=system_prompt.replace("{question}", payload["message"])),
            media_type="text/event-stream",
        )
    else:
        return {"output": "Not authenticated"}
More information about FastAPI StreamingResponse is available at: https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse
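To confirm the endpoint actually streams rather than buffering the whole reply, you can consume it chunk by chunk from a plain Python client. This is a minimal sketch, assuming the app runs on localhost port 8000 and that headers carries whatever credentials your verification dependency expects.

import requests

# Hypothetical URL and auth header -- adjust to your deployment and verification scheme
url = "http://localhost:8000/chat"
headers = {"Authorization": "Bearer YOUR_TOKEN"}

# stream=True makes requests hand chunks over as the server flushes them
with requests.post(url, json={"message": "What is watsonx.ai?"}, headers=headers, stream=True) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)

If the tokens print one by one instead of arriving all at once, the StreamingResponse is working.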
Carbon Components Svelte
In the UI, I created additional ChatSession, ChatBox, and ChatInput components. ChatSession handles the interaction of posting to IBM watsonx.ai and reading its streaming response. ChatBox handles the rendering of chat messages. ChatInput handles the typing of chat input. As FastAPI returns a StreamingResponse, the UI's fetch function also needs to be modified to handle streaming. The streamed data arrives as a byte array (binary encoding), so you need to decode it before rendering the text on screen. TextDecoderStream is designed to perform this conversion seamlessly (https://developer.mozilla.org/en-US/docs/Web/API/TextDecoderStream).
const sendMessage = async (message: string) => {
  console.log(message);
  const response = await fetch(url + "chat", {
    method: "POST",
    headers: headers,
    body: JSON.stringify({
      message: message
    })
  });
  // Decode the binary stream into text and hand the reader back to the caller
  // @ts-ignore
  const reader = response.body.pipeThrough(new TextDecoderStream()).getReader();
  return reader;
};
How do you know when the streaming is complete? Good question. The answer is the done flag returned by the reader's read() method: each call to read() resolves to a { value, done } pair, and done becomes true once the stream has delivered all its data. Therefore some logic is needed to detect done and mark the response as completely generated. The rest of the code updates the message object in Svelte so that it re-renders in real time, creating the typewriter effect. In the loop below, a trailing "_" is appended as a cursor while tokens stream in, and stripped off once done is true.
let result = "";
while (true) {
  const { value, done } = await response.read();
  if (done) {
    // Stream finished: strip the trailing "_" cursor from the last message
    let d = current[current.length - 1];
    d.text = d.text.slice(0, -1);
    current[current.length - 1] = d;
    current = current; // reassign to trigger Svelte reactivity
    break;
  }
  // Append the newly streamed chunk and show a "_" cursor after it
  result += `${value}`;
  generatedOutput = result + "_";
  let d = current[current.length - 1];
  d.text = generatedOutput;
  current[current.length - 1] = d;
  current = current; // reassign to trigger Svelte reactivity
}
The final build looks something like below:
The full source code is at https://github.com/ongkhaiwei/carbon-portal-chat