Typewriter effect with Svelte, FastAPI and IBM watsonx.ai
I have often encountered situations where the IBM watsonx.ai Generative AI API takes more than 20 seconds to fully generate a response. The length and duration of the response depend on the number of generated tokens, and the number of tokens depends on the instruction or prompt given to the Generative AI model. In short, the more tokens the model is expected to generate, the longer it takes to produce the complete output.
User experience is an important element of Generative AI engagement. In the age of information overload, waiting sucks. A loading animation makes the wait slightly more bearable, but it may not suit a Generative AI application. With the popularity of the ChatGPT interface, the typewriter style, which streams generated tokens in real time, has become widely accepted. In this post, I share an example of how to use Svelte, FastAPI and IBM watsonx.ai to achieve the same effect.
IBM watsonx.ai Streaming Interface
Similar to the OpenAI completion API, IBM watsonx.ai provides a streaming API interface. This interface can be accessed via the IBM Watson Machine Learning Python SDK. For this example I have a simple system prompt for Llama 2 as below:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
{question} [/INST]
In Python, the model receives the message/question from the UI, inserts it into the system prompt, and posts it to IBM watsonx.ai. The response is returned as streamed text.
model.generate_text_stream(prompt=system_prompt.replace("{question}", payload["message"]))
More information is available at: https://ibm.github.io/watson-machine-learning-sdk/model.html#ibm_watson_machine_learning.foundation_models.Model.generate_text_stream
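For context, here is a minimal sketch of how the Model object used above might be initialized with the SDK. The credentials, project ID, model ID and generation parameters below are placeholder assumptions; substitute your own values.

from ibm_watson_machine_learning.foundation_models import Model

# Placeholder credentials and project ID -- replace with your own IBM Cloud values
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "YOUR_API_KEY",
}

# Generation parameters; max_new_tokens caps how many tokens will be streamed back
params = {
    "decoding_method": "greedy",
    "max_new_tokens": 500,
}

model = Model(
    model_id="meta-llama/llama-2-70b-chat",  # assumed Llama 2 model ID on watsonx.ai
    params=params,
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# generate_text_stream returns a generator that yields text chunks as they arrive
# (system_prompt holds the Llama 2 template shown above)
for chunk in model.generate_text_stream(prompt=system_prompt.replace("{question}", "What is watsonx.ai?")):
    print(chunk, end="", flush=True)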
FastAPI
In FastAPI, the default behavior of an endpoint is to return the full payload at once, whether HTML, text, a file and so on. FastAPI supports streaming responses via StreamingResponse. In this case, we will wrap generate_text_stream with StreamingResponse. In the example below, I re-used the same Carbon Portal app from my previous post.
@app.post("/chat")
async def api(request: Request, Verification = Depends(verification)):
    # If the verification function successfully authenticates the user, run the code below
    if Verification:
        payload = await request.json()
        print("message = " + payload["message"])
        return StreamingResponse(
            model.generate_text_stream(prompt=system_prompt.replace("{question}", payload["message"])),
            media_type="text/event-stream",
        )
    else:
        return {"output": "Not authenticated"}
More information about FastAPI StreamingResponse is available at: https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse
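To confirm the endpoint actually streams rather than buffering the whole reply, you can consume it chunk by chunk from a plain Python client. This is a minimal sketch, assuming the app runs on localhost port 8000 and that headers carries whatever credentials your verification dependency expects.

import requests

# Hypothetical URL and auth header -- adjust to your deployment and verification scheme
url = "http://localhost:8000/chat"
headers = {"Authorization": "Bearer YOUR_TOKEN"}

# stream=True makes requests hand chunks over as the server flushes them
with requests.post(url, json={"message": "What is watsonx.ai?"}, headers=headers, stream=True) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)

If the tokens print one by one instead of arriving all at once, the StreamingResponse is working.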
Carbon Components Svelte
In the UI, I created additional ChatSession, ChatBox, and ChatInput components. ChatSession handles the interaction of posting to IBM watsonx.ai and reading its streaming response. ChatBox handles the rendering of chat messages. ChatInput handles the typing of chat input. As FastAPI returns a StreamingResponse, the UI's fetch function also needs to be modified to handle streaming. The streamed data arrives as a byte array (binary encoding), so you need to decode it before rendering the text on screen. TextDecoderStream is designed to perform this conversion seamlessly (https://developer.mozilla.org/en-US/docs/Web/API/TextDecoderStream).
const sendMessage = async (message: string) => {
  console.log(message);
  const response = await fetch(url + "chat", {
    method: "POST",
    headers: headers,
    body: JSON.stringify({
      message: message
    })
  });
  // Decode the binary stream into text and hand the reader back to the caller
  // @ts-ignore
  const reader = response.body.pipeThrough(new TextDecoderStream()).getReader();
  return reader;
};
How do you know when the streaming is complete? Good question. The answer is the done flag returned by the reader's read() method: each call to read() resolves to a { value, done } pair, and done becomes true once the stream has delivered all its data. Therefore some logic is needed to detect done and mark the response as completely generated. The rest of the code updates the message object in Svelte so that it re-renders in real time, creating the typewriter effect. In the loop below, a trailing "_" is appended as a cursor while tokens stream in, and stripped off once done is true.
let result = "";
while (true) {
  const { value, done } = await response.read();
  if (done) {
    // Stream finished: strip the trailing "_" cursor from the last message
    let d = current[current.length - 1];
    d.text = d.text.slice(0, -1);
    current[current.length - 1] = d;
    current = current; // reassign to trigger Svelte reactivity
    break;
  }
  // Append the newly streamed chunk and show a "_" cursor after it
  result += `${value}`;
  generatedOutput = result + "_";
  let d = current[current.length - 1];
  d.text = generatedOutput;
  current[current.length - 1] = d;
  current = current; // reassign to trigger Svelte reactivity
}
The final build looks something like below:
The full source code is at https://github.com/ongkhaiwei/carbon-portal-chat