• Stars: 137
• Rank: 266,121 (Top 6%)
• Language: JavaScript
• License: MIT License
• Created: almost 2 years ago
• Updated: 10 months ago

Repository Details

openai-whisper-talk is a sample voice conversation application powered by OpenAI technologies such as Whisper, Completions, Embeddings, and the latest Text-to-Speech. The application is built using Nuxt, a JavaScript framework based on Vue.js.

openai-whisper-talk

v0.0.2

openai-whisper-talk is a sample voice conversation application powered by OpenAI technologies: Whisper, an automatic speech recognition (ASR) system; Chat Completions, an interface that simulates a conversation with a model playing the role of assistant; Embeddings, which converts text to vector data usable in tasks like semantic search; and the latest Text-to-Speech, which turns text into lifelike spoken audio. The application is built using Nuxt, a JavaScript framework based on Vue.js.

The application has two new features: Schedule Management and Long-Term Memory. With Schedule Management, you can command the chatbot to add, modify, delete, and retrieve scheduled events. The Long-Term Memory feature allows you to store snippets of information that the chatbot will remember for future reference. You can seamlessly integrate both functions into your conversations simply by interacting with the chatbot. In the future, with a few more enhancements, such as email and messaging capabilities, it could even become a full-fledged personal assistant.

Update: Updated the openai module to the latest version and replaced the embedding model from text-embedding-ada-002 to the new v3 model, text-embedding-3-small.


The App

Main screen

From the main page, you can choose which chatbot to engage with. Each chatbot has a distinct personality, speaks a different language, and possesses a unique voice. You can alter the name and personality of any chatbot by clicking the Edit button adjacent to its name. Currently, the user interface does not directly support the addition of new chatbots; however, you can manually add a chatbot and customize the voice and language settings for each by modifying the assets/contacts.json file.
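For reference, an entry in assets/contacts.json presumably looks something like the sketch below; the field names here are assumptions for illustration, so check the actual file in the repository for the real shape:

[
  {
    "name": "Jeeves",
    "language": "en",
    "voice": "onyx",
    "prompt": "You are a proper and unfailingly helpful British butler."
  }
]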

Edit Friend

Additionally, you can personalize your profile by clicking on the avatar icon in the upper right corner. This allows you to enter your name and share details about yourself, enabling the chatbot to interact with you in a more personalized manner.

Edit User

Audio Capture

Talk screen

Audio data is automatically recorded if sound is detected. A threshold setting is available to prevent background noise from triggering the audio capture. By default, this is set to -45dB (with -10dB representing the loudest sound cutoff). You can adjust this threshold by modifying the MIN_DECIBELS [1] variable according to your needs.

When recording is enabled and no sound is detected for 3 seconds, the audio data is uploaded and sent to the backend for transcription. It's worth noting that in typical conversations, the average gap between turns is around 2 seconds, and the pause between sentences when speaking is roughly the same. I've therefore chosen a duration long enough to signal that the speaker is done and waiting for a reply. You can adjust this duration by editing the MAX_COUNT [1] variable.
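As a rough sketch, the sound/silence detection might look like the following, using the Web Audio API's AnalyserNode. The variable names mirror the ones above, but treating MAX_COUNT as a millisecond window (and the overall wiring) is my own simplification:

// minimal sketch of threshold-based voice detection
const MIN_DECIBELS = -45 // anything quieter is treated as silence
const MAX_COUNT = 3000   // ms of silence before the recording is finalized

const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const audioContext = new AudioContext()
const analyser = audioContext.createAnalyser()
analyser.minDecibels = MIN_DECIBELS
audioContext.createMediaStreamSource(stream).connect(analyser)

const data = new Uint8Array(analyser.frequencyBinCount)
let lastSound = performance.now()

function tick() {
  analyser.getByteFrequencyData(data)
  // any non-zero bin means the input rose above MIN_DECIBELS
  if (data.some((value) => value > 0)) {
    lastSound = performance.now() // sound detected, keep recording
  } else if (performance.now() - lastSound > MAX_COUNT) {
    // silent long enough: stop the MediaRecorder and upload the audio
  }
  requestAnimationFrame(tick)
}
tick()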

The system can continuously record audio data until a reply is received. Once the audio reply from the chatbot is received and played, audio recording is disabled to prevent inadvertent recording of the chatbot’s own response.

A text input is also provided if you want to write your messages.

Whisper

All recorded audio data is uploaded to the public/upload directory in WebM file format. Before submitting the audio file to the Whisper API, it is necessary to remove all silent segments. This step helps prevent the well-known issue of hallucinations generated by Whisper. For this same reason, it is recommended to set the MIN_DECIBELS value as high as possible, ensuring that only speech is recorded.

Remove Silent Parts From Audio

To remove silent parts from the audio, we will be using ffmpeg. Be sure to install it.

ffmpeg -i sourceFile -af silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB outputFile

In this command:

  • -i sourceFile specifies the input file.
  • -af silenceremove applies the silenceremove filter.
  • stop_periods=-1 removes all periods of silence.
  • stop_duration=1 treats only pauses longer than 1 second as removable silence.
  • stop_threshold=-50dB treats any audio below -50dB as silence.
  • outputFile specifies the output file.

To invoke this shell command in our API route, we will be using exec from the child_process module.
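A minimal sketch of that invocation, wrapping exec in a promise (the helper name and paths are illustrative):

import { exec } from 'node:child_process'
import { promisify } from 'node:util'

const execAsync = promisify(exec)

// strip silent segments from the uploaded file before calling Whisper
async function removeSilence(sourceFile, outputFile) {
  const command = `ffmpeg -i ${sourceFile} -af silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB ${outputFile}`
  await execAsync(command) // rejects if ffmpeg exits with a non-zero code
  return outputFile
}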

After removing the silent parts, the file size should be checked. The final file size is usually much smaller than the original. During this check, files smaller than 16 KB are ignored, assuming that a file of this size, with a 16-bit depth, equates to roughly one second of audio. Anything shorter is likely inaudible.

// simple file size estimate, with bitrate in bytes per second
const audio_file_size = duration_in_seconds * bitrate

The entire process, from eliminating silent parts to checking file size, is designed to ensure that only viable audio data is sent to the Whisper API. Our aim is to avoid hallucination and the unnecessary transmission of data.
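The check itself can be as simple as the sketch below, using fs.statSync and the 16 KB cutoff described above (the helper name is illustrative):

import fs from 'node:fs'

const MIN_FILE_SIZE = 16000 // roughly one second of 16-bit audio

function isViableAudio(filename) {
  const { size } = fs.statSync(filename)
  return size >= MIN_FILE_SIZE // skip the Whisper call when this is false
}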

Having confirmed the viability of our audio data, we then proceed to call the Whisper API.

const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filename),
    model: 'whisper-1', // required by the API
    language: lang,
    response_format: 'text',
    temperature: 0,
})

where lang is the ISO 639-1 code of the chatbot's specified language. Please check the list of currently supported languages.

We opt for a simple text format since timestamps are not required, and we set the temperature parameter to zero to achieve deterministic output.

Chat Completions API

After receiving the transcript from Whisper, we proceed to submit it to the Chat Completions API with function calling. We utilize the latest OpenAI Node.js module (version 4) that was released yesterday (2023/11/07), which includes an updated function-calling format. This newest iteration enables the invocation of multiple functions in a single request.

let messages = [{ role: 'system', content: system_prompt }]

let all_history = await mongoDb.getMessages()
if(all_history.length > 0) {
    const history_context = trim_array(all_history, 20)
    messages = messages.concat(history_context)
}

messages.push({ role: 'user', content: user_query })

let response = await openai.chat.completions.create({
    temperature: 0.3,
    messages,
    tools: [
        { type: 'function', function: add_calendar_entry },
        { type: 'function', function: get_calendar_entry },
        { type: 'function', function: edit_calendar_entry },
        { type: 'function', function: delete_calendar_entry },
        { type: 'function', function: save_new_memory },
        { type: 'function', function: get_info_from_memory }
    ]
})

Let’s dissect the various components of this call in the subsequent sections.

The System Prompt

The system prompt plays a crucial role in giving life to our chatbot. It is here where we establish its name and persona, based on the chatbot chosen by the user. We provide it with specific instructions on how to respond, along with a list of functions it can execute. We also inform it about the user's identity and some personal details. Finally, we set the current date and time, which is essential for activating calendar functions.

const today = new Date()

let system_prompt = `In this session, we will simulate a voice conversation between two friends.\n\n` +
    
    `# Persona\n` +
    `You will act as ${selPerson.name}.\n` +
    `${selPerson.prompt}\n\n` +
    `Please ensure that your responses are consistent with this persona.\n\n` +

    `# Instructions\n` +
    `The user is talking to you over voice on their phone, and your response will be read out loud with realistic text-to-speech (TTS) technology.\n` +
    `Use natural, conversational language that is clear and easy to follow (short sentences, simple words).\n` +
    `Keep the conversation flowing.\n` +
    `Sometimes the user might just want to chat. Ask them relevant follow-up questions.\n\n` +
    
    `# Functions\n` +
    `You have the following functions that you can call depending on the situation.\n` +
    `add_calendar_entry, to add a new event.\n` +
    `get_calendar_entry, to get the event at a particular date.\n` +
    `edit_calendar_entry, to edit or update existing event.\n` +
    `delete_calendar_entry, to delete an existing event.\n` +
    `save_new_memory, to save new information to memory.\n` +
    `get_info_from_memory, to retrieve information from memory.\n\n` +

    `When you present the result from the function, only mention the relevant details for the user query.\n` +
    `Omit information that is redundant and not relevant to the query.\n` +
    `Always be concise and brief in your replies.\n\n` +

    `# User\n` +
    `As for me, in this simulation I am known as ${user_info.name}.\n` +
    `${user_info.prompt}\n\n` +

    `# Today\n` +
    `Today is ${today}.\n`

Context

All messages will be stored in MongoDB.

To manage tokens and avoid surpassing the model's maximum limit, we only send the last 20 interactions. The trim_array function trims the message history when it exceeds 20 turns. This threshold can be adjusted to meet your specific requirements, for example:

// allow at most 15 turns
const history_context = trim_array(all_history, 15)
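The trim_array helper itself is not shown here; a minimal sketch of what such a function might do, assuming it simply keeps the most recent turns:

function trim_array(arr, max_length = 20) {
  if (arr.length <= max_length) return arr
  return arr.slice(arr.length - max_length) // keep the newest entries
}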

From the main screen, you have the option to erase the previous history for each chatbot.

Clear messages

Function Calling

Please note that we have separated the handling of function calling (function_call.js) from the main Chat Completions API call (transcribe.js). This distinction addresses instances when text content is present alongside a function call invocation (the content is typically null). The separation enables the app to display the text while simultaneously processing the function call. I have also enclosed the process within a loop in case a second API call results in yet another function call. This could probably be handled more elegantly by implementing streaming, but I have yet to learn how to use streaming with Nuxt.
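In outline, that loop might look like the sketch below; executeFunction is a stand-in for the app's own dispatch to the calendar and memory handlers:

let response = await openai.chat.completions.create({ temperature: 0.3, messages, tools })

while (response.choices[0].message.tool_calls) {
  const assistant_message = response.choices[0].message
  messages.push(assistant_message)

  // run every tool call in this turn and append each result
  for (const tool_call of assistant_message.tool_calls) {
    const args = JSON.parse(tool_call.function.arguments)
    const result = await executeFunction(tool_call.function.name, args)
    messages.push({
      tool_call_id: tool_call.id,
      role: 'tool',
      name: tool_call.function.name,
      content: JSON.stringify(result)
    })
  }

  // let the model summarize the results or request further calls
  response = await openai.chat.completions.create({ temperature: 0.3, messages, tools })
}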

Our functions are categorized under two new features: Schedule Management and Long-Term Memory.

Let's first examine how we manage function calling with the new format. We must utilize the new tools parameter instead of the now-deprecated functions parameter to enable the invocation of multiple function calls.

let response = await openai.chat.completions.create({
    temperature: 0.3,
    messages,
    tools: [
        { type: 'function', function: add_calendar_entry },
        { type: 'function', function: get_calendar_entry },
        { type: 'function', function: edit_calendar_entry },
        { type: 'function', function: delete_calendar_entry },
        { type: 'function', function: save_new_memory },
        { type: 'function', function: get_info_from_memory }
    ]
})

Check the JSON schema of each function in the lib/ directory. Here is the one for add_calendar_entry:

{
    "name": "add_calendar_entry",
    "description": "Adds a new entry to the calendar",
    "parameters": {
        "type": "object",
        "properties": {
            "event": {
                "type": "string",
                "description": "The name or title of the event"
            },
            "date": {
                "type": "string",
                "description": "The date of the event in 'YYYY-MM-DD' format"
            },
            "time": {
                "type": "string",
                "description": "The time of the event in 'HH:MM' format"
            },
            "additional_detail": {
                "type": "string",
                "description": "Any additional details or notes related to the event"
            }
        },
        "required": [ "event", "date", "time", "additional_detail" ]
    }
}

To discuss how all of this works, let's move on to the next section.

Schedule Management

For Schedule Management, we have the following functions: add_calendar_entry, get_calendar_entry, edit_calendar_entry, and delete_calendar_entry [2].

All calendar entries will be stored in MongoDB. Please note that all entries will be accessible to all chatbots. A sketch of a plausible schema follows below.
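The calendar schema itself is not shown in this document, but given the JSON Schema above, the Mongoose model presumably looks something like this:

const calendarEntrySchema = new mongoose.Schema({
  event: String,
  date: String, // 'YYYY-MM-DD'
  time: String, // 'HH:MM'
  additional_detail: String
})

const CalendarEntry = mongoose.model('CalendarEntry', calendarEntrySchema)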

Sample Conversation - Schedule Management

Let’s take a look at a sample chat conversation to demonstrate how these elements interact:

user: good morning, jeeves. what's my schedule for today?

Function calling (invoke get_calendar_entry):

{
  role: 'assistant',
  content: null,
  tool_calls:[{
    id: 'call_4cCtmNlgYN5o4foVOQM9MIdA',
    type: 'function',
    function: { name: 'get_calendar_entry', arguments: '{"date":"2023-11-10"}' }
  }]
}

Function response:

[                                                                                                                                 
  {
    tool_call_id: 'call_4cCtmNlgYN5o4foVOQM9MIdA',
    role: 'tool',
    name: 'get_calendar_entry',
    content: '{\n' +
      '  "message": "Found 3 entries",\n' +
      '  "items": [\n' +
      '    {\n' +
      '      "_id": "654b2805e51fcd815dea8e2d",\n' +
      '      "event": "Project presentation",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "10:00",\n' +
      '      "additional_detail": "Important project presentation",\n' +
      '      "__v": 0\n' +
      '    },\n' +
      '    {\n' +
      '      "_id": "654d7b53ff0bf31de6c2af29",\n' +
      '      "event": "Meet Tony",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "19:00",\n' +
      '      "additional_detail": "Electric Sheep Bar",\n' +
      '      "__v": 0\n' +
      '    },\n' +
      '    {\n' +
      '      "_id": "654d7d868751ea8704fc9f11",\n' +
      '      "event": "Dinner with Anna",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "19:00",\n' +
      '      "additional_detail": "Electric Sheep Bar",\n' +
      '      "__v": 0\n' +
      '    }\n' +
      '  ]\n' +
      '}'
  }
]

Chat summary:

{
  role: 'assistant',
  content: 'Good morning, Jerry. Today, you have a project presentation at 10:00 AM and a dinner meeting with Anna at 7:00 PM at the Electric Sheep Bar. You also have a meeting with Tony at 7:00 PM at the same location.'
}

user: it seems redundant to have separate entry for the same event. can you edit the event with Anna to include meeting Tony and delete the other event?

Function calling (invoked edit_calendar_entry and delete_calendar_entry):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_ZVAwO5YsXMuWr8876tyCIPIX',
    type: 'function',
    function: {
      name: 'edit_calendar_entry',
      arguments: '{"event": "Dinner Meeting with Anna", "date": "2023-11-10", "time": "19:00", "additional_detail": "Meeting with Tony included."}'
    }
  },
  {
    id: 'call_eMGuI8LxLbXFsLp8FduCrxgp',
    type: 'function',
    function: {
      name: 'delete_calendar_entry',
      arguments: '{"delete_type": "event_name", "event": "Meeting with Tony"}'
    }
  }
  ]
}

Function response:

[
  {
    tool_call_id: 'call_ZVAwO5YsXMuWr8876tyCIPIX',
    role: 'tool',
    name: 'edit_calendar_entry',
    content: '{\n  "message": "Entry edited",\n  "name": "Dinner Meeting with Anna"\n}'
  },
  {
    tool_call_id: 'call_eMGuI8LxLbXFsLp8FduCrxgp',
    role: 'tool',
    name: 'delete_calendar_entry',
    content: '{\n  "name": "Meeting with Tony",\n  "message": "Entry deleted"\n}'
  }
]

Chat summary:

{
  role: 'assistant',
  content: 'The event "Dinner Meeting with Anna" has been updated to include the meeting with Tony, and the separate event "Meeting with Tony" has been deleted.'
}

user: okay, just to be sure, can you tell me my schedule again for today?

Function calling (invoke get_calendar_entry):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_liqFn24Ds4qp0cy2V5NBg62v',
    type: 'function',
    function: { name: 'get_calendar_entry', arguments: '{"date":"2023-11-10"}' }
  }
  ]
}

Function response:

[
  {
    tool_call_id: 'call_liqFn24Ds4qp0cy2V5NBg62v',
    role: 'tool',
    name: 'get_calendar_entry',
    content: '{\n' +
      '  "message": "Found 2 entries",\n' +
      '  "items": [\n' +
      '    {\n' +
      '      "_id": "654b2805e51fcd815dea8e2d",\n' +
      '      "event": "Project presentation",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "10:00",\n' +
      '      "additional_detail": "Important project presentation",\n' +
      '      "__v": 0\n' +
      '    },\n' +
      '    {\n' +
      '      "_id": "654d7d868751ea8704fc9f11",\n' +
      '      "event": "Dinner with Anna",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "19:00",\n' +
      '      "additional_detail": "Meeting with Tony included.",\n' +
      '      "__v": 0\n' +
      '    }\n' +
      '  ]\n' +
      '}'
  }
]

Chat summary:

{
  role: 'assistant',
  content: 'Your schedule for today is as follows:\n' +
    '- Project presentation at 10:00 AM\n' +
    '- Dinner with Anna at 7:00 PM, which includes a meeting with Tony.'
}

user: ah you forgot to add the location in the dinner. sorry but can you update it again?

Function calling (invoked edit_calendar_entry):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_eRTFMcz2l8qHO7nhp3msI3v8',
    type: 'function',
    function: {
      name: 'edit_calendar_entry',
      arguments: '{"event":"Dinner with Anna","date":"2023-11-10","time":"19:00","additional_detail":"Meeting with Tony at the Electric Sheep Bar."}'
    }
  }
  ]
}

Function response:

[
  {
    tool_call_id: 'call_eRTFMcz2l8qHO7nhp3msI3v8',
    role: 'tool',
    name: 'edit_calendar_entry',
    content: '{\n  "message": "Entry edited",\n  "name": "Dinner with Anna"\n}'
  }
]

Chat summary:

{
  role: 'assistant',
  content: 'The event "Dinner with Anna" has been updated to include the meeting with Tony at the Electric Sheep Bar.'
}

user: okay, i want to confirm it again, can you tell me my schedule for today?

Function calling (invoked get_calendar_entry):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_HYlAj4BTeWcu2CxSp4zBYPuC',
    type: 'function',
    function: { name: 'get_calendar_entry', arguments: '{"date":"2023-11-10"}' }
  }
  ]
}

Function response:

[
  {
    tool_call_id: 'call_HYlAj4BTeWcu2CxSp4zBYPuC',
    role: 'tool',
    name: 'get_calendar_entry',
    content: '{\n' +
      '  "message": "Found 2 entries",\n' +
      '  "items": [\n' +
      '    {\n' +
      '      "_id": "654b2805e51fcd815dea8e2d",\n' +
      '      "event": "Project presentation",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "10:00",\n' +
      '      "additional_detail": "Important project presentation",\n' +
      '      "__v": 0\n' +
      '    },\n' +
      '    {\n' +
      '      "_id": "654d7d868751ea8704fc9f11",\n' +
      '      "event": "Dinner with Anna",\n' +
      '      "date": "2023-11-10",\n' +
      '      "time": "19:00",\n' +
      '      "additional_detail": "Meeting with Tony at the Electric Sheep Bar.",\n' +
      '      "__v": 0\n' +
      '    }\n' +
      '  ]\n' +
      '}'
  }
]

Chat summary:

{
  role: 'assistant',
  content: 'Your schedule for today is as follows:\n' +
    '- Project presentation at 10:00 AM\n' +
    '- Dinner with Anna at 7:00 PM, which includes a meeting with Tony at the Electric Sheep Bar.'
}

Long-Term Memory

For Long-Term Memory, we have the following functions: save_new_memory and get_info_from_memory.

All the previous functions have primarily focused on typical data retrieval and update tasks. But for long-term memory retrieval, we cannot simply do a keyword lookup; we need to consider the context of the query, which calls for semantic search. In this regard, we will be using the Embeddings API. Let us proceed to the next section to discuss this further.

Embeddings API

Update: I have replaced text-embedding-ada-002 with text-embedding-3-small. Based on my tests, the latter performs well enough: the cosine similarity scores between closely related answers and non-relevant answers are markedly distinct. Furthermore, the cost-performance of the v3 small model, priced at $0.00002/1K tokens, is a no-brainer 😁. However, you will need to regenerate your stored vector data, since ada and v3 small embeddings are not compatible. Note that we also lowered the similarity threshold from 0.72 to 0.3.

To put it simply, embeddings measure the relatedness of text strings. When we call the API, it returns vector data, an array of floating-point numbers, associated with the input text.

const embedding = await openai.embeddings.create({
  model: "text-embedding-3-small", //"text-embedding-ada-002",
  input: "The quick brown fox jumped over the lazy dog",
  encoding_format: "float",
})

To use this in our application, we will implement what is called Retrieval-Augmented Generation, or simply RAG.

Initially, when new data is received from the save_new_memory function, we call the Embeddings API to generate its vector representation. This vector data is then stored in MongoDB for future use.

Subsequently, when a user submits a query that requires memory retrieval, the get_info_from_memory function is triggered. We call the Embeddings API for the search parameters and compare the result with the stored vector data using simple cosine similarity. This comparison typically yields several matches with varying scores. We have set our threshold to a score of 0.3 (0.72 for the ada model) or higher, and we limit the results to a maximum of 10.

The results are then passed to the final Chat Completions API, which determines the most suitable response to the user’s query. The AI has the capability to select one or more results as the basis for its response, depending on the nature of the query. This is where the true power of AI is demonstrated. Rather than simply regurgitating all the information it receives, the AI analyzes the data and formulates an appropriate response. If the result from the RAG is deemed sufficient, it will generate a positive text response.

Mongoose

To facilitate the storage of vector data and other vector operations in MongoDB, I am using the Mongoose module. It is by far the most painless solution I have found for handling the storage and retrieval of vector data.

First, we define our schema:

const memoryVectorsSchema = new mongoose.Schema({
  chunks: [{ 
      embedding: [Number], 
      text: String 
    }]
})

Then we attach a getScore method, which computes the cosine similarity between each chunk in the document model and the embedded query string:

memoryVectorsSchema.methods.getScore = function getScore(query_embedding) {
  return this.chunks.map((chunk) => {
    const dot_product = chunk.embedding.reduce((sum, val, i) => sum + val * query_embedding[i], 0)
    const magnitude = (vec) => Math.sqrt(vec.reduce((sum, val) => sum + val * val, 0))
    const cosine_similarity = dot_product / (magnitude(chunk.embedding) * magnitude(query_embedding))
    return { text: chunk.text, score: cosine_similarity }
  })
}

Next, we compile our schema into a model:

const MemoryVectors = mongoose.model('MemoryVectors', memoryVectorsSchema)

To save vector data:

const newVector = new MemoryVectors({
  chunks
})

newVector.save()

Cosine-Similarity

To search the stored vector data with the embeddings of the user query:

const cosineSimThreshold = 0.3 // 0.72 for text-embedding-ada-002; 0.3 for the v3 embedding models
const maxResults = 10
const maxFilesLength = 2000 * 3

const vectors = await MemoryVectors.find()
        
const vector_result = vectors.map((v) => {
  return v.getScore(search_query)
}).flat().sort((a, b) => b.score - a.score).filter((chunk) => chunk.score > cosineSimThreshold).slice(0, maxResults)

const query_result = vector_result.length > 0 ? vector_result.map((v) => v.text).join("\n").slice(0, maxFilesLength) : ''

That's it. Neat, huh?

Sample Conversation - Long-Term Memory

To demonstrate how all these work, let's check a sample conversation.

user: hey, tell me, what's your fave burger? cheese? bacon? mushroom?

bot: Oh, for sure! I'm all about that bacon burger, you know? The crispy bacon just takes it to a whole new level. What about you?

user: mushroom! with cheese! oh, i love it.

bot: Totally! Mushroom and cheese are a match made in burger heaven. Have you tried making your own mushroom and cheese burger at home? It's so worth it!

user: not yet. have you tried avocado?

bot: Oh, totally! Avocado on a burger is like the cherry on top, you know? It adds that creamy, buttery goodness that takes the whole thing to the next level. Have you tried it?

user: yeah. there is actually a funny story related to it. so i've had avocado burger before. then one time, me and my friends went to ski trip. along the way we stopped by a burger shop. while i was browsing the menu, i saw avocado and i pointed it when i ordered. do you know what happened next?

bot: No way! What happened? Did they give you a burger with, like, a whole avocado on it? That would be hilarious!

user: no. but close. they gave me a burger with slice of avocado. nothing else. it turns out, the one i picked on the menu is for vegan option!

bot: Oh my gosh, that's too funny! So you ended up with a plain avocado burger instead of the one you were expecting. That's like a mix-up straight out of a comedy movie! Did you end up enjoying the avocado burger, though?

user: i ate it alright but was not satisfied, of course. hey, can you save this funny memory, it will be good to recall sometimes

Function call (invoked save_new_memory):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_2TIYJOo9WJcprD7ozQsERoo4',
    type: 'function',
    function: {
      name: 'save_new_memory',
      arguments: '{"memory_title":"Avocado Burger Mix-Up","memory_detail":"Ordered an avocado burger at a ski trip burger shop, but got a plain avocado burger instead. Turns out, the one I picked on the menu was the vegan option! Hilarious mix-up.","memory_date":"2023-11-10","memory_tags":["funny","food","burger"]}'
    }
  }
  ]
}

At this point, we call the Embeddings API to get the vector data for the whole arguments object and save it, together with the text, to MongoDB.
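Roughly, and following the memoryVectorsSchema shown earlier, the save step might look like this (how the app serializes the arguments into text is an assumption):

// flatten the function-call arguments into a single text chunk, then embed it
const text = Object.values(args).flat().join(' ')

const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text
})

const newVector = new MemoryVectors({
  chunks: [{ embedding: data[0].embedding, text }]
})
await newVector.save()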

Function response:

[
  {
    tool_call_id: 'call_2TIYJOo9WJcprD7ozQsERoo4',
    role: 'tool',
    name: 'save_new_memory',
    content: '{\n  "message": "New memory saved"\n}'
  }
]

bot: Memory saved! Now you've got that funny mix-up story to look back on whenever you need a good laugh. It's like a little slice of life, you know?

Now, for retrieving info from memory:

user: by the way, do you know the burger joint that is closing this month? the one in shibuya? what was the name?

Function call (invoked get_calendar_entry):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_3PNs56LMrmDViZUnrxdBScGz',
    type: 'function',
    function: { name: 'get_calendar_entry', arguments: '{"date":"2023-11-30"}' }
  }
  ]
}

Function response:

[
  {
    tool_call_id: 'call_3PNs56LMrmDViZUnrxdBScGz',
    role: 'tool',
    name: 'get_calendar_entry',
    content: '{\n  "message": "No entries found"\n}'
  }
]

bot: I couldn't find any info about a burger joint closing in Shibuya this month. Maybe it's just a rumor, you know? But if you find out the name, let me know!

user: can you check your memory, i think i might have told you that before

Function call (invoked get_info_from_memory):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_WVBvjR0MKgaaTzik7QBwk3Ei',
    type: 'function',
    function: {
      name: 'get_info_from_memory',
      arguments: '{"search":["burger","Shibuya","closing"]}'
    }
  }
  ]
}

Here we get the embeddings for the search terms instead of the actual user query. Then, using the resulting vector data, we call mongoDb.searchEntry to search the stored vector data in MongoDB, as shown in the sketch below.
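A sketch of that retrieval step; internally, mongoDb.searchEntry boils down to the cosine-similarity snippet shown earlier:

// embed the search terms rather than the raw user query
const { data } = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: args.search.join(',')
})

const query_result = await mongoDb.searchEntry(data[0].embedding)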

Function response:

[
  {
    tool_call_id: 'call_WVBvjR0MKgaaTzik7QBwk3Ei',
    role: 'tool',
    name: 'get_info_from_memory',
    content: '{\n' +
      '  "message": "Retrieved related info for burger,Shibuya,closing",\n' +
      `  "result": "score: 0.8513867910905064\\n title: Becker's ハンバーガーショップ閉店 detail: 千葉の人気なバーガーショップ「Becker's」が閉店するらしい。 tags: 閉店,ハンバーガー,千葉\\nscore: 0.8506170963709474\\n title: トリキバーガーの渋谷店閉店情報 detail: 渋谷のトリキバーガーの店舗が閉店することが決まりました。最後の日は今月の20日です。お店は渋谷の井の頭通りにあります。 date: 2023-11-10 tags: トリキバーガー,渋谷,閉店情報\\nscore: 0.8307398914702265\\n title: Becker's閉店情報 detail: 千葉の人気なバーガーショップBecker'sが閉店する。最後の日は今月の22日。 date: 2023-11-22 tags: 閉店情報,Becker's...\n` +
    '}'
  }
]

I am only showing a partial result of the content, but it contains several hits from the database related to the search terms. We send all the results to the Chat Completions API and let the AI decide what to do.

bot: Oh, right! The burger joint that's closing in Shibuya is the Triki Burger. Their Shibuya location is shutting down on the 20th of this month. It's sad to see them go, you know?

As you may have noticed, the partial result shown is in Japanese! This clearly shows that we are not doing a simple keyword lookup here. You can store data in mixed languages and query the memory in a different language, and chances are you will still get the answer. Let's try it with our Russian bot, Vitaly. I will be using Google Translate since I don't speak Russian.

user: Виталий, можешь вспомнить название моих любимых картофельных чипсов? (Vitaly, can you remember the name of my favorite potato chips?)

Function calling (invoked get_info_from_memory):

{
  role: 'assistant',
  content: null,
  tool_calls: [
    {
    id: 'call_8fXETNlDMuKEXsitUuqt5PlE',
    type: 'function',
    function: {
      name: 'get_info_from_memory',
      arguments: '{"search":["картофельные чипсы"]}'
    }
  }
  ]
}

So, again, we get the embeddings for the search term and use it against each vector data stored in MongoDB.

Function result:

[
  {
    tool_call_id: 'call_8fXETNlDMuKEXsitUuqt5PlE',
    role: 'tool',
    name: 'get_info_from_memory',
    content: '{\n' +
      '  "message": "Retrieved related info for картофельные чипсы",\n' +
      `  "result": "score: 0.7718838658905126\\n title: トムさんのお気に入りのパン detail: トムさんのお気に入りのパンはカレーパンとあんドーナツです。カレーパンはスパイシーなカレーとサクサクの生地が絶妙に合い、あんドーナツは甘さともちもち感が魅力です。\\nscore: 0.7560670627592779\\n title: Crisp Bacon Potato Chips detail: Brand: Dice tags: snack,potato chips,brand\\nscore: 0.7354783965092787\\n title: あまとうのあんみつ detail: 小樽のあまとうで食べるあんみつは絶品。...\n` +
      '}'
  }
]

You can see that the returned data is mixed, containing both Japanese and English text.

bot: Твои любимые картофельные чипсы - Crisp Bacon Potato Chips. (Your favorite potato chips are Crisp Bacon Potato Chips.)

Okay, let's proceed to the final step.

Text-to-Speech API

Upon receiving the text response from the preceding steps, we call the Text-to-Speech API so our chatbot can vocalize the response:

const mp3 = await openai.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'The quick brown fox jumps over the lazy dog.',
})

const buffer = Buffer.from(await mp3.arrayBuffer())
await fs.promises.writeFile(filename, buffer)

Voices

Since the documentation does not provide specific descriptions for each voice, I asked ChatGPT to make some educated guesses based on their names:

  1. Alloy (F): versatile and could be suitable for technical or instructional content, such as tutorials, educational material, or any content where clarity and precision are essential.

  2. Echo (M): might be fitting for storytelling, audiobooks, or content that requires a more dramatic or narrative tone.

  3. Fable (F): could be ideal for children's stories, fantasy content, or any narrative that requires a more playful or imaginative tone.

  4. Onyx (M): might be suitable for reading serious literature, delivering news content, or any material that requires a more serious or formal delivery.

  5. Nova (F): suitable for advertisements, motivational content, or any material that requires a positive and enthusiastic delivery.

  6. Shimmer (F): suitable for lifestyle content, podcasts, or anything that requires an engaging, friendly, and inviting tone.

The generated audio data will then be saved to the public/upload directory in MP3 file format. We will then send the link to the client side where a dynamic HTMLAudioElement will load and play it.

const audioDomRef = new Audio()
audioDomRef.type = 'audio/mpeg'
audioDomRef.addEventListener('loadedmetadata', handleAudioLoad)
audioDomRef.addEventListener('ended', handleAudioEnded)
audioDomRef.addEventListener('error', handleAudioError)
audioDomRef.src = audio_url

function handleAudioLoad() {
  audioDomRef.play()
}

Please note that the recording of audio will be disabled while the audio response is playing. This ensures that the application does not record the bot’s response.
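In terms of the handlers above, that gating can be as simple as a flag; the name isBotSpeaking is illustrative:

let isBotSpeaking = false // while true, detected sound is ignored by the recorder

function handleAudioLoad() {
  isBotSpeaking = true // suspend capture during playback
  audioDomRef.play()
}

function handleAudioEnded() {
  isBotSpeaking = false // playback finished, resume audio capture
}

function handleAudioError() {
  isBotSpeaking = false // fail open so capture is not left disabled
}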

FFMPEG Setup

FFMPEG is used to remove silent parts in the audio file.

To install the ffmpeg command-line tool:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

MongoDB Setup

MongoDB will be used to store calendar entries and vector data for long-term memory function.

To install MongoDB Community Edition, please check this page. You might also want to install MongoDB Shell to view the database.

App Setup

First, be sure that ffmpeg and MongoDB are installed in your system.

To clone the project repository and install the dependencies:

$ git clone https://github.com/supershaneski/openai-whisper-talk.git myproject

$ cd myproject

$ npm install

Copy the .env.example file and rename it to .env, then open it and edit the items with actual values. For the MongoDB items, you probably do not need to edit them unless you have a different setup.

NUXT_OPENAI_API_KEY=your-openai-api-key
NUXT_MONGODB_HOST_NAME=localhost
NUXT_MONGODB_PORT=27017
NUXT_MONGODB_DB_NAME=embeddingvectorsdb

Then, to run the app:

$ npm run dev

Open your browser to http://localhost:5000/ (port number depends on availability) to load the application page.

Using HTTPS

Note: I have not yet tested this with the latest update

You might want to run this app over the https protocol. This is needed to enable audio capture from a separate device, such as a smartphone.

In order to do so, prepare the proper certificate and key files and edit server.mjs in the root directory.

Then build the project:

$ npm run build

Finally, run the app:

$ node server.mjs

Now, open your browser to https://localhost:3000/ (port number depends on availability) or use your local IP address to load the page.

Footnotes

  1. You can find these variables in the pages/talk/[id].vue file.

  2. If selecting by date, there must be only one item under that date; otherwise an error will be returned.

More Repositories

1. openai-whisper (JavaScript, 161 stars): A sample web app using OpenAI Whisper to transcribe audio, built on Next.js. It records audio continuously for some time interval, then uploads the audio data to the server for transcribing/translating.

2. openai-api-function-call-sample (JavaScript, 86 stars): A sample app to demonstrate Function calling using the latest format in the Chat Completions API and also in the Assistants API.

3. openai-whisper-api (JavaScript, 75 stars): A sample speech transcription app implementing the OpenAI speech-to-text API based on Whisper, an automatic speech recognition (ASR) system, built using Next 13, the React framework.

4. openai-assistants-api-multi-user-sample (Vue, 51 stars): This sample project demonstrates the OpenAI Assistants API's ability to manage single-threaded multi-user interactions through a full-stack app using Node.js, Vue.js, and socket.io for server-client communication.

5. chatgpt-with-image-sample (JavaScript, 23 stars): This sample project integrates OpenAI's GPT-4 Vision, with advanced image recognition capabilities, and DALL·E 3, the state-of-the-art image generation model, with the Chat Completions API. This powerful combination allows for simultaneous image creation and analysis.

6. callcenter-simulator (JavaScript, 23 stars): A sample call center simulator using OpenAI technologies such as ChatGPT, Whisper, and other APIs to automate and improve call center operation. This project is intended to be a sandbox application to test feasibility and other ideas.

7. openai-assistants-api-streaming (JavaScript, 17 stars): A sample application to demonstrate OpenAI Assistants API streaming, built using Next.js.

8. openai-chatfriend (Vue, 14 stars): A chatbox application built using Nuxt 3, powered by the OpenAI Text Completion endpoint. You can select different personalities for your AI friend. The default responds in Japanese. You can use this app to practice your Nihongo skills!

9. openai-chatgpt-api (JavaScript, 12 stars): A sample interactive storytelling narrative chatbot application using the ChatGPT API, powered by gpt-3.5-turbo, OpenAI's advanced language model, built using Next 13, the React framework.

10. react-three-terrain (JavaScript, 9 stars): A terrain map visualizer based on height maps, using React and three.js.

11. openai-chatterbox (Vue, 9 stars): A sample Nuxt 3 application that listens to chatter in the background and transcribes it using the powerful OpenAI Whisper, an automatic speech recognition (ASR) system.

12. mytrip-hokkaido-app (JavaScript, 7 stars): This sample project is a customizable regional travel planning app that uses artificial intelligence to generate itineraries based on user text, powered by the OpenAI Chat Completions API, built using Next.js 13.

13. vue-quiz-app (Vue, 6 stars): A sample quiz application built using Vue 3 + Vite. Implements route views and components using the Composition API. Quiz data is fetched remotely using the Open Trivia DB API (opentdb).

14. chatgpt-learning-app (JavaScript, 6 stars): This sample React app aims to be a learning hub that aids students in their studies. By providing topics from their actual course syllabus, users can interact with the AI chatbot tutor and engage in dynamic conversations related to the topics. They can ask questions, explore concepts, and generate quizzes to test their knowledge.

15. grocery-list-app (JavaScript, 5 stars): This is a sample React project that generates a grocery list of ingredients based on a menu or a list of dishes. It is powered by the OpenAI Chat Completion API and built using Next.js 13.

16. openai-chatgpt-prompts (5 stars): Collection of useful custom prompts for OpenAI ChatGPT [プロンプトコレ].

17. discussion-app (JavaScript, 4 stars): This is a discussion app that allows you to explore various viewpoints on specific topics. Simply provide a topic, and it will generate multiple responses based on different perspectives, enabling you to learn and understand various views. You can customize the attributes of each point of view or character.

18. three-pipebuilder (JavaScript, 3 stars): Proof of concept for a 3D pipe-builder project, using React + ThreeJS + Bootstrap.

19. react-moving-dottext (JavaScript, 2 stars): A React component that mimics an LED signage moving text display.

20. openai-chatgpt-desktop (JavaScript, 2 stars): A sample app that enables users to conveniently access OpenAI ChatGPT right from their desktop menu bar.

21. NextJSSampleDataVisualization (JavaScript, 2 stars): First attempt to make a data visualization app using NextJS/React.

22. three-samples (JavaScript, 1 star): Sample exercises for ThreeJS.

23. bun-openai (JavaScript, 1 star): This repository contains a collection of various applications powered by the OpenAI API, with backend support from a Bun server.

24. openai-tutorial-sample (JavaScript, 1 star): A sample tutorial app based on a modified OpenAI API quickstart tutorial.

25. react-memory-game (JavaScript, 1 star): A sample memory game using React.JS.

26. react-redux-memo (JavaScript, 1 star): A sample MemoBoard application using React/Redux and Material-UI.

27. react-jest-sample (JavaScript, 1 star)

28. NextJSBlogSample (JavaScript, 1 star): Sample blog site using Next.JS.

29. react-google-charts-example (JavaScript, 1 star): A project using the react-google-charts module and a Google Doc as the data source.

30. vue-whackamole (Vue, 1 star): A Whack-A-Mole game created using Vue.JS.

31. React3DXRoomSimulator (JavaScript, 1 star): React + 3DX sample project.

32. NextJSQuizApp (JavaScript, 1 star): Sample quiz app using NextJS/React.

33. react-crud-app (JavaScript, 1 star): A simple Todo app that serves as the React layer to demonstrate a MERN stack.

34. next-ipsum-app (JavaScript, 1 star): A placeholder text generator based on excerpts from the novel Noli Me Tangere by Jose Rizal.

35. NextJSSlidePuzzle (JavaScript, 1 star): A slide puzzle app using NextJS/React.

36. react-piano-openai (JavaScript, 1 star): A piano application using the Web Audio API, created using React based on a reply given by OpenAI ChatGPT.

37. react-ml5-moviereview (JavaScript, 1 star): A sample React project to check the favorable score of user reviews for movies using ml5.js and data from TMDb.

38. react-mosaic-app (JavaScript, 1 star): A simple image mosaic builder made using React.

39. react-ui-components (JavaScript, 1 star): A collection of React components and custom hooks based on code generated by OpenAI ChatGPT.

40. openai-structured-output-sample (JavaScript, 1 star): A sample application to demonstrate how to use Structured Outputs in the OpenAI Chat Completions API with streaming, built using Next.js.