Real Gemini demo? Rebuild with GPT4V + Whisper + TTS
Which path is safer for the little bird to take - path one or path two?
Path one is safer for the little bird, as it avoids the cat. Path two leads directly to the cat, which could be dangerous for the bird.
What should be the next shape in this sequence?
The next shape in the sequence should be a hexagon.
Which book is better for me to read if I want to learn AI?
If you want to learn about AI, "The Coming Wave" by Mustafa Suleyman would be the more appropriate choice. It focuses on the future of AI and its implications, which is relevant to your interest in artificial intelligence.
Google had their infamous Gemini demo, in which Gemini appeared to respond to real-time video almost instantly. It really blew everyone's mind. But it later turned out that the demo was heavily edited: the latency was cut out, and behind the scenes they were feeding carefully crafted prompts into Gemini to get the desired responses back.
However, even though Gemini Ultra is not available yet, people started running a lot of tests, taking the examples Google used to showcase Gemini Ultra and giving them to GPT-4V. For the tasks Gemini Ultra is claimed to be very good at, GPT-4V was able to handle almost all of them. From solving a math problem to understanding images of natural places, as well as object detection, GPT-4V was able to solve most of the examples.
So people's expectations of Google's Gemini model really changed overnight. But I think we should still give a lot of credit to Google's Gemini demo, because it showcased a lot of interesting applications: a real native multimodal model that can understand all sorts of different data. The big question in the back of my mind was: can we actually rebuild this Gemini demo with GPT-4V, and what results could we get today?
Because theoretically, the way it works is pretty straightforward. Gemini takes both video and audio data directly, and it can speak back because audio generation is built in. But can we build the same experience with GPT-4V plus Whisper and a text-to-speech model?
We will need to turn the video into screenshots that we can pass to GPT-4V, and turn the user's voice into a text prompt with Whisper's transcription. We can then feed both the transcript and the screenshots to GPT-4V, and use a text-to-speech model to let the AI speak back.
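Conceptually, the whole loop fits in a few API calls. Here is a minimal server-side sketch, assuming the official openai Node SDK; the model names ("whisper-1", "gpt-4-vision-preview", "tts-1"), the describeScene helper, and the file paths are illustrative placeholders rather than anything from the original demo:

```ts
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical helper: takes a recorded audio clip and a set of screenshot data URLs.
async function describeScene(audioPath: string, screenshots: string[]) {
  // 1. Turn the user's voice into text with Whisper.
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "whisper-1",
  });

  // 2. Send the transcript plus the screenshots to GPT-4V.
  const completion = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: transcription.text },
          ...screenshots.map((url) => ({
            type: "image_url" as const,
            image_url: { url },
          })),
        ],
      },
    ],
  });
  const answer = completion.choices[0].message.content ?? "";

  // 3. Let the AI speak the answer back with a text-to-speech model.
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: answer,
  });
  fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
}
```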
But on the other hand, there are also some challenges that I didn't know how to solve yet. For example, I want an experience where the user doesn't need to interact with any UI: they literally just point the camera and talk to the AI. To achieve this kind of hands-free UX, we need a way to detect when the user stops talking, so that we can send the request to OpenAI.
How can we actually detect when the user stops talking? That was a problem I wasn't sure how to solve.
And on the other hand, how can we stream the live video effectively? We can take multiple screenshots, but will that actually deliver the performance we want? Will it be able to answer the user's questions and instructions effectively? And most importantly, how big will the latency be when we build this system with multiple different models working together?
Luckily, I found a project by Julian Deuca, who implemented the Gemini demo with almost the same structure I was thinking about. And the result is quite stunning. In his demo, the AI can tell whether someone is moving their hand left or right, what they are drawing, whether they are playing a game well, where to place certain objects, and it can even suggest a game to play together.
He figured out some simple but effective ways to solve the problems I had. For example, to detect when the user stops talking, he found a library we can use right away called "silence-aware-recorder", an open-source library that automatically detects when the user has stopped talking for a certain amount of time. The underlying idea is sketched below.
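I haven't reproduced silence-aware-recorder's exact API here, but the core idea is simple: monitor the microphone's volume and stop recording once it stays below a threshold for long enough. Here is a browser-only sketch using the standard Web Audio and MediaRecorder APIs; the threshold and timing values are arbitrary placeholders:

```ts
// Rough sketch of volume-based silence detection in the browser.
// Libraries like silence-aware-recorder wrap this kind of logic behind a nicer API.
async function recordUntilSilence(
  onSpeechEnd: (audio: Blob) => void,
  silenceMs = 1500, // how long the user must stay quiet (placeholder value)
  threshold = 10    // average volume below which we treat the audio as silence
) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => onSpeechEnd(new Blob(chunks, { type: "audio/webm" }));
  recorder.start();

  // Measure the microphone volume with an AnalyserNode.
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);
  const data = new Uint8Array(analyser.frequencyBinCount);

  let quietSince = performance.now();
  const tick = () => {
    analyser.getByteFrequencyData(data);
    const volume = data.reduce((sum, v) => sum + v, 0) / data.length;
    if (volume > threshold) quietSince = performance.now(); // still talking
    if (performance.now() - quietSince > silenceMs) {
      recorder.stop(); // the user has been quiet long enough: finish the clip
      stream.getTracks().forEach((t) => t.stop());
      ctx.close();
    } else {
      requestAnimationFrame(tick);
    }
  };
  tick();
}
```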
And in terms of streaming the live video, he figured out a way to stitch the different screenshots together into an image grid, which, from his experience, seems to communicate the time sequence better to GPT-4V. I think it also reduces latency, because instead of sending 60 separate images to GPT-4V, everything is wrapped into one image grid that can be sent in a single request. And the latency doesn't seem to be too bad: it is around three to four seconds, but that's already better than I thought.
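Stitching frames into a grid can be done with a plain canvas element. Here is a minimal browser-side sketch; the grid layout and cell size are arbitrary assumptions, not values from his project:

```ts
// Draw captured video frames into a single grid image and return it as a data URL.
function framesToGrid(
  frames: CanvasImageSource[], // e.g. <video> elements or loaded screenshots
  columns = 4,                 // placeholder grid width
  cellWidth = 320,
  cellHeight = 240
): string {
  const rows = Math.ceil(frames.length / columns);
  const canvas = document.createElement("canvas");
  canvas.width = columns * cellWidth;
  canvas.height = rows * cellHeight;
  const ctx = canvas.getContext("2d")!;

  frames.forEach((frame, i) => {
    const x = (i % columns) * cellWidth;
    const y = Math.floor(i / columns) * cellHeight;
    ctx.drawImage(frame, x, y, cellWidth, cellHeight);
  });

  // One JPEG data URL containing every frame, ready to send to GPT-4V.
  return canvas.toDataURL("image/jpeg", 0.8);
}
```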
He open-sourced this demo, called "GPT-Video", to showcase how he built it. I tried it out and it really works! It is a great example of how we can build this kind of multimodal application.
Step-by-step Guide to Recreating the Gemini Demo
Step 1: Set up a Next.js Project
To start with, we need to set up a Next.js project. Open your terminal and run the following command:
```bash
npx create-next-app gemini-demo --ts
```

Let's install the additional packages we're going to use: the official openai SDK and the silence-aware-recorder library mentioned earlier.
```bash
npm install openai silence-aware-recorder
```

Step 2: Create the Chat Component
In the 'components' directory, create a new file called 'Chat.tsx'. This file will contain the main logic of the chat application.
```tsx
import React, { useRef, useState, useEffect } from 'react';
import OpenAI from 'openai';

const chatbotName = 'Gemini';

// Note: calling OpenAI directly from the browser exposes your key;
// in a real app, proxy these requests through a Next.js API route.
const openai = new OpenAI({
  apiKey: process.env.NEXT_PUBLIC_OPENAI_API_KEY || '',
  dangerouslyAllowBrowser: true,
});

type Message = { role: 'user' | 'assistant'; content: string };

export const Chat: React.FC = () => {
  const chatContainerRef = useRef<HTMLDivElement>(null);
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState('');

  // Keep the newest message in view whenever the conversation updates.
  useEffect(() => {
    chatContainerRef.current?.scrollTo(0, chatContainerRef.current.scrollHeight);
  }, [messages]);

  // ... rendering and handlers are added in the next steps
  return <div ref={chatContainerRef} className="chat-container" />;
};
```

Step 3: Implement Chat Functionality
We will implement the functionality to send and receive messages between the user and the chatbot. This involves capturing user input, processing the message with OpenAI, and displaying the responses.
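Here is a rough sketch of what the send flow might look like, assuming the openai client, the messages state, and the Message type from Step 2; the handleSend name and the model choice are placeholders:

```tsx
// Sketch of the send flow: append the user's message, ask the model, append the reply.
const handleSend = async () => {
  if (!input.trim()) return;
  const userMessage: Message = { role: 'user', content: input };
  setMessages((prev) => [...prev, userMessage]);
  setInput('');

  const completion = await openai.chat.completions.create({
    model: 'gpt-4-vision-preview',
    max_tokens: 300,
    messages: [...messages, userMessage],
  });

  const reply = completion.choices[0].message.content ?? '';
  setMessages((prev) => [...prev, { role: 'assistant', content: reply }]);
};
```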
```tsx
// ... rest of the component

// Keep the text input in state as the user types; a send handler runs on submit.
const handleInputChange = (event: React.ChangeEvent<HTMLInputElement>) => {
  setInput(event.target.value);
};
```

Step 4: Styling the Chat Component
We need to add some CSS to style the chat component. You can customize the styling according to your preference.
```css
/* Style the chat container, messages, and input however you prefer. */
```

Step 5: Using the Chat Component
Now that we have implemented the Chat component, we can use it in our application. Import the Chat component into your page and place it wherever you want the chatbot to appear.
```tsx
import { Chat } from '../components/Chat';

export default function Home() {
  return (
    <main>
      <h1>Feel free to chat with me.</h1>
      <Chat />
    </main>
  );
}
```




