GPT-4V + Puppeteer = an AI agent that browses the web like a human?

Written by AI Jason - December 30, 2023



There's one type of AI agent use case that has been trending really fast for the past few weeks, and we see multiple teams making huge progress in this direction. From the HyperWrite team publishing a self-operating computer framework that gives GPT-4V direct access to and control of your whole computer, to the team at MultiOn showcasing a web AI agent that has direct access to the web browser and completed the California online driving test by itself, which they described as the first fully autonomous completion of a real-world human knowledge task by AI. This type of agent system, where a super-powerful multimodal model like GPT-4V gets direct computer access, is such a fascinating idea because it seems to unlock so much potential.

Use Case and Market Opportunities

One way to look at this is to examine previous attempts at building similar systems: what were the use cases and limitations, and what new technology or changes could now shift the dynamic? One very direct market category, in my opinion, is RPA, which stands for Robotic Process Automation. It is basically a category of software that helps enterprises build automated robots to handle repetitive, standardized tasks. One of the most representative platforms in this category is UiPath, whose platform lets you build automations that access and interact directly with your desktop apps, from calculators to browser interactions, as well as Excel or legacy systems that have no API endpoint. This provides a huge amount of value to enterprises that have lots of admin tasks which simply move data from one system to another. It is a fast-growing segment: as of 2022, enterprises were already spending more than $3 billion a year on process automation.

But the limitations of those RPA solutions are also quite clear. Most of these systems can't really handle non-standardized or ever-changing processes, not to mention processes that involve more complex decision-making. For example, say I run a shoe brand and I want a robot to scrape pricing and product information from Nike, Adidas, and Puma. I actually need to build a very specific process for every single website because each site's structure is different, and if one of them updates its structure, the previous automation breaks. Every RPA process you build is therefore very specific and fragile to any environment change, so the setup cost is very high. That's why, currently, RPA is mainly used by enterprises with standardized, repetitive tasks whose volume is large enough to justify the cost. But that's also why multimodal AI agents that can directly control the computer and browser are so exciting: in theory, they can handle much more complex situations with far less setup cost. In the data-scraping example, instead of setting up a very specific automation that tells the robot to look at a particular HTML tag, I can simply give the agent the URLs of different competitors and let it automatically navigate each website, take a screenshot, and extract data regardless of format changes, because the agent does the decision-making itself.

And on the other hand, these AI agents can go way beyond simple automation; they can complete intelligent tasks. A customer support agent, for example, can look through the conversation history with a customer, summarize it automatically, fill in the relevant data, and escalate to the right person when needed. So if done well, I believe this web AI agent can potentially open up the market for consumer use cases with much lower volume, and also go beyond normal automation to complete actual digital-worker jobs like customer support, sales, and marketing, as the agent gains the ability to access more and more different systems. We're definitely getting closer and closer to having real AI workers deployed in companies. However, I often find that the biggest gap in delivering a useful AI worker solution is not just understanding the technology, but understanding the end-to-end workflow of a specific job function. That's why I want to introduce you to a research report HubSpot just published, where they surveyed and interviewed more than 1,400 global sales leaders to understand how the modern sales team works and what its end-to-end workflow looks like. They cover a lot of insights, from the key challenges and opportunities sales teams are facing in 2024 to how a significant share of a sales rep's time has shifted into relationship-building, which involves a lot of tedious admin tasks. They also dive into best practices and the top AI use cases those sales leaders are adopting at the moment. It's a super useful deep dive into how the sales function works and where the key opportunities are. I found it really valuable, so if you're trying to build an AI agent for the sales function, I definitely recommend having a look. You can click the link in the description below to download the report for free.

Building an AI Web Agent

Now let's get into building an AI web agent that has direct control of your web browser and can do sophisticated web research and tasks. I will take you through a step-by-step example of how to build a web AI agent that automatically navigates websites to do research and complete web tasks, based on an example from Unconventional Coding, who built a GPT-4V-powered web scraper that made this whole project a lot easier. So, let's build an AI agent that can view and interact with your web browser.

GPT-4V Powered Web Scraper

We will use a Node.js library called Puppeteer to take screenshots and control the web browser. To do that, let's first open Visual Studio Code and create a file called `screenshot.js`. Use the Puppeteer Extra plugin called Stealth to make the automated browser less detectable by websites. Define the URL that we want to scrape; in this example, we will use a normal Wikipedia page to test. Define a timeout, as some web pages can take a while to load. Create an async function, and inside it, first launch a new browser and open a new page. Set the viewport, which determines how big the screen should be, then go to the URL we defined above and wait until the document is fully loaded. Take a screenshot of the page and save it as a `.jpg` file. Finally, close the browser to finish the session.
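Here is a minimal sketch of what `screenshot.js` could look like, following the steps above. The example URL, viewport size, and timeout are placeholder values you would tune yourself:

```javascript
// screenshot.js — a minimal sketch of the steps described above.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin()); // makes the automated browser less detectable

// Example target; also allow an override from the command line
// (handy later, when a Python script calls this file).
const url = process.argv[2] || 'https://en.wikipedia.org/wiki/Main_Page';
const TIMEOUT = 8000; // extra wait, since some pages take a while to render

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 1200, deviceScaleFactor: 1 });
  await page.goto(url, { waitUntil: 'domcontentloaded' }); // wait for the document to load
  await new Promise((resolve) => setTimeout(resolve, TIMEOUT)); // let dynamic content finish
  await page.screenshot({ path: 'screenshot.jpg', quality: 80 }); // save as a .jpg file
  await browser.close(); // finish the session
})();
```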

To set up the Node.js project, run `npm init` and leave everything empty. Install Puppeteer by running `npm install puppeteer`, then install Puppeteer Extra and its stealth plugin by running `npm install puppeteer-extra puppeteer-extra-plugin-stealth`. Now you can run the file we just created with `node screenshot.js`, which will take a screenshot of the Wikipedia page.

However, there is one issue: the script is not using my existing Chrome profile, which means it won't be able to access websites like LinkedIn or Instagram that I'm logged in to. To fix this, download a separate Chrome build called Chrome Canary. Copy the contents of the `Default` folder in your regular Chrome profile over the `Default` folder in the Chrome Canary profile folder. Open Chrome Canary and log in to all the accounts you want to use, like LinkedIn or Instagram. Then modify `screenshot.js` to point Puppeteer at the Chrome Canary executable and the correct user data directory. Now, running `node screenshot.js` should take a screenshot of any website using your own Chrome profile, including those that would normally block a scraping service.
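For reference, here is roughly how the launch options could be changed to point at Chrome Canary and your copied profile. The paths below are macOS examples (assumptions) and will differ on your machine:

```javascript
// Inside the async function in screenshot.js, replace the launch call.
// Both paths are macOS examples — substitute your own locations.
const browser = await puppeteer.launch({
  headless: false, // a visible window plays nicer with logged-in sessions
  executablePath:
    '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary',
  userDataDir:
    '/Users/your-name/Library/Application Support/Google/Chrome Canary',
});
```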

GPT-4V Powered Web Scraper with Python

Now, let's move on to creating a Python file that calls the JavaScript file to take a screenshot and then uses GPT-4V to extract data from that screenshot. Create a Python file called `vision_scraper.py` and a `.env` file to store the OpenAI API key. In `vision_scraper.py`, import the necessary libraries, load the `.env` file, and create an OpenAI client instance. Define a function `image_to_b64` to convert the image file into a format that can be passed to GPT-4V. Create a function called `url_to_screenshot` that takes a URL as input, removes any old screenshot file if it exists, and runs `screenshot.js` using Python's subprocess module; if a screenshot is taken, it returns the path to the screenshot file. Pass this screenshot file to GPT-4V using a `vision_extract` function, which creates a chat completion with GPT-4V, feeding it a system prompt and a user message that includes the image and the user's prompt, then displays and returns the result. Finally, connect everything together with a function called `vision_query` that takes a URL and a prompt, generates the screenshot, and returns the extracted information.
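Below is a minimal sketch of how `vision_scraper.py` could be wired together. The function names mirror the walkthrough, but the model name (`gpt-4-vision-preview`), prompts, and file paths are assumptions to adapt to your setup:

```python
# vision_scraper.py — a minimal sketch of the flow described above.
import base64
import os
import subprocess

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()      # pick up OPENAI_API_KEY from the .env file
client = OpenAI()  # reads the key from the environment


def image_to_b64(image_path):
    """Encode the screenshot as a base64 data URL for GPT-4V."""
    with open(image_path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def url_to_screenshot(url):
    """Call the Node script to capture a screenshot of the given URL."""
    if os.path.exists("screenshot.jpg"):
        os.remove("screenshot.jpg")  # clear out any old screenshot first
    subprocess.run(["node", "screenshot.js", url], check=False)
    return "screenshot.jpg" if os.path.exists("screenshot.jpg") else None


def vision_extract(b64_image, prompt):
    """Ask GPT-4V to extract information from the screenshot."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {"role": "system",
             "content": "You extract the requested data from website screenshots."},
            {"role": "user",
             "content": [
                 {"type": "image_url", "image_url": {"url": b64_image}},
                 {"type": "text", "text": prompt},
             ]},
        ],
        max_tokens=1024,
    )
    return response.choices[0].message.content


def vision_query(url, prompt):
    """Glue function: screenshot the URL, then extract data with GPT-4V."""
    screenshot = url_to_screenshot(url)
    if screenshot is None:
        return None
    result = vision_extract(image_to_b64(screenshot), prompt)
    print(result)
    return result


if __name__ == "__main__":
    vision_query("https://en.wikipedia.org/wiki/Main_Page", "Summarize this page.")
```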

To run the Python script, make sure you have installed the required packages by running `pip install -r requirements.txt`. Then run `python vision_scraper.py`. You can now interact with the AI agent by typing a prompt and getting the extracted information from the screenshot. You can ask for weather information, scrape pricing pages, or even ask more complex questions like how to be featured on an Instagram account.

Building an Advanced AI Web Agent

Now, let's move on to building a more advanced AI web agent that can actually interact with different websites, click on links to navigate, and fill in text inputs and forms if needed. Create a new JavaScript file called `web_agent.js` and import the necessary libraries. Install the `openai` package by running `npm install openai`. Create an instance of the OpenAI client and define a timeout for page loads. Then create a function called `image_to_b64` to convert a local image into a format that can be passed to GPT-4V, a command-line interface that lets users type prompts, and a sleep function to wait for page loads.

Next, create a function called `highlight_links` that removes any previously highlighted bounding boxes and collects all the buttons, inputs, text areas, and links on the web page. For each of those elements, a helper called `is_element_visible` checks whether the element is actually visible on the page based on several criteria. If it is, highlight the element, clean up its link text, and tag it with a special attribute called `gpt-link-text` so that GPT-4V can name the element it wants to interact with and the script can find it and wait for the corresponding event; a sketch of this step follows below.

Then create a function called `get_page` that sets up the Puppeteer browser and page, sets the viewport, and navigates to the specified URL. Highlight all the links on the page, take a screenshot, load the image, and pass it to GPT-4V along with instructions for interpreting the screenshot. Finally, build the main loop that continuously navigates between pages and interacts with them based on GPT-4V's instructions, displaying results to the user and continuing for as long as needed. Run the `web_agent.js` file using Node.js, and you can now interact with the AI web agent by typing prompts and watching it navigate the web browser, click on links, and scrape information from different websites.
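To give a feel for the trickiest part, here is a rough sketch of the link-highlighting step. The attribute spelling `gpt-link-text`, the red border, and the selector list are illustrative choices, not the exact original code:

```javascript
// A sketch of highlight_links — the DOM work runs in the page context
// via Puppeteer's page.evaluate.
async function highlightLinks(page) {
  await page.evaluate(() => {
    // Remove bounding boxes left over from a previous pass.
    document.querySelectorAll('[gpt-link-text]').forEach((el) => {
      el.style.border = '';
      el.removeAttribute('gpt-link-text');
    });

    // Visibility check based on size and computed style.
    const isElementVisible = (el) => {
      const rect = el.getBoundingClientRect();
      const style = window.getComputedStyle(el);
      return (
        rect.width > 0 &&
        rect.height > 0 &&
        style.display !== 'none' &&
        style.visibility !== 'hidden'
      );
    };

    // Tag and outline every visible interactive element so GPT-4V can
    // name it from the screenshot and Puppeteer can find it afterwards.
    document
      .querySelectorAll('a, button, input, textarea, [role="button"]')
      .forEach((el) => {
        if (!isElementVisible(el)) return;
        const text = (el.innerText || '').replace(/[^a-zA-Z0-9 ]/g, '').trim();
        el.style.border = '1px solid red'; // visual cue in the screenshot
        el.setAttribute('gpt-link-text', text);
      });
  });
}
```

Once GPT-4V replies with something like "click the link labeled X", the script can look up the element whose `gpt-link-text` attribute matches X and click it with Puppeteer.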

The Future of AI Web Agents

With the advancements in technology, AI web agents have the potential to revolutionize the way we interact with the web. From automation and data scraping to intelligent tasks and digital workers, AI web agents can open up new market opportunities and provide solutions to complex challenges. However, there are still many improvements to be made in terms of accuracy, interaction, and handling complex forms. But as we continue to develop and refine these AI web agents, we are getting closer to deploying real AI workers in companies and unlocking their full potential.

Frequently Asked Questions

1. Can AI web agents replace human workers?

No, AI web agents cannot replace human workers entirely. While they can automate certain tasks and assist with data scraping and research, human workers are still needed for complex decision-making, critical thinking, and interpersonal skills.

2. What are the limitations of AI web agents?

AI web agents have limitations when it comes to interacting with complex web pages, handling dynamic content, and accurately understanding user prompts. They may also face challenges with form filling and interacting with websites that have strong security measures or anti-scraping mechanisms.

3. How can AI web agents benefit businesses?

AI web agents can benefit businesses by automating repetitive tasks, speeding up data scraping and research, and providing valuable insights and recommendations. They can also assist with customer support, sales, and marketing tasks, freeing up human workers to focus on more complex and strategic activities.

4. Are AI web agents legal?

Using AI web agents for scraping, data extraction, and automation may be subject to legal and ethical considerations. It is important to ensure compliance with applicable laws and regulations, respect website terms of service, and consider privacy and data protection laws when deploying AI web agents.

5. What are the challenges in building AI web agents?

Building AI web agents involves challenges such as training accurate models, handling dynamic web content, adapting to website changes, and ensuring the security and privacy of user data. It also requires expertise in machine learning, web development, and understanding the specific workflows and requirements of different industries and use cases.

In conclusion, AI web agents hold great potential in automating tasks, extracting data, and assisting with various web-related activities. They offer exciting opportunities for businesses and users alike. As technology continues to advance, we can expect further improvements and innovations in this field. So let's embrace the future of AI web agents and explore the possibilities they bring!
