Nilenso Blog · September 30
An AI assistant joins the meeting

This article describes an AI assistant that can join Google Meet calls, take notes automatically, and respond with voice when it is addressed directly. The project uses Puppeteer to control the browser, PulseAudio to manage virtual audio devices, and Google Gemini for real-time AI interaction. The prototype has some limitations, such as security, loss of context, and speech recognition issues, but it demonstrates the potential of AI assistants in meetings.

The assistant uses Puppeteer to control the browser, navigate Meet's UI, handle microphone and camera permissions, and capture the audio stream. This lets it join a meeting like an ordinary participant.

Through PulseAudio, the assistant creates a virtual microphone that can both play the AI's responses and capture audio data, allowing it to interact with the meeting in real time.

The assistant communicates with Google Gemini over a WebSocket in real time. It processes audio continuously, understands when it is being addressed directly, and generates appropriate responses.

The funny thing about artificial intelligence is that the astonishing amount of intelligence we have today is terribly underutilised. The bottleneck is integration, not intelligence.

In our weekly all-hands meeting, we usually assign a person to take notes of what’s being discussed and spell out the outcomes, owners and action items when the meeting ends. Sometimes this person may pull out important context from previous meetings by looking at older notes. It’s valuable grunt work.

Why not drop an AI assistant straight into our Google Meet calls?

I’m not quite satisfied with how AI integrations in meetings are mostly about summarising things after the fact. The process of ensuring that a meeting goes well as it happens is far more valuable than a summary. It’s about ensuring things stay focused, and the right information and context is available to all participants.

LLMs (Large Language Models) are mainstream because of interfaces like ChatGPT—you type something, wait a bit, and get text back. Far fewer people know that models can also natively work with audio. They can process speech directly, understand the nuances of conversation, and even respond with natural-sounding voice. The challenge is: how do we actually plug this intelligence into our existing tools?

That’s what this project explores. I quickly built a bot that:

- joins our Google Meet calls like a regular participant,
- takes notes of what’s being discussed as the meeting happens, and
- responds with voice when it’s addressed directly by name.

As you can see in the demo, the product is nowhere near production quality. It particularly struggles with the transition between taking notes and talking to participants, but it shows a lot of promise, and I don’t see any fundamental limitation that would stop it from using tools well.

System Overview

At a high level, the system has three main components:

- Browser Automation: Using Puppeteer to control a Google Chrome instance that joins the Meet call
- Audio Pipeline: Converting between different audio formats and managing virtual devices
- Google Gemini Integration: Handling the actual AI interactions through WebSocket connections

Each of these parts has its own challenges. Let’s dive into them one by one.

Joining a Google Meet is surprisingly tricky

Google Meet wasn’t exactly designed with bots in mind. The official APIs don’t let you do much. To get our assistant into our call, I drove a real Chrome instance with Puppeteer and scripted my way through Meet’s join flow, as sketched below.

The code is simple, but getting there involved some trial and error. Meet’s UI elements don’t always have consistent selectors, and the timing of operations is crucial. I had some trouble getting Puppeteer’s default locator methods to work well, and ended up resorting to injecting code into the browser that manually queried the DOM to keep things moving.

There are also some subtle edge cases. For example, when you join a meeting with more than five participants, you are muted by default. The first version of Lenso was talking without unmuting itself when it joined such meetings!
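Roughly, the join flow looks like the sketch below. The selectors, button labels and the fake-media-devices flag are illustrative stand-ins rather than the exact code, and they will drift as Meet’s UI changes:

const puppeteer = require("puppeteer");

async function joinMeet(meetUrl, displayName = "Lenso") {
  const browser = await puppeteer.launch({
    headless: false,
    // Auto-accept the mic/camera permission prompts
    args: ["--use-fake-ui-for-media-devices"],
  });
  const page = await browser.newPage();
  await page.goto(meetUrl, { waitUntil: "networkidle2" });

  // Enter a display name for the guest participant
  await page.waitForSelector('input[type="text"]');
  await page.type('input[type="text"]', displayName);

  // Meet's class names aren't stable, so find the join button by its text instead
  await page.evaluate(() => {
    const join = [...document.querySelectorAll("button")].find((b) =>
      /join now|ask to join/i.test(b.textContent)
    );
    if (join) join.click();
  });

  // Larger meetings mute new participants by default, so unmute if needed
  await page.evaluate(() => {
    const mic = document.querySelector('[aria-label*="microphone" i]');
    if (mic && /turn on/i.test(mic.getAttribute("aria-label") || "")) {
      mic.click();
    }
  });

  return { browser, page };
}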

The Audio Pipeline

This is where things get interesting. We need to:

- Capture the audio stream from Meet (I used puppeteer-stream for this, which is a package that uses the Chrome extension API to expose browser audio)
- Convert it to the 16kHz PCM format that Gemini expects (one way to do this is sketched below)
- Receive Gemini’s responses into a buffer
- Feed that audio data back into Meet through a virtual audio device set up with PulseAudio
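The conversion step can be done in a few ways. Here is one sketch, assuming the captured stream is webm/opus and that piping it through ffmpeg is acceptable; this is not necessarily how Lenso does it:

const { spawn } = require("node:child_process");

// Decode whatever Meet gives us (webm/opus) into raw 16-bit PCM at 16 kHz mono,
// which is the input format the Gemini API expects.
function createPcmConverter() {
  return spawn("ffmpeg", [
    "-i", "pipe:0",   // read the captured stream from stdin
    "-f", "s16le",    // raw signed 16-bit little-endian PCM
    "-ar", "16000",   // resample to 16 kHz
    "-ac", "1",       // downmix to mono
    "pipe:1",         // write PCM to stdout
  ]);
}

// Usage: meetAudioStream.pipe(converter.stdin); converter.stdout feeds Gemini.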

The trickiest part was handling the virtual audio devices. We use PulseAudio to create a virtual microphone that can both play our AI’s responses and capture them for Meet. Here’s a sketch:

async createVirtualSource(sourceName = "virtual_mic") {
  this.sourceName = sourceName;

  // Create null sink and store its module ID
  const { stdout: sinkStdout } = await execAsync(
    `pactl load-module module-null-sink sink_name=${sourceName}`,
  );
  this.moduleIds.sink = sinkStdout.trim();

  // Create remap source and store its module ID
  const { stdout: remapStdout } = await execAsync(
    `pactl load-module module-remap-source ` +
      `source_name=${sourceName}_input ` +
      `master=${sourceName}.monitor`,
  );
  this.moduleIds.remap = remapStdout.trim();

  // Set as default source
  await execAsync(`pactl set-default-source ${sourceName}_input`);

  return sourceName;
}
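Presumably these modules should also be unloaded when the bot leaves; a cleanup counterpart along these lines (an assumption on my part, not taken from Lenso) would do it:

async cleanup() {
  // Unload the PulseAudio modules created above so the virtual devices don't linger
  for (const moduleId of Object.values(this.moduleIds)) {
    if (moduleId) {
      await execAsync(`pactl unload-module ${moduleId}`);
    }
  }
  this.moduleIds = {};
}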

And to “speak” into this mic:

writeChunk(chunk) {
  // Guard against uninitialized stream
  if (!this.pacat || !this.isPlaying || this.pacat.killed) {
    return false;
  }

  // Append new chunk to processing buffer
  const newBuffer = new Uint8Array(this.processingBuffer.length + chunk.length);
  newBuffer.set(this.processingBuffer);
  newBuffer.set(chunk, this.processingBuffer.length);
  this.processingBuffer = newBuffer;

  // Split into fixed-size buffers
  while (this.processingBuffer.length >= this.bufferSize) {
    const buffer = this.processingBuffer.slice(0, this.bufferSize);
    this.playQueue.push(buffer);
    this.processingBuffer = this.processingBuffer.slice(this.bufferSize);
  }

  // Write queued buffers to audio stream
  try {
    while (this.isPlaying && this.playQueue.length && !this.pacat.killed) {
      this.pacat.stdin.write(this.playQueue.shift());
    }
    return true;
  } catch (error) {
    return error.code === "EPIPE" ? false : error;
  }
}

I’m sure seasoned audio developers can make this a lot better, but this worked well for the prototype I built.

The browser automation effectively thinks it’s getting audio from the system microphone, but it’s a mock microphone. I’m using pacat to feed audio bytes from Gemini’s API to “speak” into it. If I had the time, I’d have spent much more time on better ways to do this, but I wanted a proof of concept out in a week. Using the simplistic pacat approach also called for some ugly hacks to allow users to interrupt our bot.
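For completeness, here’s roughly what starting and interrupting that pacat process can look like. Gemini’s spoken responses arrive as 16-bit, 24kHz mono PCM (per the docs at the time of writing); the flags and the interrupt handling below are my assumptions rather than the exact implementation:

// spawn comes from require("node:child_process")

startPlayback(sinkName = "virtual_mic") {
  // Pipe raw PCM from stdin into the null sink created earlier.
  this.pacat = spawn("pacat", [
    `--device=${sinkName}`,
    "--format=s16le",
    "--rate=24000",
    "--channels=1",
  ]);
  this.isPlaying = true;
  this.pacat.on("close", () => {
    this.isPlaying = false;
  });
}

stopPlayback() {
  // Crude interruption: drop whatever is queued and kill the writer,
  // so the bot stops mid-sentence when a participant cuts in.
  this.playQueue = [];
  this.processingBuffer = new Uint8Array(0);
  this.isPlaying = false;
  if (this.pacat && !this.pacat.killed) {
    this.pacat.kill("SIGTERM");
  }
}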

The AI Integration

Now for the fun part: making our bot actually intelligent. We use Gemini (Google’s multimodal AI model) through a WebSocket connection for real-time communication. The bot needs to:

- process the meeting audio continuously,
- work out when it is being addressed directly, and
- either quietly take notes or respond with voice, as appropriate.
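Opening the Live session boils down to connecting a WebSocket and sending a setup message that carries the model name, the system instruction and the tool declarations. The endpoint placeholder and field names below are assumptions based on the Live API docs of the time, so treat this as a sketch:

const WebSocket = require("ws");

// GEMINI_LIVE_WS_URL stands in for the BidiGenerateContent WebSocket endpoint
// from the Live API docs, with the API key appended (hypothetical placeholder).
const ws = new WebSocket(process.env.GEMINI_LIVE_WS_URL);

ws.on("open", () => {
  // The first message configures the session: which model to use, that we want
  // spoken (audio) responses, our system prompt and our tools.
  ws.send(JSON.stringify({
    setup: {
      model: "models/gemini-2.0-flash-exp", // the experimental model id at the time
      generationConfig: { responseModalities: ["AUDIO"] },
      systemInstruction,                               // the prompt shown below
      tools: [{ functionDeclarations: [noteTool] }],   // the tool shown below
    },
  }));
});

// After setup, 16 kHz PCM chunks from the meeting stream up as realtime input,
// and audio plus tool-call messages stream back down.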

Here’s how we set up the AI’s personality:

const systemInstruction = {
  parts: [{
    text: `You are a helpful assistant named Lenso (who works for nilenso, a software cooperative).

You have two modes of operation: NOTETAKING MODE and SPEAKING MODE.

NOTETAKING MODE: This is your default mode. Be alert about when you need to switch to SPEAKING MODE.
When you hear someone speak:
1. Use the ${this.noteTool.name} tool to record the essence of what they are saying.
2. DO NOT RESPOND WITH AUDIO.

SPEAKING MODE: Activated when you're addressed by your name.
You may respond only under these circumstances:
- You were addressed directly with "Hey Lenso", and specifically asked a question. Respond concisely.
- In these circumstances, DO NOT USE ANY TOOL.

Examples of when to respond. When any meeting participant says:
- "Hey Lenso, can you..."
- "Lenso, will you note that down?"
- "Lenso, what do you think?"

Examples of when to use ${this.noteTool.name}:
- "Lenso, will you note down what we just spoke about?"
- "Hey <someone else's name>, ..."
- "...<random conversation where the word 'Lenso' is not mentioned>..."

Remember that you're in a Google Meet call, so multiple people can talk to you. Whenever you hear a new voice, ask who that person is, make note and only then answer the question.
Make sure you remember who you're responding to.`
  }]
};

I didn’t spend much time at all on this prompt. Anyone who has built an AI application knows the importance of prompt engineering supported by strong evals, so bear in mind that this proof of concept is nowhere near as intelligent as it could be.

Oh, and I haven’t even done any evals. But hey, I made this in a week. If this were a serious production project, I’d strongly emphasise the importance of engineering maturity when baking intelligence into your product.

The tool system is particularly interesting. Instead of just chatting, the AI can perform actions:

const noteTool = {
  name: "note_down",
  description: "Notes down what was said.",
  parameters: {
    type: "object",
    properties: {
      conversational_snippet: {
        type: "string",
        description: "JSON STRING representation of what was said"
      }
    }
  }
};

The way the Gemini API works is that it sends us a “function call” with the arguments. I can extract this call, actually perform it in our system (for now I just dump notes into a text file), return the response back to the model if needed, and continue generation.
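A sketch of that handling, where the message and field names are assumptions based on the Live API of the time, and the notes file path is made up:

const fs = require("node:fs");

// Handle an incoming server message that contains tool calls.
function handleToolCall(message, ws) {
  const calls = message.toolCall?.functionCalls ?? [];
  for (const call of calls) {
    if (call.name !== "note_down") continue;

    // Perform the action in our system: append the note to a text file.
    const { conversational_snippet } = call.args;
    fs.appendFileSync("meeting-notes.txt", conversational_snippet + "\n");

    // Send the result back so the model can carry on generating.
    ws.send(JSON.stringify({
      toolResponse: {
        functionResponses: [
          { id: call.id, name: call.name, response: { result: "noted" } },
        ],
      },
    }));
  }
}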

What’s great about a live API like this is that it’s a two-way street. The model can be listening or talking back while simultaneously performing actions. I really like that you can interrupt it and steer the conversation. The client and server are constantly pushing events to each other and reacting to them, rather than going through a single-turn request-response cycle.

Limitations

So are we there yet? Is it possible to have these AI employees join our meetings and just do things?

Given how far I could get in a week, I think it’s only a matter of time before we see more AI employees show up in meetings. There are a few notable limitations to address, though:

- security,
- loss of context, and
- speech recognition issues.

Costs?

I made this demo back when Gemini 2.0 Flash was an experimental model, which was free to try for development purposes. We now know how much this costs in production: $0.7/million input tokens, as of March 2025.

This means our meeting bot would cost less than a dollar for actively participating in an hour-long meeting.
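As a rough sanity check (assuming Gemini’s documented rate of about 32 tokens per second of audio, which may have changed since): an hour of meeting audio comes to roughly 3,600 × 32 ≈ 115,000 input tokens, and 0.115 million tokens × $0.7/million ≈ $0.08, which leaves plenty of headroom for output tokens while staying well under a dollar.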

Intelligence is cheap.

Beyond the scrappy fiddle

This prototype barely scratches the surface. Off the top of my head, I can think of plenty of things that are possible to implement with the technology we have today.

Reflections on the state of AI

Firstly, multimodality is a huge value unlock waiting to happen. Text-only interfaces are limiting. Natural conversation with AI feels quite different. Humans use a lot of show-and-tell to work with each other.

More importantly, integration is everything. A lot of the intelligence we have created goes to waste because it exists in a vacuum, unable to interact with the world around it. Our models lack the necessary sensors and actuators (to borrow terminology I once read in Russell and Norvig’s seminal AI textbook).

It’s not enough for our models to be smart. They need to be easy and natural to work with in order to provide value to businesses and society. That means we need to go beyond our current paradigm of chatting with text emitters.


Appendix A

The code for Lenso is here.

This is prototype quality code. Please do not let this go anywhere near production!


Appendix B

There are a couple of frameworks that I’ve found that help build realtime multimodal GenAI apps, though I haven’t been able to try them out.
