OperateAI
AI Automation · 8 min read

How to Build a WhatsApp AI Agent That Handles Photos, Voice Notes, and PDFs — A Real Case Study

Ajay Singhadiya

Founder, OperateAI · Published 10 April 2026 · Last updated: April 2026

⚡ AI Summary — TL;DR

  • 90% of WhatsApp chatbots only handle text — making them useless for how real customers actually communicate.
  • Built a full-stack WhatsApp AI agent for a Jaipur grocery shop in one weekend (~14 hours) that handles photos, voice notes, videos, and PDFs.
  • The system processes 200–300 messages/day with 96% order accuracy and saves the owner 2–3 hours every single day.
  • Total running cost: ₹1,500/month. Traditional equivalent: ₹15–20 lakh and a team of 5 developers.

The Problem with Every WhatsApp Chatbot You've Seen

A grocery shop owner in Jaipur sent me a voice note at 11:47 PM — laughing. A customer had just sent a photo of a handwritten grocery list, the bot read it, replied with the full order and total, and the customer thought the owner was personally typing at midnight.

The bot was built in one weekend.

But before we got there, I had to understand why every existing WhatsApp chatbot had already failed this shop owner.

90% of WhatsApp chatbots only handle text. Voice notes, images, PDFs, videos — ignored. A bot that replies "Sorry, I can only process text messages" to a photo of a grocery list is not a chatbot. It's a glorified auto-reply.

The owner's customers were sending:

  • Handwritten grocery list photos at all hours
  • Voice notes in Hindi, Hinglish, and English — especially late at night
  • Videos of products asking "do you have this?"
  • PDF bulk orders from restaurants and caterers

The owner was manually handling 200–300 messages per day. Every voice note meant stopping, listening, typing a response. 2–3 hours every night. Just WhatsApp.

I took the project on one condition: it would handle everything a customer can send — not just text.


The Architecture: 19 Nodes, 5 Pipelines, One Brain

The system runs entirely on self-hosted N8N. One webhook receives everything WhatsApp sends. From there, a routing filter identifies the message type and sends it down the correct pipeline.

Routing flow:

Message Type    | Pipeline
Text            | Directly to AI Agent
Image           | GPT-4o Vision → Extract items → AI Agent
Voice Note      | Whisper transcription → AI Agent
Video           | Gemini analysis → AI Agent
Document (PDF)  | Text extraction → AI Agent

All five paths converge into the same AI Grocery Agent (Gemini 3.1 Pro), which generates a reply, sends it via WhatsApp API, and logs the order to Google Sheets.

19 nodes total. Every message type handled. Zero manual intervention.
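
The routing step can be sketched roughly as follows. This is a minimal Python illustration, not the actual N8N workflow: the pipeline labels and the `route_message` function are hypothetical. The payload shape (`entry` → `changes` → `value` → `messages`) follows the WhatsApp Business Cloud API webhook format, where voice notes arrive with type `audio`.

```python
from typing import Optional

# Hypothetical pipeline labels; the real workflow uses N8N nodes instead.
PIPELINES = {
    "text": "ai_agent",
    "image": "vision_extract",
    "audio": "whisper_transcribe",  # WhatsApp voice notes arrive as type "audio"
    "video": "video_analysis",
    "document": "pdf_text_extract",
}


def route_message(payload: dict) -> Optional[str]:
    """Return the pipeline label for the first message in a webhook payload.

    Returns None when the payload carries no customer message at all,
    and "unsupported" for message types outside the five pipelines.
    """
    try:
        value = payload["entry"][0]["changes"][0]["value"]
    except (KeyError, IndexError, TypeError):
        return None
    messages = value.get("messages") or []
    if not messages:
        return None
    return PIPELINES.get(messages[0].get("type"), "unsupported")


# Illustrative payload with a made-up sender number.
voice_note = {
    "entry": [{"changes": [{"value": {
        "messages": [{"from": "910000000000", "type": "audio"}]
    }}]}]
}
print(route_message(voice_note))  # whisper_transcribe
```

Returning an explicit "unsupported" label (rather than failing) leaves room for a polite fallback reply when a customer sends something like a sticker or a location pin.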


How Each Media Type Actually Works

Images — The Handwritten List Problem

GPT-4o Vision is configured with one job: extract every grocery item, quantity, and brand name from whatever image the customer sends — handwritten lists, product photos, screenshots of someone else's order.

Accuracy: ~94%. The remaining 6% is handwriting that humans can't read either.

Voice Notes — The Midnight Order

Voice notes go to OpenAI Whisper for transcription, then straight to the AI agent.

A real example the system handled: "Bhai, 2 kg aloo, 1 litre doodh, ek packet Maggi, aur haan — kya paneer hai? 200 gram bhej dena."

The agent receives that transcription, identifies each item, checks prices, calculates the total, and replies — in Hindi — within seconds. It handles Hindi, Hinglish, and English natively, switching based on what the customer sends.

This single feature saved the owner 2+ hours per night of manually listening and typing.

Videos — "Do You Have This?"

Customers send videos of products they found elsewhere, asking if the shop stocks them. Gemini 3.1 Pro handles video analysis by extracting key frames and identifying the products.

Documents — The Bulk Order PDF

Restaurants, caterers, and small businesses send PDF order lists. The workflow downloads the file, extracts the text, and processes it identically to any other order. This became the most profitable feature — high-value B2B orders fully automated.


The AI Brain

The core agent runs on Gemini 3.1 Pro with a structured system prompt containing everything it needs to operate the shop:

  • Complete product catalog with prices (Rice ₹95/kg, Paneer ₹90/200g, Onion ₹35/kg)
  • Store hours, delivery radius, accepted payment methods
  • Delivery rules — free above ₹500, ₹40 fee otherwise, 30–60 minutes within 5 km
  • Language instruction — reply in whatever language the customer uses
  • Order confirmation flow — itemized list, total, confirmation request, address collection
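
As a worked example of the pricing and delivery rules above, here is a small Python sketch of how an order total comes together. In the real system the model does this from the prompt, so the `order_total` function and the `CATALOG` dict are illustrative assumptions. Only the three prices quoted above are used, and "free above ₹500" is read as strictly greater than ₹500, which is also an assumption.

```python
from decimal import Decimal

# Prices quoted in the article, in rupees per unit.
# Units: rice and onion per kg, paneer per 200 g pack.
CATALOG = {
    "rice": Decimal("95"),
    "paneer": Decimal("90"),
    "onion": Decimal("35"),
}

FREE_DELIVERY_ABOVE = Decimal("500")
DELIVERY_FEE = Decimal("40")


def order_total(items: dict) -> dict:
    """Compute subtotal, delivery fee, and total.

    `items` maps a product name to the number of units ordered
    (kg for rice and onion, 200 g packs for paneer).
    """
    subtotal = sum(
        (CATALOG[name] * Decimal(str(qty)) for name, qty in items.items()),
        Decimal("0"),
    )
    fee = Decimal("0") if subtotal > FREE_DELIVERY_ABOVE else DELIVERY_FEE
    return {"subtotal": subtotal, "delivery_fee": fee, "total": subtotal + fee}


# 2 kg rice (190) + 1 pack paneer (90) + 1 kg onion (35) = 315, plus the ₹40 fee.
small = order_total({"rice": 2, "paneer": 1, "onion": 1})
print(small["total"])  # 355

# 5 kg rice (475) + 1 pack paneer (90) = 565, which crosses the free-delivery line.
large = order_total({"rice": 5, "paneer": 1})
print(large["total"])  # 565
```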

Conversation Memory: N8N's Window Buffer Memory, keyed to each customer's phone number. The bot remembers what was said earlier in the conversation. This solves the failure point of 90% of WhatsApp bots — the ones that forget your last message the moment you send another.
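
The idea behind per-customer windowed memory fits in a few lines of Python. This is not N8N's implementation, just a sketch of the concept: the `WindowMemory` class, the window size, and the phone number are all assumptions for illustration.

```python
from collections import defaultdict, deque


class WindowMemory:
    """Keeps only the most recent `window` messages per customer phone number."""

    def __init__(self, window: int = 10):
        # Each phone number gets its own bounded queue; old messages fall off.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def add(self, phone: str, role: str, text: str) -> None:
        self._history[phone].append((role, text))

    def context(self, phone: str) -> list:
        """Messages to pass to the model alongside the new one."""
        return list(self._history.get(phone, ()))


memory = WindowMemory(window=2)
memory.add("910000000000", "customer", "2 kg aloo")
memory.add("910000000000", "bot", "Added 2 kg aloo. Anything else?")
memory.add("910000000000", "customer", "1 litre doodh bhi")

# With window=2 the oldest message has been dropped.
print(memory.context("910000000000"))
```

Keying on the phone number keeps one customer's half-finished order from leaking into another customer's conversation.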


The Bug That Almost Killed It on Launch Day

Launch day. Bot deployed. First real messages coming in.

Complete silence. No replies. No errors in the logs.

Three hours of debugging later, I found the issue: WhatsApp sends "status update" webhooks (delivered notifications, read receipts) through the exact same endpoint as customer messages. The bot received these and tried to process them as customer orders. Each reply it generated triggered another status update, which the bot then tried to process again.

An infinite loop. The bot was talking to itself.

The fix was a single filter node at the top of the workflow: Is this an actual message from a customer, or a status notification?

One node. Three hours of debugging.

Add this filter before anything else. Always.
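
A minimal sketch of that filter, written in Python rather than as an N8N node. It assumes the WhatsApp Business Cloud API webhook shape, where inbound customer messages appear under a `messages` array and delivery or read notifications under `statuses`. The function name and sample payloads are hypothetical.

```python
def is_customer_message(payload: dict) -> bool:
    """True only if the webhook carries at least one inbound customer message."""
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            value = change.get("value", {})
            if value.get("messages"):
                return True
    return False


# A read receipt: has "statuses", no "messages". Should be dropped.
status_update = {
    "entry": [{"changes": [{"value": {
        "statuses": [{"id": "wamid.example", "status": "read"}]
    }}]}]
}

# A real customer message. Should continue into the workflow.
customer_text = {
    "entry": [{"changes": [{"value": {
        "messages": [{"from": "910000000000", "type": "text",
                      "text": {"body": "2 kg aloo"}}]
    }}]}]
}

print(is_customer_message(status_update))  # False
print(is_customer_message(customer_text))  # True
```

Checking for the presence of `messages`, instead of checking for the absence of `statuses`, means any other notification type that shows up later is also ignored by default.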


Real Numbers

Metric                                     | Value
Messages processed per day                 | 200–300
Non-text media (voice, image, video, PDF)  | 35% of all messages
Average response time                      | 3–5 seconds
Order accuracy                             | 96%
Hours saved per day                        | 2–3 hours
Monthly running cost                       | ₹1,500
Build time                                 | 1 weekend (~14 hours)
Image recognition accuracy                 | ~94%
Traditional development equivalent         | ₹15–20 lakh + 5 developers

The owner's words: "Mere customers sochte hain ki maine koi banda rakha hai jo raat ko bhi reply karta hai."

(My customers think I've hired someone who replies even at night.)


Tech Stack

Component               | Tool
Workflow orchestration  | N8N (self-hosted)
Messaging               | WhatsApp Business Cloud API
Image analysis          | GPT-4o Vision
Voice transcription     | OpenAI Whisper
Core AI agent + video   | Gemini 3.1 Pro
Order logging           | Google Sheets

Total API cost: ~₹1,500/month. Two years ago, this system would have required a team of 5 developers, a custom mobile app, and a minimum of ₹15–20 lakh to build.

If you're wondering which of these stack choices would fit your business, our complete AI tools guide for B2B SMBs walks through the decision framework, and how to automate lead follow-up with AI is the highest-ROI next build after a WhatsApp agent.


What I'd Do Differently

1. Use a real database from day one. Hardcoding the product catalog inside the AI prompt works for a small shop. It breaks the moment the catalog grows or prices change. Postgres from the start — products as rows, prices updatable without touching the AI prompt.

2. Build order storage properly. The current system logs to Google Sheets. V2 needs proper order numbers, delivery status tracking, and billing integration. Sheets was fine for an MVP. It's not fine for a shop doing real volume.

3. Test with real customers earlier. Real users did things no test scenario predicted: emoji orders (3 🍎 meant "3 apples"), multiple separate orders in one message, and one customer who sent a screenshot of someone else's order saying "same thing." Real usage is always stranger than your test cases.
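
To make the first point concrete: keep the catalog in a table and render the price list into the prompt at request time. The sketch below uses Python's built-in `sqlite3` as a stand-in so it runs anywhere. The suggestion above is Postgres, and the table name, columns, `catalog_prompt` helper, and the changed onion price are all assumptions for illustration.

```python
import sqlite3


def catalog_prompt(conn: sqlite3.Connection) -> str:
    """Render the current catalog as text to embed in the system prompt."""
    rows = conn.execute(
        "SELECT name, price_inr, unit FROM products ORDER BY name"
    ).fetchall()
    return "\n".join(f"{name}: ₹{price}/{unit}" for name, price, unit in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "name TEXT PRIMARY KEY, price_inr INTEGER NOT NULL, unit TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("Rice", 95, "kg"), ("Paneer", 90, "200g"), ("Onion", 35, "kg")],
)

# A price change is a one-row update; the prompt template stays untouched.
conn.execute("UPDATE products SET price_inr = ? WHERE name = ?", (40, "Onion"))

print(catalog_prompt(conn))
```

The same table can later back order numbers and delivery status, which is where the second point is heading.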


Want This Built for Your Business?

This system — 19 nodes, 5 media pipelines, AI-powered order processing with conversation memory — runs on a self-hosted N8N instance for ₹1,500/month in API costs.

If your business runs on WhatsApp and your team is spending hours manually responding to messages, this is exactly the kind of system we build at OperateAI.

Book a Free Automation Audit →


Originally published on LinkedIn
