ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that combines browser automation with terminal access. It replaces OpenAI's previous Operator research preview, which is scheduled for deprecation on August 31st.
In exploring how it works I found that, for some reason, it triggers crawls of pages it visits from both Bingbot and Yandex!
Investigating ChatGPT agent's user-agent
I started my investigation by creating a logged web URL endpoint using django-http-debug. Then I told ChatGPT agent mode to explore that new page:
My logging captured these request headers:
Via: 1.1 heroku-router
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Cf-Ray: 96a0f289adcb8e8e-SEA
Cookie: cf_clearance=zzV8W...
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Priority: u=0, i
Sec-Ch-Ua: "Not)A;Brand";v="8", "Chromium";v="138"
Signature: sig1=:1AxfqHocTf693inKKMQ7NRoHoWAZ9d/vY4D/FO0+MqdFBy0HEH3ZIRv1c3hyiTrzCvquqDC8eYl1ojcPYOSpCQ==:
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 45ef5be4-ead3-99d5-f018-13c4a55864d3
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Accept-Encoding: gzip, br
Accept-Language: en-US,en;q=0.9
Signature-Agent: "https://chatgpt.com"
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
X-Forwarded-For: 2a09:bac5:665f:1541::21e:154, 172.71.147.183
X-Request-Start: 1754340840059
Cf-Connecting-Ip: 2a09:bac5:665f:1541::21e:154
Sec-Ch-Ua-Mobile: ?0
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Sec-Ch-Ua-Platform: "Linux"
Upgrade-Insecure-Requests: 1
That Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 user-agent header is the one used by the most recent Chrome on macOS, which is a little odd here because the Sec-Ch-Ua-Platform: "Linux" header indicates that the agent browser runs on Linux.
At first glance it looks like ChatGPT is being dishonest here by not including its bot identity in the user-agent header. I thought for a moment it might be reflecting my own user-agent, but I'm using Firefox on macOS and it identified itself as Chrome.
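The disagreement between those two headers is easy to check mechanically. Here is a minimal Python sketch, assuming the request headers are available as a plain dict. The function name and the token table are hypothetical and only cover a few common platforms:

```python
# Substrings in a User-Agent string mapped to the platform they claim.
# Illustrative only; real user-agent parsing has many more cases.
UA_PLATFORM_TOKENS = {
    "Macintosh": "macOS",
    "Windows NT": "Windows",
    "X11; Linux": "Linux",
}


def platform_mismatch(headers):
    """Return (ua_platform, hint_platform) if the two disagree, else None."""
    user_agent = headers.get("User-Agent", "")
    # Sec-Ch-Ua-Platform is sent as a quoted string such as "Linux".
    hint = headers.get("Sec-Ch-Ua-Platform", "").strip().strip('"')
    ua_platform = next(
        (name for token, name in UA_PLATFORM_TOKENS.items() if token in user_agent),
        None,
    )
    if ua_platform and hint and ua_platform != hint:
        return ua_platform, hint
    return None


example_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36",
    "Sec-Ch-Ua-Platform": '"Linux"',
}
print(platform_mismatch(example_headers))
```

For the headers captured above this reports macOS from the user-agent and Linux from the client hint.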
Then I spotted this header:
Signature-Agent: "https://chatgpt.com"
Which is accompanied by a much more complex header called Signature-Input:
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
And a Signature header too.
These turn out to come from a relatively new web standard: RFC 9421 HTTP Message Signatures, published February 2024.
The purpose of HTTP Message Signatures is to allow clients to include signed data about their request in a way that cannot be tampered with by intermediaries. The signature can be verified against a public key that's published at the following well-known endpoint:
https://chatgpt.com/.well-known/http-message-signatures-directory
Add it all together and we now have a rock-solid way to identify traffic from ChatGPT agent: look for the Signature-Agent: "https://chatgpt.com" header and confirm its value by checking the signature in the Signature-Input and Signature headers.
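To make that check a little more concrete, here is a rough Python sketch of the first half of verification: rebuilding the "signature base" string that RFC 9421 says gets signed. It is a simplification, not a full implementation. It assumes a single signature labelled sig1 and only the three derived components seen in the header above. The function names and the request path are hypothetical, since the path of my debug URL isn't shown in the log.

```python
import re


def build_signature_base(method, authority, path, headers, label="sig1"):
    """Rebuild the RFC 9421 signature base for one labelled signature.

    Simplified: assumes Signature-Input holds a single signature and that
    the only derived components are @method, @authority and @path.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    sig_input = lowered["signature-input"]
    if not sig_input.startswith(label + "="):
        raise ValueError(f"no signature labelled {label!r}")
    params = sig_input[len(label) + 1:]

    # The covered components are the quoted names inside the parentheses.
    inner = params[params.index("(") + 1:params.index(")")]
    components = re.findall(r'"([^"]+)"', inner)

    derived = {"@method": method.upper(), "@authority": authority.lower(), "@path": path}
    lines = []
    for name in components:
        value = derived[name] if name.startswith("@") else lowered[name].strip()
        lines.append(f'"{name}": {value}')
    # The final line repeats the component list and its parameters verbatim.
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)


def validity_window(headers):
    """Return the (created, expires) Unix timestamps from Signature-Input."""
    lowered = {name.lower(): value for name, value in headers.items()}
    sig_input = lowered["signature-input"]
    created = int(re.search(r";created=(\d+)", sig_input).group(1))
    expires = int(re.search(r";expires=(\d+)", sig_input).group(1))
    return created, expires


example_headers = {
    "Signature-Agent": '"https://chatgpt.com"',
    "Signature-Input": (
        'sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;'
        'keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;'
        'nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";'
        'tag="web-bot-auth";alg="ed25519"'
    ),
}

# "/hypothetical-debug-path" stands in for the real path, which is not shown.
print(build_signature_base("GET", "simonwillison.net", "/hypothetical-debug-path", example_headers))
print(validity_window(example_headers))
```

The resulting string, encoded as bytes, is what an Ed25519 verifier would check against the base64 value between the colons in the Signature header, using the key from the well-known directory that matches keyid. Python's standard library has no Ed25519 support, so that last step needs a third-party cryptography library and is left out here. A real verifier should also reject requests outside the created/expires window, which in this capture is one hour wide.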
And then came Bingbot
Just over a minute after it captured that request, my logging endpoint got another request:
Via: 1.1 heroku-router
From: bingbot(at)microsoft.com
Host: simonwillison.net
Accept: */*
Cf-Ray: 96a0f4671d1fc3c6-SEA
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 6214f5dc-a4ea-5390-1beb-f2d26eac5d01
Accept-Encoding: gzip, br
X-Forwarded-For: 207.46.13.9, 172.71.150.252
X-Request-Start: 1754340916429
Cf-Connecting-Ip: 207.46.13.9
X-Forwarded-Port: 80
X-Forwarded-Proto: http
I pasted 207.46.13.9 into Microsoft's Verify Bingbot tool (after solving a particularly taxing CAPTCHA) and it confirmed that this was indeed a request from Bingbot.
I'm reasonably confident the only system that had seen that URL was ChatGPT agent, so apparently there is some kind of mechanism that triggers a Bingbot crawl shortly after it sees a new URL.
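The "just over a minute" gap can be read straight off the logged headers. A small Python sketch of the arithmetic, assuming X-Request-Start is the Heroku router's receive time in milliseconds since the Unix epoch and that created in Signature-Input is in seconds:

```python
from datetime import datetime, timezone

agent_request_start_ms = 1754340840059    # X-Request-Start on the ChatGPT agent hit
bingbot_request_start_ms = 1754340916429  # X-Request-Start on the Bingbot hit
signature_created_s = 1754340838          # created= in the agent's Signature-Input

# Time between the agent's request and Bingbot's follow-up.
gap_ms = bingbot_request_start_ms - agent_request_start_ms
# Time between the agent signing its request and the request arriving.
signing_lead_ms = agent_request_start_ms - signature_created_s * 1000

agent_time = datetime.fromtimestamp(agent_request_start_ms / 1000, tz=timezone.utc)

print(f"ChatGPT agent request received: {agent_time.isoformat()}")
print(f"Bingbot followed {gap_ms / 1000:.2f} seconds later")
print(f"Signature was created {signing_lead_ms / 1000:.2f} seconds before arrival")
```

That works out to Bingbot arriving 76.37 seconds after ChatGPT agent, with the agent's signature created roughly two seconds before its request reached the router.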
...and then Yandex?
Before publishing this article I decided to run the experiment one more time, with a new URL, just to confirm my findings.
This time I got the hit from ChatGPT agent... and then within a minute I got a new hit that looked like this:
Via: 1.1 heroku-router
From: support@search.yandex.ru
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cf-Ray: 96a16390d8f6f3a7-DME
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Cf-Ipcountry: RU
X-Request-Id: 3cdcbdba-f629-0d29-b453-61644da43c6c
Accept-Encoding: gzip, br
X-Forwarded-For: 213.180.203.138, 172.71.184.65
X-Request-Start: 1754345469921
Cf-Connecting-Ip: 213.180.203.138
X-Forwarded-Port: 80
X-Forwarded-Proto: http
I am absolutely baffled by this. I understand how ChatGPT might have a relationship with Bing, given Microsoft's investment in OpenAI and ChatGPT's usage of Bing for its search feature... but under what circumstances could my URL there be shared with the Yandex crawler?
Yandex suggest a reverse DNS lookup to verify, so I ran this command:
dig -x 213.180.203.138 +short
And got back:
213-180-203-138.spider.yandex.com.
Which confirms that this is indeed a Yandex crawler.
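A reverse lookup alone can be spoofed by whoever controls the reverse zone for an IP, so the usual belt-and-braces version also resolves the returned hostname forwards and checks that it maps back to the same address. Here is a sketch of that in Python. The function name is hypothetical, and the list of accepted suffixes is my assumption about which domains Yandex uses for its crawlers, so check it against their documentation. The resolver arguments exist only so the logic can be exercised without network access.

```python
import socket

# Assumed crawler hostname suffixes; confirm against Yandex's own documentation.
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def verify_crawler_ip(ip, suffixes=YANDEX_SUFFIXES, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    # Default resolvers use the system DNS; tests can inject fakes instead.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (
        lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    )
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.endswith(suffixes):
        return False  # PTR points somewhere other than the expected domains
    try:
        # The hostname must resolve back to the original address.
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Performs real DNS lookups, so the result depends on the network.
    print(verify_crawler_ip("213.180.203.138"))
```

The suffixes argument is there so the same check could be pointed at another crawler's documented domains.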
Oddly enough, this time I didn't get a Bingbot hit at all.
I noticed that the second demo had "web search" enabled, and had run some searches in addition to hitting my page. I tried a third experiment with that turned off and with the prompt:
Visit https://simonwillison.net/information-on-this-page but do not run any other searches or visit any other pages.
This time I got all three - the hit from ChatGPT agent, then a hit from Yandex and then a hit from Bingbot.
So what's going on here?
There are quite a few different moving parts here.
- I'm using Firefox on macOS with the 1Password and Readwise Highlighter extensions installed and active. Since I didn't visit the debug pages at all with my own browser I don't think any of these are relevant to these results.
- ChatGPT agent makes just a single request to my debug URL ...
- ... which is proxied through both Cloudflare and Heroku.
- Within about a minute, I get hits from one or both of Bingbot and Yandex.
Presumably ChatGPT agent itself is running behind at least one proxy - I would expect OpenAI to keep a close eye on that traffic to ensure it doesn't get abused.
I'm guessing that infrastructure is hosted by Microsoft Azure, though the OpenAI Sub-processor List names Microsoft Corporation, CoreWeave Inc, Oracle Cloud Platform and Google Cloud Platform under its "Cloud infrastructure" section, so it could be any of those.
Since the page is served over HTTPS my guess is that any intermediary proxies should be unable to see the path component of the URL, making the mystery of how Bingbot and Yandex saw the URL even more intriguing.
Tags: bing, privacy, search-engines, user-agents, ai, generative-ai, chatgpt, llms