Pour kettle and let steep the gods of tea. I built NewsBlur and Turn Touch.

ChatGPT agent triggers crawls from Bingbot and Yandex

1 Comment

ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that combines browser automation with terminal access. It replaces the previous Operator research preview, which is scheduled for deprecation on August 31st.

In exploring how it works I found that, for some reason, it triggers crawls of pages it visits from both Bingbot and Yandex!

Investigating ChatGPT agent's user-agent

I started my investigation by creating a logged web URL endpoint using django-http-debug. Then I told ChatGPT agent mode to explore that new page:

ChatGPT screenshot. My prompt was "Visit https://simonwillison.net/test-url-context and tell me what you see there" - it said "Worked for 15 seconds" with an arrow, then a screenshot of the webpage content showing "simonwillison.net" with a favicon, heading "This is a heading", text "Text and text and more text." and "this came from javascript". The bot then responds with: The webpage displays a simple layout with a large heading at the top that reads “This is a heading.” Below it, there's a short paragraph that says “Text and text and more text.” A final line appears underneath saying “this came from javascript,” indicating that this last line was inserted via a script. The page contains no interactive elements or instructions—just these lines of plain text displayed on a white background.

My logging captured these request headers:

Via: 1.1 heroku-router
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Cf-Ray: 96a0f289adcb8e8e-SEA
Cookie: cf_clearance=zzV8W...
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Priority: u=0, i
Sec-Ch-Ua: "Not)A;Brand";v="8", "Chromium";v="138"
Signature: sig1=:1AxfqHocTf693inKKMQ7NRoHoWAZ9d/vY4D/FO0+MqdFBy0HEH3ZIRv1c3hyiTrzCvquqDC8eYl1ojcPYOSpCQ==:
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 45ef5be4-ead3-99d5-f018-13c4a55864d3
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Accept-Encoding: gzip, br
Accept-Language: en-US,en;q=0.9
Signature-Agent: "https://chatgpt.com"
Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"
X-Forwarded-For: 2a09:bac5:665f:1541::21e:154, 172.71.147.183
X-Request-Start: 1754340840059
Cf-Connecting-Ip: 2a09:bac5:665f:1541::21e:154
Sec-Ch-Ua-Mobile: ?0
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Sec-Ch-Ua-Platform: "Linux"
Upgrade-Insecure-Requests: 1

That Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 user-agent header is the one used by the most recent Chrome on macOS - which is a little odd here, as the Sec-Ch-Ua-Platform: "Linux" header indicates that the agent browser runs on Linux.

At first glance it looks like ChatGPT is being dishonest here by not including its bot identity in the user-agent header. I thought for a moment it might be reflecting my own user-agent, but I'm using Firefox on macOS and it identified itself as Chrome.

Then I spotted this header:

Signature-Agent: "https://chatgpt.com"

Which is accompanied by a much more complex header called Signature-Input:

Signature-Input: sig1=("@authority" "@method" "@path" "signature-agent");created=1754340838;keyid="otMqcjr17mGyruktGvJU8oojQTSMHlVm7uO-lrcqbdg";expires=1754344438;nonce="_8jbGwfLcgt_vUeiZQdWvfyIeh9FmlthEXElL-O2Rq5zydBYWivw4R3sV9PV-zGwZ2OEGr3T2Pmeo2NzmboMeQ";tag="web-bot-auth";alg="ed25519"

And a Signature header too.

These turn out to come from a relatively new web standard: RFC 9421 HTTP Message Signatures, published February 2024.

The purpose of HTTP Message Signatures is to allow clients to include signed data about their request in a way that cannot be tampered with by intermediaries. The signature can be verified using a public key that's published at the following well-known endpoint:

https://chatgpt.com/.well-known/http-message-signatures-directory

Add it all together and we now have a rock-solid way to identify traffic from ChatGPT agent: look for the Signature-Agent: "https://chatgpt.com" header and confirm its value by checking the signature in the Signature-Input and Signature headers.
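As a rough illustration of the first half of that check, here is a minimal Python sketch that parses the Signature-Input header shown above and confirms the request claims to come from ChatGPT agent within its validity window. The function names are my own, the parser is deliberately naive (it is not a full RFC 8941 structured-field parser), and it does NOT verify the Ed25519 signature itself - that step still requires fetching the public key from the well-known directory and checking the Signature header against it.

```python
import re


def parse_signature_input(value):
    """Naively split a Signature-Input value into (label, components, params).

    Hypothetical helper: assumes a single signature and that no parameter
    value contains a semicolon, which holds for the header in this article.
    """
    label, _, rest = value.partition("=")
    match = re.match(r"\(([^)]*)\)(.*)", rest)
    if not match:
        raise ValueError("expected a parenthesised list of covered components")
    components = re.findall(r'"([^"]*)"', match.group(1))
    params = {}
    for part in match.group(2).split(";"):
        if not part:
            continue
        key, _, raw = part.partition("=")
        if not raw:
            params[key] = True  # bare parameter with no value
        elif raw.startswith('"'):
            params[key] = raw.strip('"')
        else:
            params[key] = int(raw)  # created / expires are epoch seconds
    return label, components, params


def looks_like_chatgpt_agent(headers, now):
    """Pre-check only: the signature itself is not verified here."""
    if headers.get("Signature-Agent") != '"https://chatgpt.com"':
        return False
    sig_input = headers.get("Signature-Input")
    if not sig_input or "Signature" not in headers:
        return False
    _, components, params = parse_signature_input(sig_input)
    created, expires = params.get("created"), params.get("expires")
    if not isinstance(created, int) or not isinstance(expires, int):
        return False
    return (
        params.get("tag") == "web-bot-auth"
        and "signature-agent" in components  # the agent claim must be signed
        and created <= now <= expires
    )
```

Here headers is assumed to be a plain dict of the logged request headers and now is the current time in epoch seconds, for example int(time.time()). A request that passes this pre-check should still be treated as unverified until the signature has been checked.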

And then came Bingbot

Just over a minute after it captured that request, my logging endpoint got another request:

Via: 1.1 heroku-router
From: bingbot(at)microsoft.com
Host: simonwillison.net
Accept: */*
Cf-Ray: 96a0f4671d1fc3c6-SEA
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
Cf-Ipcountry: US
X-Request-Id: 6214f5dc-a4ea-5390-1beb-f2d26eac5d01
Accept-Encoding: gzip, br
X-Forwarded-For: 207.46.13.9, 172.71.150.252
X-Request-Start: 1754340916429
Cf-Connecting-Ip: 207.46.13.9
X-Forwarded-Port: 80
X-Forwarded-Proto: http

I pasted 207.46.13.9 into Microsoft's Verify Bingbot tool (after solving a particularly taxing CAPTCHA) and it confirmed that this was indeed a request from Bingbot.

I'm reasonably confident the only system that had seen that URL was ChatGPT agent, so apparently there is some kind of mechanism that triggers a Bingbot crawl shortly after it sees a new URL.
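The "just over a minute" gap can be checked from the two X-Request-Start headers logged above. This is a small worked example, assuming those values are Unix timestamps in milliseconds (which is consistent with the created=1754340838 value, in seconds, in the Signature-Input header):

```python
# X-Request-Start values copied from the two logged requests above.
# Assumption: both are Unix epoch timestamps in milliseconds.
agent_request_ms = 1754340840059    # ChatGPT agent request
bingbot_request_ms = 1754340916429  # Bingbot request

gap_seconds = (bingbot_request_ms - agent_request_ms) / 1000
print(gap_seconds)  # 76.37
```

So the Bingbot request arrived roughly 76 seconds after the ChatGPT agent request.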

...and then Yandex?

Before publishing this article I decided to run the experiment one more time, with a new URL, just to confirm my findings.

This time I got the hit from ChatGPT agent... and then within a minute I got a new hit that looked like this:

Via: 1.1 heroku-router
From: support@search.yandex.ru
Host: simonwillison.net
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cf-Ray: 96a16390d8f6f3a7-DME
Server: Heroku
Cdn-Loop: cloudflare; loops=1
Cf-Visitor: {"scheme":"https"}
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Cf-Ipcountry: RU
X-Request-Id: 3cdcbdba-f629-0d29-b453-61644da43c6c
Accept-Encoding: gzip, br
X-Forwarded-For: 213.180.203.138, 172.71.184.65
X-Request-Start: 1754345469921
Cf-Connecting-Ip: 213.180.203.138
X-Forwarded-Port: 80
X-Forwarded-Proto: http

I am absolutely baffled by this. I understand how ChatGPT might have a relationship with Bing, given Microsoft's investment in OpenAI and ChatGPT's usage of Bing for its search feature... but under what circumstances could my URL there be shared with the Yandex crawler?

Yandex suggests a reverse DNS lookup to verify, so I ran this command:

dig -x 213.180.203.138 +short

And got back:

213-180-203-138.spider.yandex.com.

Which confirms that this is indeed a Yandex crawler.
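The same check can be scripted. The sketch below is a hedged Python equivalent of the dig command, with one extra step that is commonly recommended for crawler verification: a forward lookup confirming that the returned hostname resolves back to the same IP. The function names are hypothetical, and the list of accepted domain suffixes is my assumption based on my recollection of Yandex's published guidance, so check it against their current documentation before relying on it.

```python
import socket

# Assumption: Yandex crawler hostnames end in one of these domains.
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def is_yandex_hostname(hostname):
    """True if the hostname sits under one of the assumed Yandex domains."""
    return hostname.rstrip(".").lower().endswith(YANDEX_SUFFIXES)


def verify_yandex_ip(ip, reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it.

    The resolver functions are injectable so the logic can be exercised
    without network access. gethostbyname_ex only handles IPv4 addresses.
    """
    try:
        hostname = reverse(ip)[0]
        if not is_yandex_hostname(hostname):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Calling verify_yandex_ip("213.180.203.138") performs live DNS lookups, so the result depends on the DNS records at the time it is run.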

Oddly enough, this time I didn't get a Bingbot hit at all.

I noticed that the second demo had "web search" enabled, and had run some searches in addition to hitting my page. I tried a third experiment with that turned off and with the prompt:

Visit https://simonwillison.net/information-on-this-page but do not run any other searches or visit any other pages.

This time I got all three - the hit from ChatGPT agent, then a hit from Yandex and then a hit from Bingbot.

Screenshot of a request log interface showing a table with columns for TIMESTAMP, ENDPOINT, METHOD, and QUERY STRING. The header reads "Select request log to view" with an Action dropdown set to "--------" and a "Go" button, showing "0 of 53 selected". Three log entries are visible: all from Aug. 4, 2025 at 10:23 p.m., 10:22 p.m., and 10:21 p.m., all showing "information-on-this-page" endpoint with GET method and "-" for query string.

So what's going on here?

There are quite a few different moving parts here.

  1. I'm using Firefox on macOS with the 1Password and Readwise Highlighter extensions installed and active. Since I didn't visit the debug pages at all with my own browser I don't think any of these are relevant to these results.
  2. ChatGPT agent makes just a single request to my debug URL ...
  3. ... which is proxied through both Cloudflare and Heroku.
  4. Within about a minute, I get hits from one or both of Bingbot and Yandex.

Presumably ChatGPT agent itself is running behind at least one proxy - I would expect OpenAI to keep a close eye on that traffic to ensure it doesn't get abused.

I'm guessing that infrastructure is hosted by Microsoft Azure, though the OpenAI Sub-processor List names Microsoft Corporation, CoreWeave Inc, Oracle Cloud Platform and Google Cloud Platform under the "Cloud infrastructure" section, so it could be any of those.

Since the page is served over HTTPS my guess is that any intermediary proxies should be unable to see the path component of the URL, making the mystery of how Bingbot and Yandex saw the URL even more intriguing.

Tags: bing, privacy, search-engines, user-agents, ai, generative-ai, chatgpt, llms

samuel
12 hours ago
Now it's a mystery!
Cambridge, Massachusetts
denismm
11 hours ago
If you click through, he figured it out - he has a CloudFlare setting enabled to advertise his hits to crawlers.

Lawsuit Alleges That Meta Pirated and Seeded Massive Amounts of Porno for Years to Train AI

1 Comment

Ashley Belanger, writing for Ars Technica:

Porn sites may have blown up Meta’s key defense in a copyright fight with book authors who earlier this year said that Meta torrented “at least 81.7 terabytes of data across multiple shadow libraries” to train its AI models. [...]

After authors revealed Meta’s torrenting, Strike 3 Holdings checked its proprietary BitTorrent-tracking tools designed to detect infringement of its videos and alleged that the company found evidence that Meta has been torrenting and seeding its copyrighted content for years — since at least 2018. Some of the IP addresses were clearly registered to Meta, while others appeared to be “hidden,” and at least one was linked to a Meta employee, the filing said.

According to Strike 3 Holdings, Meta “willfully and intentionally” infringed “at least 2,396 movies” as part of a strategy to download terabytes of data as fast as possible by seeding popular high-quality porn. Supposedly, Meta continued seeding the content “sometimes for days, weeks, or even months” after downloading them, and these movies may also have been secretly used to train Meta’s AI models, Strike 3 Holdings alleged.

The porn site operator explained to the court that BitTorrent’s protocol establishes a “tit-for-tat” mechanism that “rewards users who distribute the most desired content.” It alleged that Meta took advantage of this system by “often” pirating adult videos that are “often within the most infringed files on BitTorrent websites” on “the very same day the motion pictures are released.”

Meta is an empty husk of a company with no values, no beliefs, other than growth and dominance for the sake of growth and dominance.

samuel
1 day ago
Have to admit though that seeding porn in order to boost regular media download speeds is kind of clever
Cambridge, Massachusetts

The Best Way To Store Cut Pineapple So It Lasts

1 Comment and 2 Shares
Pineapple doesn't last as long as you might assume. If you've cut pineapple up into chunks or slices, here's how to keep it fresh for as long as possible.



samuel
14 days ago
Cambridge, Massachusetts
1 public comment
denismm
14 days ago
TLDR: cut in a sealed container in the fridge (squeeze out air if possible), whole on the counter or on the fridge shelf, or frozen in chunks.

Smartphones and Computers Are Now Exempt From Trump’s Latest Tariffs

1 Comment

Auzinea Bacon, CNN:

Electronics imported to the United States will be exempt from President Donald Trump’s reciprocal tariffs, according to a US Customs and Border Protection notice posted late Friday. Smartphones, computer monitors and various electronic parts are among the exempted products. The exemption applies to products entering the United States or removed from warehouses as early as April 5, according to the notice.

The move comes after the Trump administration imposed a minimum tariff rate of 145% on Chinese goods imported to the United States. The tariffs would have a major impact on tech giants like Apple, which make iPhones and other products in China.

Roughly 90% of Apple’s iPhone production and assembly is based in China, according to Wedbush Securities’ estimates. Analysts at Wedbush on Saturday called the tariff exclusion, “the best news possible for tech investors.”

Here’s Commerce Secretary Emily Litella making the announcement on Weekend Update.

samuel
117 days ago
Everything is computer
Cambridge, Massachusetts

Sorry, iPhone Mini Fans: Apple Isn't Planning Another Small Phone

1 Comment
Bloomberg's Mark Gurman today shared bad news for fans of the iPhone mini.


In a live-streamed Q&A session today, Gurman said that Apple currently has no plans to reintroduce a smaller iPhone model.

Apple discontinued the iPhone 13 mini in September 2023, and it has not offered a mini model since then. Apple is not expected to release an iPhone 17 mini this year, and Gurman's revelation likely rules out an iPhone 18 mini next year too, given Apple's multi-year planning and development cycle for future iPhone models.

Since it discontinued the third-generation iPhone SE last month, Apple no longer offers any new iPhone models with under a 6-inch screen size. All of the iPhone 15 and iPhone 16 models that Apple currently sells have between 6.1-inch and 6.9-inch displays, whereas the iPhone 12 mini and iPhone 13 mini had 5.4-inch displays. The final iPhone SE had a 4.7-inch display, albeit with thicker bezels that increased the device's overall size.

While there is a vocal group of customers who wishes that Apple would bring back the iPhone mini, the smaller model simply never sold well enough for the company to continue offering it, according to market research firms. It is not much of a surprise that Apple is not currently reconsidering this decision, but it helps to set expectations for those who may still be holding out hope. Do not expect another iPhone mini any time soon.

The full Q&A audio stream can be replayed on Bloomberg's website.
This article, "Sorry, iPhone Mini Fans: Apple Isn't Planning Another Small Phone" first appeared on MacRumors.com

samuel
134 days ago
Aaaarghghghgh
Cambridge, Massachusetts
skivvie
134 days ago
agreed. Guess i'll be keeping my 13 mini longer

James Webb Space Telescope Spots Mysterious, Free-Floating Mass

1 Comment

The strange body could be a rogue planet or a so-called 'failed star.'

samuel
155 days ago
I remember reading somewhere a long time ago that the vast majority of planets in the universe are rogue planets.
Cambridge, Massachusetts
fancycwabs
154 days ago
Huh. One of the reasons Pluto's not considered a planet anymore is that it doesn't "clear its orbit." Once something no longer has an orbit, is it still considered a planet? I guess I could ask my nephew the astrophysicist.