Leaders: Don't prematurely block OpenAI from your websites
You can now block the creators of ChatGPT from using your web content to train AI. However, doing so without thinking may damage your brand and influence.
Apr 15, 2024 • 8 Minute Read
This month, ChatGPT creator OpenAI released information on how to block their web crawler from scouring people’s websites. A lot of companies celebrated this news, especially since they didn’t like the idea of their content being used to train AI models, and rushed to implement the two lines of code to block the bot.
However, blocking AI crawlers from your site without thinking it through, especially the one behind ChatGPT, could be a bad business decision. You’re effectively pulling yourself out of a conversation, one that may go on without you.
Blocking your site from ChatGPT is like removing yourself from Google
Imagine that you removed your site from the world’s most popular search engine. Needless to say, that’s a bad decision, right? Your competitors would rise to first place, and they’d have your customers’ attention instead.
Even for non-commercial organizations like government agencies or not-for-profits, pulling yourself off the grid means you’re allowing that space to be filled with something else, such as misinformation. Right now, people are using tools like ChatGPT the way they use Google, and the chatbot can only share with the user what it knows.
Example: A solar lighting business losing brand influence and sales
Let’s say you’re in the business of selling solar airfield lights, a niche market. You decide to block AI from being trained on your web data, because you want people to come to your website to get educated and turn into customers. However, your competitors, who sell wired airfield lights, don’t block their website.
The AI crawls all the information out there, and after a while, it has only been trained on information about the benefits of wired airfield lights. Then, someone asks it about airfield lighting.
And because it only knows about wired lights, that's what it starts talking about. Worse, it repeats propaganda that your competitors have spread about how solar lights are unreliable based on weather conditions, completely skewing the conversation.
Customers who use ChatGPT as a starting point for their research, a share that grows larger each day, are taught to be prejudiced against solar lights before they even start shopping around.
Once an AI starts perpetuating a myth, it’s hard to stop the cycle
Here’s another scenario: an AI crawls a range of websites to learn the current wisdom around taking medication. Many of them say “Herbal medicines are always safe because they’re natural”, even though this is untrue. And then, someone asks it about herbal medicines.
And because the AI doesn't have a brain of its own, it starts to repeat the myth about herbal medicines. Then, a second AI crawls the content that the first AI wrote, and does the same when it’s asked to write an article.
Once this cycle of misinformation starts, it’s very hard to stop. The amount of bad advice starts to outweigh the good advice, and short of hard-coding in some rules, the AI is thoroughly convinced it should tell people that unregulated medications are safe.
So, how does this affect you as a business? Well, if we go back to our lighting example earlier, if the AI starts writing articles about how wired lights are the best option, things get even worse for our solar light business. Not only is the chatbot telling users that they shouldn’t consider your product, there’s also now a ton of web content out there that people are reading saying the same thing.
As a publisher, being on a database brings in advertisers
The original model of ChatGPT was based on GPT-3.5, which was trained on Common Crawl. Common Crawl is a publicly available dataset that comes from a bot that crawls the whole internet. Some companies, such as Alpha Quantum, use datasets like these to create lists of websites to target with advertising.
Although the bot OpenAI now uses is called GPTBot, not Common Crawl’s CCBot, the same principle is worth taking into consideration before you block crawlers from your website.
The other side of the coin: Why you might want to block anyway
So far we’ve talked about the negative business impacts of blocking AI models from crawling your website. But what about the reasons you might want to do it?
1. If you’re selling information as your business model
If you’re a news agency or fiction publication, you might be inclined to block AI crawlers. After all, if someone can ask ChatGPT what happened in the news today, it undercuts your whole business model.
Notably, The Verge and Neil Clarke of Clarkesworld have done that already. This might also make sense for publications such as the New York Times or DW.com (though doing so will add to the AI’s knowledge deficit about the world, which may not be beneficial for society as a whole).
Currently, OpenAI’s GPTBot reportedly does not crawl sources that require paywall access, so sites that already have a paywall in place may be safe. For everything else, a block only works if you trust OpenAI to respect the robots.txt entry.
2. If people are using the AI as a replacement for your primary product
If you’re an SEO agency and people are consulting ChatGPT for all their keyword research rather than using what you’re offering, it may be in your organization’s best interest to block it. This is a scenario where the AI is acting as a direct competitor, and just like with any competitor, it doesn’t make sense to help it along.
3. If you’re concerned about being taken out of context or losing attribution
If you’re worried about your copyrighted content or information being put out there without guarantee of attribution or accuracy, then you might want to block AI models from crawling your site. Content outside of its original context could mislead the reader, which may be something you want to avoid.
That said, there’s also a counter-argument: if you’re not crawled, people may be misled even further by a less accurate source. This is something you may need to weigh up.
4. Your organization has an ethical objection to the use of AI
If you’ve got a company-wide stance that you do not endorse the use of AI products such as ChatGPT, then it makes sense to block your website from being used to train those models. This might be applicable to a writers’ or actors’ guild website, for instance.
5. You need the traffic to your website more than the influence
If you’re seeing a significant drop in traffic and conversions and it’s due to people using ChatGPT instead of coming to your website, then obviously it makes sense to block AI models from training on your current and future content.
6. You’re too unique or small to worry about missing out on the conversation
If you’re the only seller of your particular product or service and there’s nobody having the conversation about it but you, then you might not want to give ChatGPT the traffic instead of driving it to your website (which is no doubt at the top of the Google rankings due to an utter lack of competition).
You may want to block certain parts of your website
Blocking OpenAI from crawling your website does not need to be an all-or-nothing solution: you can block AI models from training on some of your content while leaving the rest open. This can be particularly useful if you’ve got a business case for keeping certain content private, but you’re fine with your home page or other site sections being used.
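To make partial blocking more concrete, here is a minimal sketch in Python. It assumes the block is expressed as robots.txt rules for the GPTBot user agent, and it uses Python’s standard-library robots.txt parser to show which pages such rules would leave open. The site address and the /premium/ and /research/ paths are hypothetical, chosen only for illustration. The parser only approximates how a well-behaved crawler reads the file; it is not a guarantee of how OpenAI’s crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is kept out of two sections only.
# The paths below are illustrative assumptions, not from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /premium/
Disallow: /research/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_report(parser: RobotFileParser, user_agent: str, urls: list[str]) -> dict[str, bool]:
    """Map each URL to True if the given user agent may fetch it."""
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    urls = [
        "https://example.com/",
        "https://example.com/blog/solar-lights",
        "https://example.com/premium/report",
        "https://example.com/research/data",
    ]
    parser = build_parser(ROBOTS_TXT)
    for agent in ("GPTBot", "Googlebot"):
        print(agent)
        for url, allowed in crawl_report(parser, agent, urls).items():
            print(f"  {'allow' if allowed else 'block'}  {url}")
```

Under these rules, the home page and blog stay open to GPTBot while the two listed sections are closed to it, and other crawlers are unaffected. Replacing the two Disallow lines with a single "Disallow: /" would turn this into a site-wide block for GPTBot.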
Conclusion: Choosing to block or not to block should not be a knee-jerk reaction
Whether or not you block AI from training on your website data, the consequences can be far-reaching. That’s why you’ve got to make a measured decision on what works best for your organization’s interests.
If you do decide to block OpenAI from crawling some or all of your website, it’s very simple. You can learn how by reading our article: “How to block OpenAI from crawling your website.”
Further learning on ChatGPT and AI for your business
If you’re trying to figure out how to make business decisions around AI products like ChatGPT and want to learn more, you should check out the courses that Pluralsight offers on AI. You can sign up for a 10-day free trial and check these out.
If you’re wondering how to deal with your company’s usage of ChatGPT and similar products, here are some articles that may help:
Organizations, don’t ban generative AI: Write a usage policy
Security reviews and AI models: How to decide what to greenlight