The Reliable Market Signals We Used to Mine from Online Data Are Disappearing

Twenty years ago I started an intelligence firm that mined digital and social media for market insights. This was long before anyone thought of building profile dossiers on individual consumers to tune more efficient spamming and scamming. We just downloaded articles, blog posts and forum discussions to analyze market trends, similar to how Reddit is now used for AI model training.

The signs of digital grifting and gamesmanship were already a feature of the landscape, and it's the evolution of that grift that now gives me pause about the future.

Back then, my first customer was Toyota. They were planning the rollout of the first plug-in Prius in North America and were interested in whether our intelligence could identify anything that wasn't on their radar. We did, which I'll get to, but the real discovery was what I learned about the data we were collecting, and how media was already being gamed in those early days.

At the time, our crawlers were still rudimentary, and I was doing a lot of manual work. I spent the first month of that contract virtually locked in my office 14 hours a day googling, downloading and marking up thousands of pages of blog posts and forum discussions about hybrid cars.

What shocked me then was the number of pages that would appear in Google results with content that made no sense when I visited them. The words on the page were gibberish, peppered every few lines with obvious SEO target keywords. Other than those keywords, the content was complete garbage, designed to manipulate Google's PageRank system for advertising traffic.

Those garbage pages were one of our first big hurdles, eating up crawler cycles and stuffing our database with junk. The first algorithms that became the basis of our patent were developed in response to that challenge.
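
To make the keyword-stuffing problem concrete, here is a minimal Python sketch of the kind of density check that can flag such pages. It is an illustration only, not the patented algorithms mentioned above: the function names, the tiny stopword list and the 0.25 threshold are all assumptions chosen for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real filter would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "with", "that", "this",
}


def keyword_density(text, keywords):
    """Return the fraction of non-stopword tokens that are target keywords."""
    tokens = [
        t for t in re.findall(r"[a-z0-9']+", text.lower())
        if t not in STOPWORDS
    ]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in keywords)
    return hits / len(tokens)


def looks_keyword_stuffed(text, keywords, threshold=0.25):
    """Flag a page whose keyword density exceeds an assumed threshold."""
    return keyword_density(text, keywords) > threshold


if __name__ == "__main__":
    targets = {"hybrid", "prius"}
    junk = "hybrid prius cheap hybrid lantern prius hybrid deal window prius hybrid"
    real = ("I drove the new hybrid for a week and the mileage was better "
            "than my old sedan on the same commute.")
    print(looks_keyword_stuffed(junk, targets))
    print(looks_keyword_stuffed(real, targets))
```

A check like this only catches the crudest pages. It does nothing about text that is grammatical but meaningless, which is where the game went next.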

Just as we developed filters for the junk, however, Google updated PageRank for the same reason and the game evolved. One of the filters Google developed was apparently a syntax parser. Now, instead of garbage pages stuffed with keywords, I would find pages that looked real at first glance, with properly formatted paragraphs and real sentences. But by the time I read to the middle of the second sentence, I couldn't understand what was being said. The grammatical structure was correct, but the meaning nonsensical, and of course, each paragraph was stuffed with keywords.

Grifters had advanced their game in the face of a syntax filter by adding a grammar engine to their word blenders. Dump in some target keywords and the output would be apparently correct sentences devoid of any real meaning. The AdWords revenue continued to flow.

And this is roughly how the arms race has continued over the past two decades. Grifters design for the algorithms that drive traffic, the algorithms improve to limit the grift, and the grifters improve the content just enough to game the algorithms again. All the players were constrained by the limits of machine learning at the time: grammar, syntax and keyword filters and related techniques, all of them short of being able to programmatically determine the meaning of text on a page.

When I delivered our intelligence analysis to Toyota, what we had discovered was a demographic subsector that was a new feature of all the plug-in hybrid hype. Among thousands of conversations we found consumers focused on miles-per-gallon and miles-per-charge. Whichever brand delivered the best distance with a plug-in would be the winner.

But there was a wrinkle. A large number of buyers had been closely watching the industry and were aware of the shortcomings of previous electric vehicles. There had emerged a threshold that many were waiting to see passed before they would buy: they wanted a minimum of 40 miles on a charge, and a charger that could plug into an outlet without needing an electrician to install a transformer.

The forums were full of well-informed early adopters who were happy to sit on their wallets until this minimum viable solution was rolled out, and with our intelligence mining we were able to substantiate the audience. Since Toyota's plug-in Prius met the requirement, our recommendation was simple: focus on the early adopters as the ideal customer profile (ICP), invite them to the dealership for an advance preview of the new Prius, and keep them in the loop on progress to reward their interest and research.

That kind of research depended on real people, talking to other real people, in public, about problems they actually had. The grift was an annoying layer of garbage in Google search results, but the underlying signal, the real conversations online, wasn’t threatened by the content delivery arms race.

That all changed with Generative Pre-trained Transformers (GPTs), in which the technology leaped from word embeddings to what’s called “attention”, a mechanism that lets the model weigh the meaning of each word in context. That’s what has allowed AI to suddenly generate content that sounds real. Now anyone can generate pages that are not only grammatically correct, but semantically correct and even compelling, with sophisticated SEO-gaming strategies embedded, along with advanced social engineering and media linking strategies, all ready to post as polished HTML with little more thought than a prompt.

It's not just that thousands of pages that seem as real as any other page are being artificially generated every minute and cluttering up the media landscape. The more concerning issue is the arrival of agents unleashed to do this content generation, and the tactics they're developing to gain leverage in the game.

Because the real game isn't about producing good content to help consumers; the game is about winning traffic and gaining leverage over buyers.

Where I used to visit posts and analyze their grammar to see if they were real, now I'm seeing entire conversations engineered between content bots to give the illusion of thinking people debating real solutions. Just like the old days when I could watch tactics change with each update of Google's algorithm, now I can watch tactics change with the latest LLM updates.

A few months ago AI posts were painfully obvious, with brand-new profiles repeatedly posting the same spam to promote a product. Now I'm seeing far more sophisticated tactics. One is karma-farming, in which agents create posts designed only to drive up profile metrics for future promotional work. Another is the personal case study, in which an agent professes a very detailed problem and then the supposedly organic solution they discovered, which of course just happens to be the only product mentioned in the post. This scheme has now evolved into one agent posting a problem, and other agents posting comments that broaden out the conversation until another agent comes in and posts a solution.

This is particularly evident on Reddit, where AI slop is starting to crowd out real conversation, and moderation is clearly not up to the task of identifying LLM content. In fact, there are few incentives to do so, as the traffic drives up engagement metrics and ad revenue.

The change underway that's hard to see is that, in the past, grifters were primarily focused on gaming the algorithms to drive traffic. As agents take over more of the content generation and posting activity, the focus is shifting from gaming the algorithms to gaming the humans. The type of social engineering that has long driven scam telemarketers is now being imported for agents to deploy on social media.

What makes this different from the previous cycles is that it’s not just another turn of the wheel, it’s the wheel falling off. In the past, grifters were gaming the search engine algorithms while the search engines themselves were incentivized to contain the grift to keep search results relevant. Now, the manipulation is bypassing the algorithm and targeting users directly, while the platforms are incentivized to allow it in order to boost engagement metrics.

Without spinning up Black Mirror projections of how this might play out in some future horror, it's not hard to understand the damage being done today. Much of our intelligence about markets and consumer trends comes from listening to real people discuss problems and solutions online. Much of AI's training comes from forums like Reddit. But the race to AI is poisoning the well:

Users are rapidly switching from posting their problems and searching for answers in dialog with other people to just asking a GPT, an "answer engine" that conveniently cuts out all the public discussion of the problem. Stack Overflow is already reporting a catastrophic drop in questions and answers online as users flock to AI.

Search engines like Google are aggressively using AI summaries to keep traffic on their platforms, instead of distributing traffic to websites where the user might engage in solution discussions. Super convenient… for now. As fewer users hash through problems and solutions with other users, the very content that makes those AI summaries possible will disappear.

The industry’s answer is synthetic data, RLHF and model-generated training content to fill the gap. That may be plausible for fine-tuning a model on known tasks, but it’s no substitute for the unique signal that comes from real people describing novel problems AI hasn’t yet encountered—which are legion in an economy based on constantly innovating new products and solutions.

AI flooding social networks with artificial content not only displaces the real content that trains AI and informs readers; the slop and uncertainty also become another damper on participation. How does it feel to spend 15 minutes earnestly answering a question only to realize you've been duped by a karma-farming bot?

I'm not at all convinced that we could reliably generate today the kind of research we delivered to Toyota, given the flood of generated content and its impact on market dialog online. I fear that data is all going to be consolidated by the LLMs and gatekeepers now providing easy answers, with no public visibility into the hard work and discussion that leads to those answers. And when that content disappears, I guess we're back to square one.

May 23, 2026

Chris Kenton
