People who publish digital content and those who scrape it from the web don’t always see eye to eye these days, to put it mildly. However, Pierluigi Vinciguerra sees both sides of the argument firsthand. An early adopter of web scraping and author of the popular newsletter The Web Scraping Club, Pierluigi is also a cofounder and CTO of a dataset marketplace, Data Boutique.
At OxyCon 2025, the industry’s biggest annual event hosted by its leading company, Oxylabs, Pierluigi will share how web scraping and AI can come together for the benefit of the content creator. Register here if you wish to watch OxyCon online for free as it airs live on October 1st.
How did you start your journey in web scraping? And what got you interested in it?
Around 2010, Data Boutique’s co-founder, Andrea, and I started collecting data from real estate websites. Our jobs were focused on internal data, but it seemed a pity to see so much real estate data sitting online without anyone analyzing how the market was doing, which cities were booming, and where prices were falling.
For our first demo, we showed the data to a very well-known statistics company. The single snapshot we had collected from one website was larger than their 10 years of historical time-series data on the real estate market. That demo made us understand that web data was a game-changer for any industry.
How did The Web Scraping Club start?
Even now, you see the term web scraping around blogs, but it’s not well understood, especially its technical difficulties. Of course, you have corporate blogs, other kinds of blogs, and some academic content, but corporate blogs are ultimately pushing to sell their products.
There’s nothing wrong with that, but I wanted to understand how to do web scraping from a neutral perspective. Since I had already solved some of these challenges and found the tools, I wanted to share that experience with other people so they could save time instead of roaming around the web looking for a solution.
Time saving, specifically with the help of AI tools, is also what your OxyCon presentation will be about.
I will bring two examples to the presentation: first, the content creation career I have now embraced, and second, the building of scrapers. In content creation, one of the biggest challenges is finding something new to talk about every week or for every article. My traditional workflow was to read all the blog posts and take a lot of notes. LLMs helped me speed up this process while also letting me draw on a wider range of sources.
As for the scraping part, when you have to scrape a large website, you don’t use LLMs because it would be slow and expensive, so you use traditional scraping. But if you have to create a lot of scrapers, you may want to share common rules, templates, and your own coding style across them. I will explain how to do this and how it helped me save a lot of time creating a scraper, because 90% of the work is done by the LLM.
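As an illustration of this kind of workflow (not a description of Pierluigi’s exact setup), here is a minimal Python sketch that feeds an existing reference scraper and a sample of the target page to an LLM and asks for a new scraper in the same style. The OpenAI SDK, model name, and file paths are assumptions for the example.

```python
# A minimal sketch of template-driven scraper generation with an LLM.
# The OpenAI Python SDK, model name, and paths are illustrative assumptions.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def generate_scraper(template_path: str, sample_html: str, site_name: str) -> str:
    """Ask the LLM for a new scraper that follows an existing reference scraper."""
    template = Path(template_path).read_text()
    prompt = (
        "I build web scrapers that all share the same structure. "
        "Here is my reference scraper:\n\n"
        f"{template}\n\n"
        f"Write an equivalent scraper for '{site_name}', keeping the same class "
        "layout, logging, and output schema. Infer the CSS selectors from this "
        "sample of the target page:\n\n"
        f"{sample_html[:8000]}"  # truncated to keep the prompt small
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The generated code is a draft rather than a finished scraper; reviewing selectors, edge cases, and politeness settings still falls to the developer.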
As these AI tools keep getting better at web scraping, do you think the knowledge you have shared through The Web Scraping Club will still be needed? Will the new generation still need to learn web scraping?
This is a great question, because you have two kinds of knowledge. You can share general knowledge, like what the best tools are for scraping, or how to do this or that. This kind of knowledge will probably be replaced by AI, because AI can read documentation, so you won’t need someone telling you how to use tools anymore.
But luckily for me and for the industry in general, web scraping is a complex topic. It’s not just about finding the best selector to extract the data you need from the HTML for your tool or dataset. In an increasing number of cases, it’s about bypassing an anti-bot, finding a smart way to get the HTML, understanding how the website works, and maybe finding an internal API so you can be very polite and lightweight when scraping the target website. You need a lot of creativity in this industry, and I doubt AI will be able to cover that soon.
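To illustrate the internal API point: many sites load their data from a JSON endpoint that the browser calls behind the scenes, and requesting that endpoint directly is far lighter for both sides than rendering and parsing full pages. Here is a minimal Python sketch, with a purely hypothetical endpoint, parameters, and response fields.

```python
# Minimal sketch: call a site's internal JSON API (found via the browser's
# network tab) instead of downloading and parsing rendered HTML pages.
# The endpoint, parameters, and response fields below are hypothetical.
import requests

API_URL = "https://www.example-shop.com/api/v2/products"  # hypothetical endpoint


def fetch_products(page: int = 1, per_page: int = 48) -> list:
    """Fetch one page of products as structured JSON."""
    response = requests.get(
        API_URL,
        params={"page": page, "limit": per_page},
        headers={"User-Agent": "Mozilla/5.0 (research scraper)"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("items", [])


if __name__ == "__main__":
    for product in fetch_products(page=1):
        print(product.get("name"), product.get("price"))
```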
Do you think that nontechnical content creators will start using web data for their content as these tools become more accessible?
Typically, content creators are not technical people, unless they write about tech, of course. For them, having this set of tools to understand their audience and improve idea generation for their blogs or channels is useful. Yes, I think it will be good for them to use web data to improve their workflows.
There’s a tension between content creators and those who scrape content. What is your view on this as a content creator yourself?
Well, it’s hard, because as a content creator I see that traffic is slightly going down. When tools like ChatGPT give you the answer, even if they cite the source, people aren’t clicking through as much. However, it’s up to the creator, in my opinion, to create content that is interesting enough and complex enough not to be simply chunked up by LLMs.
On the other hand, it’s a complex topic. Big AI companies are aggressively scraping blogs, raising your cloud computing bill. I’ve seen that someone is proposing a solution to make AI bots pay-per-crawl, but it’s not feasible. The math doesn’t work. I’m afraid it’s something we cannot avoid unless we put everything under a paywall, and probably this is where we are headed.
But then the whole idea of the internet as a public source of knowledge dies.
I know. It’s painful for me because I saw the web get started. When I was a teenager, there was no internet, which sounds strange if you say it to a teenager today. But yes, the original idea of the internet cannot be preserved in this scenario.
Scraping e-commerce data, which is another major use case, is not that heavy for the target website; the scraper is just one power user among thousands of others. But today’s AI scraping rate is unsustainable for the websites and, I think, also for the AI companies.
Is there room for self-regulation in this area?
Self-regulation is great when you are among peers, a pool of developers who have established a common protocol or guidelines for the web. Now, however, we have a small percentage of people with tremendous power and money.
I’m not an expert in AI, of course, but I think that the competitive advantage one AI company has over another is the data, not the algorithm. The models are more or less equivalent, and the technology used to train them is known. So what really makes the difference is the data they use to train these models. For these companies, limits on data collection are limits on their advantage.
In that case, maybe there is a way to democratize access to data, for example, with the new AI tools enabling more people to share the benefits of web data.
That’s also what we’re trying to do at Data Boutique, a marketplace for web data. Our main idea is that when you and someone else scrape the same website, you get the same data. So why not create a marketplace for datasets instead of everyone scraping the same websites and making scraping unsustainable? If someone is good at scraping whatever.com and makes it cheap for everyone else, everyone wins.
What kind of trends do you expect in web scraping in 2026?
I’m expecting more and more people to use AI tools to extract data, and also more people to enter web scraping, as the initial barriers are being lowered. AI tools and no-code tools on the market are allowing nontechnical people to extract data from the web.
I expect that more people will be interested in an agentic approach to obtaining data. I mean that more people will want to create workflows or agents to extract data from a website, given certain conditions, such as agents for finding and booking the best flight option. If you want to build your own agent, you need to scrape data from various websites. I think this will be a key topic in the next few years.
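To make the agentic idea concrete, here is a toy sketch of the kind of loop such an agent might run: query several sources, filter the results against the user’s conditions, and pick the best option. Every source, field, and function below is a hypothetical placeholder.

```python
# Toy sketch of an agentic flight-search loop. Sources, fields, and the
# per-site fetchers are all hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Offer:
    source: str
    price: float
    stops: int
    departure: str


def fetch_offers(source: str, origin: str, destination: str, date: str) -> list:
    """Placeholder for a per-site scraper or API client."""
    raise NotImplementedError(f"implement a scraper or API client for {source}")


def find_best_flight(origin: str, destination: str, date: str,
                     max_price: float, max_stops: int) -> Optional[Offer]:
    sources = ["site-a.example", "site-b.example"]  # hypothetical targets
    candidates = []
    for source in sources:
        try:
            offers = fetch_offers(source, origin, destination, date)
        except Exception:
            continue  # an agent should degrade gracefully if one source fails
        candidates += [o for o in offers
                       if o.price <= max_price and o.stops <= max_stops]
    return min(candidates, key=lambda o: o.price) if candidates else None
```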