Bypassing the Cloudflare wall for my little scraper

Background

Just as I fixed one problem and got something to work, another piece of my system started breaking. This time, my web scraper is being challenged by Cloudflare's bot check.

Right from the get-go, during initial development, I wanted to host the script with my existing cloud hosting provider. Because it is a shared hosting service, running the script there turned out to be impossible: the sheer amount of activity coming from the shared IP (not from me, but from other users of the service) meant the IP had been blocked by my target site from the very start.

PS: There are external services that offer to bypass this kind of bot check. Sticking to my Frugal Engineering way of working, I want to keep things cheap without resorting to an external service.

In the end, I decided to repurpose the script to run weekly on my own machine. Using my residential IP gets around the IP restriction, and it has helped me build an impressive database for my stock research.

Fetch error

Things changed a few weeks ago. I noticed that my script was failing. I was using a simple fetch, and the response I was getting was “Just a moment”, as though the DOM was stuck and unable to finish loading.

That’s when I realised it might be caused by some kind of bot check. Sorry, no screenshot is available, but behind the scenes the script was being served a challenge page instead of the real content.
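A minimal sketch of how a script could spot this situation, so that it fails loudly instead of parsing the wrong page. The helper name looksLikeBotChallenge() is hypothetical, and the only marker it relies on is the “Just a moment” title mentioned above; a real challenge page may need other checks.

```php
<?php
/**
 * Heuristic check: does this HTML look like the "Just a moment" challenge
 * page rather than the real content?
 *
 * Hypothetical helper for illustration. It only inspects the <title> tag,
 * so ordinary body text containing the same words is not flagged.
 */
function looksLikeBotChallenge(string $html): bool
{
    return preg_match('~<title[^>]*>\s*Just a moment~i', $html) === 1;
}
```

The script can call this on whatever the fetch returned and stop early, or fall back to another way of loading the page, when it returns true.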

Bypass the check

I was at a loss initially, knowing that overcoming a bot check is not easy. I am so glad that we’re now operating in the era of AI, where I have a coding partner to tackle the problem with.

I was trying to use native PHP without any dependencies to keep the script lightweight. I wasn’t able to one-shot the problem with AI; it took a lot of trial and error and testing. In the end, because I run the script manually, the solution for my case was rather simple. I was over-complicating things with AI, haha!

As grand as I am making this sound, the solution was simply to use a headless Chromium browser to read the page.

// Import Browser classes
use HeadlessChromium\BrowserFactory;
use HeadlessChromium\Browser;

// Initialize Browser Factory and Browser instance once
$browserFactory = new BrowserFactory(findBrowserPath()); // Assuming findBrowserPath() is available globally or defined
$browser = $browserFactory->createBrowser([
    'headless' => true,
    'windowSize' => [1920, 1080],
    'enableImages' => false,
    'customFlags' => [
        ...
    ],
    'userAgent' => '...'
]);

This works because the headless browser uses the same browser executable that is installed on my machine. With it, I can make the initial page load and get past the check, then hand back to the rest of the script to continue with the data extraction.
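The page load itself can be sketched roughly as below. fetchRenderedHtml() is a hypothetical helper name, and $browser is assumed to be the HeadlessChromium\Browser instance created earlier; it is left untyped here only so the function can be tried with a stub. Whether a single wait for navigation is enough for the check to clear is also an assumption, and some sites may need a longer wait.

```php
<?php
/**
 * Load a URL in the headless browser and return the rendered HTML.
 *
 * Hypothetical helper: $browser is expected to behave like a
 * HeadlessChromium\Browser (createPage(), then navigate/evaluate on the page).
 */
function fetchRenderedHtml($browser, string $url): string
{
    $page = $browser->createPage();

    try {
        // Navigate and block until the page load has finished.
        $page->navigate($url)->waitForNavigation();

        // Read the DOM as the browser currently sees it.
        $html = $page->evaluate('document.documentElement.outerHTML')->getReturnValue();

        return (string) $html;
    } finally {
        // Always release the tab, even if navigation throws.
        $page->close();
    }
}
```

The returned string can then go to the same extraction code that used to consume the plain fetch response, so the rest of the script does not need to know a browser was involved.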

The AI also implemented functions that I would never have thought to add, like the findBrowserPath() logic. I also used this chance to upgrade the CSS selectors to make them more robust when reading the page. It isn’t one of my strengths, but I managed to get it to work.
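The actual findBrowserPath() is not shown in this post. A minimal sketch of what such a helper could look like is below, assuming it simply tries a list of common install locations and returns the first one that is executable. The default paths are assumptions for illustration and will differ between machines.

```php
<?php
/**
 * Return the first existing, executable browser binary from a list of
 * candidate paths, or null if none is found.
 *
 * Illustrative sketch only: the default candidates are assumed locations,
 * not a definitive list.
 */
function findBrowserPath(?array $candidates = null): ?string
{
    $candidates ??= [
        '/usr/bin/chromium',
        '/usr/bin/chromium-browser',
        '/usr/bin/google-chrome',
        '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
        'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
    ];

    foreach ($candidates as $path) {
        if (is_file($path) && is_executable($path)) {
            return $path;
        }
    }

    // Nothing found: let the caller decide how to handle it.
    return null;
}
```

Passing the candidate list as a parameter keeps the lookup easy to test and lets a machine with an unusual install location override the defaults.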

Conclusion

My original estimate was that, without AI, it would take me days to spike and test solutions. I am glad that I managed to shorten this to within a day, and with a free solution that works for my simplified scraping system architecture.

Another lesson is that putting Cloudflare in front of any service is pretty easy to set up, and I am surprised that they did not implement it sooner. Just a few small steps can help to prevent us from using the data, or at least make our lives difficult when we try.

choong pw

eat to survive, code to dream