Last week I reactivated my old advertising system on one of our company's high-traffic sites. After one day I realized that the numbers of views and clicks were not reasonable: they were ten times higher than I expected.
When I checked the raw site logs, I was surprised to find that about 80% of the site traffic came from search engine robots and crawlers! Not only the well-known American ones such as Google, Bing, and Yahoo, but also the Russian Yandex, the Chinese Baidu, and the Iranian Yooz were very active on the site.
There were two ways to solve the problem:
1. Block search engine traffic and prevent robots and crawlers from indexing my site. Any expert knows this would be suicide!
2. Detect search engine page views and omit them from my statistics.
Unsurprisingly, I chose the second way. I created a dictionary and added the HTTP headers of all known search engines to it. Whenever a visitor (robot or human) requested a page, a script checked the dictionary to decide whether the visit should be counted.
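To illustrate the idea, here is a minimal sketch in Python. It assumes the check is made against the User-Agent header, and the token list and the function names (`detect_bot`, `should_count`) are only illustrative, not the exact dictionary used on the site. The "yooz" token in particular is a guess.

```python
# Illustrative lookup table: substring of the User-Agent header -> search engine.
# The tokens are assumptions for this sketch; a real list would be much longer.
KNOWN_BOT_TOKENS = {
    "googlebot": "Google",
    "bingbot": "Bing",
    "slurp": "Yahoo",
    "yandex": "Yandex",
    "baiduspider": "Baidu",
    "yooz": "Yooz",  # assumed token, not verified
}


def detect_bot(user_agent):
    """Return the search engine name if a known token appears, else None."""
    ua = (user_agent or "").lower()  # tolerate a missing header
    for token, engine in KNOWN_BOT_TOKENS.items():
        if token in ua:
            return engine
    return None


def should_count(user_agent):
    """A page view is counted only when no known crawler token matches."""
    return detect_bot(user_agent) is None
```

A request with `Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)` would be skipped, while an ordinary browser string would be counted.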
This solved 80 to 95 percent of my problem, but I found that many crawlers send HTTP headers that look like those of a personal computer or tablet, with no specific phrase in them that identifies the visitor as a robot or crawler. The solution was to check their IP addresses, see whether they belonged to a known search engine, and again omit them from the page view and click statistics. But here I ran into a problem that still exists:
The site uses the cloudflare.com service for better performance and, unfortunately, the IPs in the raw log belong to Cloudflare, not to the real visitors.
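The IP check itself could look roughly like the sketch below. The network ranges are placeholders taken from the reserved documentation blocks, not real crawler ranges, and `visitor_ip` and `crawler_for_ip` are hypothetical helper names. Cloudflare documents a `CF-Connecting-IP` request header that carries the original visitor address. The sketch assumes that header reaches the script as a plain dictionary entry, which depends on how the server is set up and does not by itself change what the raw log records.

```python
import ipaddress

# Placeholder ranges from the documentation blocks (RFC 5737, RFC 3849).
# Real ranges would have to come from each search engine's published lists.
CRAWLER_NETWORKS = {
    "ExampleBot": [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("2001:db8::/32"),
    ],
}


def visitor_ip(headers, remote_addr):
    """Prefer the address forwarded by Cloudflare, else the peer address.

    Assumes `headers` is a plain dict keyed exactly "CF-Connecting-IP".
    The header should only be trusted when the request really arrived
    through Cloudflare, because a direct client could forge it.
    """
    return headers.get("CF-Connecting-IP", remote_addr)


def crawler_for_ip(ip_text):
    """Return the crawler name whose network contains the IP, else None."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return None  # malformed address: treat as unknown
    for engine, networks in CRAWLER_NETWORKS.items():
        if any(ip in net for net in networks):
            return engine
    return None
```

With these placeholders, `crawler_for_ip(visitor_ip({"CF-Connecting-IP": "192.0.2.10"}, "203.0.113.5"))` returns `"ExampleBot"`, so that view would be left out of the statistics.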
If you have any experience with this problem and have a solution for it, I would appreciate it if you shared it in the comments or contacted me directly.