Last week I re-activated my old advertisement system on one of our company's high-traffic sites. After one day I realized that the numbers of views and clicks were not normal; they were ten times higher than I expected.
When I checked the site's raw logs, I unexpectedly found that about 80% of the traffic belonged to search engine robots and crawlers! Not only the famous American ones like Google, Bing and Yahoo, but also Russia's Yandex, China's Baidu and Iran's Yooz were very active on the site.
There were two possible solutions to the problem:
1. Blocking search engine traffic and preventing their robots and crawlers from indexing my site; any expert knows this action would be suicide!
2. Detecting search engine page views and omitting them from my statistics.
Unsurprisingly, I chose the second way. I created a dictionary and added all known search engine HTTP headers (user-agent strings) to it. When a visitor (robot or human) viewed a page, a script checked the dictionary to decide whether the view should be counted or not.
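The check described above can be sketched roughly like this (a minimal illustration, not my actual script; the user-agent substrings listed are just a few well-known examples, not a complete dictionary):

```python
# Illustrative subset of known crawler user-agent substrings.
KNOWN_BOT_SUBSTRINGS = (
    "googlebot",
    "bingbot",
    "slurp",        # Yahoo's crawler
    "yandexbot",
    "baiduspider",
)

def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in KNOWN_BOT_SUBSTRINGS)

def should_count_view(user_agent: str) -> bool:
    """Count the page view only when the visitor is not a known bot."""
    return not is_known_bot(user_agent)
```

A real deployment would keep the substring list in a regularly updated data file, since crawler user agents change over time.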
This process solved 80 to 95 percent of my problem, but I realized that many crawlers send HTTP headers that look like those of a personal computer or tablet, with no specific phrase identifying them as a robot or crawler. The solution was to check their IPs and, if they belonged to a known search engine, again omit them from the page view and click statistics. But here I ran into a problem that still exists:
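One common way to verify that an IP really belongs to a search engine is the reverse-DNS round trip that Google and Bing document for their crawlers: resolve the IP to a hostname, check that the hostname ends in the engine's domain, then resolve that hostname forward and confirm it maps back to the same IP. A hedged sketch (the domain list is illustrative and incomplete):

```python
import socket

# Illustrative crawler domains; a real list would cover more engines.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_crawler(hostname: str) -> bool:
    """Check whether a reverse-DNS hostname belongs to a known engine."""
    return hostname.endswith(CRAWLER_DOMAINS)

def is_verified_crawler_ip(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm the
    forward lookup maps back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
        if not hostname_is_crawler(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
        return ip in forward_ips                            # must round-trip
    except (socket.herror, socket.gaierror):
        return False
```

This avoids trusting the user agent at all: a bot faking a desktop browser's headers will still fail the round trip unless it really comes from the engine's network.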
The site uses the cloudflare.com service for better performance, and unfortunately the IPs in the raw logs belong to Cloudflare, not to the real visitors.
If you have any experience with this problem, and possibly a solution for it, I would appreciate it if you shared it in the comments section or contacted me directly.