I am curious whether search engine spiders can access content that is protected with aMember. It appears the system works on session variables (just a guess), and I was hoping someone could show me a way to let the search engine spiders index those pages. We have content that we want to keep protected but still have indexed. Thanks for any help anyone can provide.
It is possible to allow access by user-agent. The following line is in amember/plugins/protect/php_include/check.inc.php:

if (preg_match('/^googlebot/i', $_SERVER['HTTP_USER_AGENT'])) return;

You may uncomment it, and Google will be allowed to open your pages without logging in.
Excellent! Thanks for another speedy reply. Hopefully these are questions that other people might have in the future.
Other agents — I am assuming I can copy and paste that code and change the user-agent to allow other robots. Would you mind showing examples for Yahoo, MSN, and any other popular ones? That would help me out a lot. Thanks.
Spiders are NOT given access... I did some testing on this, and it does not appear to work properly. Using Firefox, I changed my user-agent to see what would happen when trying to access protected content. After uncommenting the line in the script that checks for "googlebot", then using "googlebot" as my user-agent, I receive the following error:

Code:
Warning: Invalid argument supplied for foreach() in /home/practica/public_html/membership/plugins/protect/new_rewrite/new_rewrite.inc.php on line 12
Warning: Cannot modify header information - headers already sent by (output started at /home/practica/public_html/membership/plugins/protect/new_rewrite/new_rewrite.inc.php:12) in /home/practica/public_html/membership/plugins/protect/new_rewrite/login.php on line 37

I then put in another preg_match to check for "Slurp". When I set my user-agent to "Slurp" and try to access protected content, the browser reports that it is stuck in a redirect loop.

Is there something I can do? The protection itself works fine, but I absolutely NEED the spiders to be able to access that protected content. From what I can tell, it looks like the script tries to read SESSION information even when the user-agent is identified as a spider. Then again, I don't know the aMember system that well. Thank you for your help; I really need to get this working.
I got it.... In case it helps anyone: it wasn't working for me because I am also using mod_rewrite to hide dynamic URLs. The fix was to move the conditional code that checks for spider user-agents below the rewrite rule that hides the URLs. Then it worked fine: using Firefox with a changed user-agent, I can access protected content as a spider.
Doesn't this present a huge security problem? Anyone wanting to see the protected content on your site can simply switch the user-agent in their browser and pretend to be a Google spider. Does anyone else have a problem with this? Steve
I haven't seen a problem. You are right, there is a risk. For us, the value of having our protected content spidered by search engines outweighs the small number of people who might figure out what to set their user-agent to in order to gain access. If you really wanted to lock it down, you could make the rewrite rule conditional on the IP addresses each spider crawls from; I am not sure exactly how, but that would let you tie each user-agent to the correct IPs. I have yet to see any abuse of this, but I will post if it becomes an issue.
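For anyone who wants to go that route, here is a minimal PHP sketch (my own, not from the aMember code; the is_verified_googlebot() helper name is hypothetical) that ties a Googlebot user-agent back to Google's network with a reverse-then-forward DNS check:

```php
<?php
// Hypothetical helper (not part of aMember): confirm that a request
// claiming to be Googlebot really resolves into Google's crawler network.
function is_verified_googlebot($ip)
{
    // Reverse lookup: genuine Googlebot IPs resolve to *.googlebot.com
    // or *.google.com; on failure gethostbyaddr() returns the IP itself
    // (or false), which will not match the pattern.
    $host = gethostbyaddr($ip);
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward lookup must point back at the same IP,
    // otherwise the PTR record could simply be forged.
    return gethostbyname($host) === $ip;
}

if (preg_match('/googlebot/i', $_SERVER['HTTP_USER_AGENT'])
    && is_verified_googlebot($_SERVER['REMOTE_ADDR'])) {
    return; // allow the spider through, as the existing check.inc.php line does
}
```

This reverse-then-forward check is the verification method Google itself documents for Googlebot; similar checks can be written for the other major crawlers, each against its own documented hostname suffix.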
Sorry for pulling up an old thread. Let's say you enable this and let Google view your members-only area. Wouldn't Google's cache then give users access to that content? I'm considering doing this.
Are you using a CMS to handle your content, or are you protecting basic HTML? In either case, I think you should only give the spiders "teaser text", that is, a paragraph summarizing the content. David
I have uncommented the code as mentioned above for googlebot. I am mainly targeting Google, Yahoo, and MSN. After doing some research, I believe the Yahoo and MSN crawler names are Yahoo! Slurp and msnbot... So would I add this code below the googlebot code I just uncommented?

if (preg_match('/^yahoo! slurp/i', $_SERVER['HTTP_USER_AGENT'])) return;
if (preg_match('/^msnbot/i', $_SERVER['HTTP_USER_AGENT'])) return;
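Not an official answer, just an observation: several crawlers put their name in the middle of the user-agent string (Yahoo's, for example, begins "Mozilla/5.0 (compatible; Yahoo! Slurp; ...)"), so a ^-anchored pattern like ^yahoo! slurp would never match. A sketch that drops the anchor and folds the three bots into one check:

```php
<?php
// Match the bot name anywhere in the user-agent (case-insensitive),
// since a ^ anchor misses crawlers whose string starts with "Mozilla/5.0".
if (preg_match('/googlebot|slurp|msnbot/i', $_SERVER['HTTP_USER_AGENT'])) {
    return;
}
```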
Just to be clear: showing search bots different text than users see is known as cloaking, which search engines treat as a violation and which could get your site removed from their index.
After doing some more digging around here on this bulletin board, I realized that the solution in this thread won't work for me because I am using the "new_rewrite" method of protection. So I've added this code to the .htaccess file in each of my protected directories. If anyone knows whether I'm doing this right, I would appreciate the help... Under the line that says "RewriteEngine On" I put this snippet:

#allow access for Google AdSense
RewriteCond %{http_user_agent} ^Mediapartners-Google
RewriteRule ^(.*)$ - [L]

#allow access for Google
RewriteCond %{http_user_agent} ^googlebot
RewriteRule ^(.*)$ - [L]

#allow access for Yahoo
RewriteCond %{http_user_agent} ^yahoo! slurp
RewriteRule ^(.*)$ - [L]

#allow access for MSN
RewriteCond %{http_user_agent} ^msnbot
RewriteRule ^(.*)$ - [L]
OK, that broke my website. I started getting this error:

Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
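For what it's worth, the likely culprit is the bare space inside ^yahoo! slurp: a RewriteCond directive takes exactly a test string, a pattern, and optional flags, so Apache parses everything after the space as a bogus flags argument and throws a 500. A corrected sketch (my rewrite, untested against aMember's own rules; note the documented variable name is upper-case %{HTTP_USER_AGENT}):

```apache
RewriteEngine On

# Sketch only: let the listed crawler user-agents through untouched.
# [NC] = case-insensitive match, [OR] = chain conditions together;
# the space in "Yahoo! Slurp" must be quoted (or backslash-escaped),
# otherwise Apache reads "slurp" as a flags argument and errors out.
RewriteCond %{HTTP_USER_AGENT} Mediapartners-Google [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Yahoo! Slurp" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [NC]
RewriteRule ^(.*)$ - [L]
```

Chaining the conditions onto a single pass-through rule also avoids repeating the RewriteRule four times, as in the original snippet.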