Today is my birthday! Yes, somehow I made it through another year, and I wanted to celebrate with you by talking about robots.txt files.
No wait, you didn’t actually think I was sitting here typing this for you this morning, did you? YECK no, I’m taking the day off. What’s the point of being your own boss if you can’t ditch work once in a while? Who am I going to get in trouble with?
Actually, some really close friends flew into town yesterday to celebrate with me, and boy are we going to party tonight! We always have a great time when we get together, and the question of the night is usually just how much trouble we can get into by 2 AM. And I need this; I’ve been really down about a personal matter, so a night of partying where anything (and I mean ANYTHING) goes with just about anyone is exactly what I need right now!
BUT for the rest of you, get back to work; there are things to be done. How’s your robots.txt file working for you these days? Do you know what it is, or even whether you have one? It’s a handy little (you guessed it) text file which you place in the root directory of your website. It uses the Robots Exclusion Protocol, a standard that lets us instruct search engine spiders, crawlers and other robots which parts of our site we do NOT want them to crawl. You see, by default and without you doing anything, what you put on the web will eventually be found and indexed. But what if you don’t want it to be?
Let’s think about that for a minute: why would anyone put something on the web and NOT want it to be found? Sounds a bit counterintuitive, right? Well, if you’re following this blog then I have to assume you are an internet marketer, right? And if you’re an internet marketer, then you probably have some free giveaway for your opt-in list hidden somewhere, or maybe you’re just starting out and don’t have the money to set up a proper, secure sales funnel and member area, so you shoved your very first WSO into a directory on your website. Well, you certainly don’t want those two directories indexed and out there for the online world to see, right? Why? Because people would grab your freebie without giving you their email address, or get your WSO without paying you for it.
Now I don’t want you to be confused here. A robots.txt file is a guide only, and most good little bots that are able to read it will try their best to follow your wishes. That means if you tell Googlebot to keep its inquisitive little nose out of your directory called /universekeys/, it will stay out of it, and out of any subdirectories it contains as well. But let’s face it, there are many robots out there, and A LOT of them are not on the good guys’ side.
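If you want to see that behavior for yourself, Python’s built-in urllib.robotparser module reads robots.txt rules the same way a well-behaved bot would. This is just a quick sketch; example.com and the file paths are made up for illustration.

```python
from urllib import robotparser

# Feed the parser the same rules a bot would fetch from robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /universekeys/",
])

# The directory and everything beneath it are off limits...
print(rp.can_fetch("Googlebot", "https://example.com/universekeys/freebie.pdf"))    # False
print(rp.can_fetch("Googlebot", "https://example.com/universekeys/sub/page.html"))  # False

# ...but the rest of the site is fair game.
print(rp.can_fetch("Googlebot", "https://example.com/blog/post.html"))              # True
```

Handy for sanity-checking your own file before the real bots come calling.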
There are plenty of dangerous and malicious robots roaming your site right now and completely ignoring your robots.txt instructions. So if you are using this file as your only means of security, then I have to insert a HUGE uh-oh here. You should be using your .htaccess file for security measures like this, and if you’re running Apache on a UNIX-based server, like most of us out there, then you have one of these files as well. I’ll do another post on that in the future with some tweaks to help protect your site from getting hacked, ’cause that’s no fun!
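Since .htaccess deserves its own post, I’ll keep it to one little teaser here. A single line like the one below (standard Apache syntax, assuming your host allows .htaccess overrides) stops Apache from showing a browsable file listing when someone wanders straight into a directory:

```apache
# Turn off the automatic directory listing so visitors
# can't browse the files sitting in this folder.
Options -Indexes
```

Drop that in the directory you want to protect and snoopers hit a Forbidden page instead of a menu of your files.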
Open up Notepad and save the file as robots.txt. Use that exact spelling, and it’s case sensitive, guys, so no caps please! Here is a small snippet of the robots.txt file from one of my websites in the health and fitness niche.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
The User-agent line specifies which bot, if any in particular, you are ‘talking’ to. Me, I don’t discriminate, so using User-agent: * means I am talking to all of them. On this particular site I want everyone to follow the same rules. Disallow: /directory/ names a directory you do NOT want them in, and you can specify particular rules for particular bots as well:
The following code tells all robots to keep out of these 3 directories on my site, but specifically tells the Google, Yahoo and MSN robots that it’s OK for them to go take a peek.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/

User-agent: Googlebot
User-agent: Yahoo-MMCrawler
User-agent: msnbot
Disallow:
This next version is basically the same code, but it uses a separate record for each ‘allow’ statement. In the programming world there are many ways to write code that accomplishes the exact same thing and gives you the same outcome; HOWEVER, the goal when coding should always be the version that is most efficient to run and fastest to deliver, which in this case is the grouped code above.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/

User-agent: Googlebot
Disallow:

User-agent: Yahoo-MMCrawler
Disallow:

User-agent: msnbot
Disallow:
Now, be careful with that trailing slash. Disallow: / means disallow EVERYTHING; a bare Disallow: means ALLOW EVERYTHING. A couple of months ago a friend of mine came to me frustrated with this exact problem. His site had been up for 3 months and nothing was indexing; he thought Panda had gotten the better of him. I took a peek at this file on his site, and sure enough he had written Disallow: /, telling everyone to index NOTHING. I usually don’t use a bare Disallow: to mean ‘allow’; it’s too confusing and redundant. Remember, if you say nothing at all then you will be indexed no matter what. I only use the file for directories I don’t want anyone in; that’s a good rule of thumb for you to follow too.
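That trailing-slash difference is easy to prove with the same urllib.robotparser trick from earlier (example.com is again just a stand-in):

```python
from urllib import robotparser

# "Disallow: /" blocks the whole site.
block_all = robotparser.RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

# A bare "Disallow:" blocks nothing at all.
allow_all = robotparser.RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

print(block_all.can_fetch("Googlebot", "https://example.com/index.html"))  # False
print(allow_all.can_fetch("Googlebot", "https://example.com/index.html"))  # True
```

One character is the difference between an invisible site and a fully indexed one, which is exactly what bit my friend.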
Let me know your thoughts in the comments below, I’m off to party my booty off! And I just have to give a HUGE WOOP WOOP for Big Blue who won their second game this season making this JERSEY GIRL quite happy. GO GIANTS!