A friend of mine is driving calmly down the road, minding his own business... Suddenly... Bang! He's hit from behind...
The car ends up in the ditch... My friend is trying to get the door open... A girl from the car behind walks up and asks, "Are you alive?" He answers, "Seems so"... The girl got back in her car and drove off...
And so I happily started looking up familiar surnames from my address book. I found plenty of mutual acquaintances and many communities whose names I couldn't even remember. And then, leaning back in my chair once more to take a sip of tea, I glanced at the screen and suddenly saw....
A DOSSIER on myself
How to protect your unique content from theft. Prove to Google that your content is the original source.
How to Defend your Website from the Google Duplicate Proxy Exploit
By Sophie White (c) 2007
There is a current and active way to knock a website out of
Google's search engine results. It's simple and effective. This
information is already in the public domain, and the more people
that know about it, the more likely it is that Google
will do something about it. This article will tell you how it
works, how to get a website knocked out of the search engine
rankings, but most importantly, how to defend your own website
from having it happen to you.
To understand this exploit, you must first understand
Google's Duplicate Content filter. It's simply described thus:
Google doesn't want you to search for "blue widget" and have the
top 10 search results all be copies of the same article on how
great blue widgets are. They want to give you ONE copy of the
Great Blue Widget article, and 9 other different results, just
on the off chance that you've already read that article and the
other results are actually what you wanted.
To handle this, every time Google spiders and indexes a page, it
checks it to see if it's already got a page that is predominantly
the same, a duplicate page if you will. Nobody outside Google
knows exactly how it works this out, but it is likely to be a
combination of some or all of: page text length, page title,
headings, keyword densities, checks for exactly copied sentence
fragments etc. As a result of this duplicate content filter,
a whole industry has grown up around trying to get round the
filter. Just search for "spin article".
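To make "checking for copied sentence fragments" concrete, here
is a minimal sketch in Python. It is NOT Google's algorithm,
which isn't public. It simply cuts each page into overlapping
5-word fragments and measures how many fragments two pages share.
The function names, the fragment size and the 0.8 threshold are
all assumptions made up for illustration.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single fragment (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=5):
    """Share of fragments the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty pages are trivially identical
    return len(sa & sb) / len(sa | sb)


def is_duplicate(a, b, threshold=0.8):
    """Treat two pages as duplicates when most of their fragments match."""
    return similarity(a, b) >= threshold
```

A straight copy of a page scores 1.0 here. A "spun" article tries
to change enough words to push that score below whatever
threshold the filter really uses.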
Getting back to the story here, Google indexes a page and let's
say it fails its duplicate content check. What does Google do?
These days, it dumps that duplicate page in Google's Supplemental
Index. What, you didn't know that Google has 2 indexes? Well
they do: the main one, and a supplemental one. Two things are
important here: Google will always return results from their
Main index if they can; and they will only go to the Supplemental
index if they don't get enough joy from their main index. What
this means is that if your page is in the supplemental index,
it's almost certain that you will never show up in the Search
Engine Ranking Pages, unless there is next to no competition for
the phrase that was searched for.
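The two rules above can be sketched as a toy lookup in Python.
This is only a model of the behaviour described here, not how
Google actually stores or queries its indexes. The list-of-dicts
layout and the names `search`, `main_index` and
`supplemental_index` are assumptions for illustration.

```python
def search(query, main_index, supplemental_index, wanted=10):
    """Return up to `wanted` matching pages, preferring the main index."""
    results = [page for page in main_index if query in page["text"]]
    if len(results) < wanted:
        # Fall back to the supplemental index only when the main
        # index cannot fill the results page on its own.
        extra = [page for page in supplemental_index if query in page["text"]]
        results.extend(extra[: wanted - len(results)])
    return results[:wanted]
```

With enough matches in the main index, a page in the supplemental
index is never reached, however relevant it is.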
This all seems pretty reasonable to me, so what's the problem?
Well there's another little step I haven't mentioned yet. What
happens if someone copies your page, let's say the homepage of
your business website, and Google indexes that copy and
correctly determines that it's a duplicate? Now that Google knows
about 2 pages that are duplicates, it has to decide
which to dump in the supplemental index, and which to keep in
the main one. That's pretty obvious, right? But how does Google
know which is the original and which is the copy? They can't
know for certain. Sure they have some clever algorithms to work
it out, but even if they are 99% accurate, that leaves a lot of
problems for the 1% of the time that they get it wrong!
And this is the heart of the exploit: if someone copies your
website's homepage, say, and manages to convince Google that
*their* page is the original, your homepage will get tossed into
the supplemental index, never to see the light of day in the
Search Engine Ranking Pages again. In case I'm not being clear
enough, that's bad! But wait, it gets worse:
It's fair to say that in the case of a person physically copying
your page and hosting it, you can often get them to take it down
through the use of copyright lawyers, and cease and desist
letters to ISPs and the like, followed by a quick "Reinclusion
Request" to Google. But recently there's a new threat that's a
whole lot
harder to stop: the use of publicly accessible Proxy websites.
(If you don't know what a Proxy is, it's basically a server that
fetches web pages on behalf of its users, making the web run
faster by caching content closer to the people requesting it. In
principle, they are generally a good thing.)
There are many such web proxies out there, and I won't list any
here. However, I will describe the process: they send out spiders
(much like Google's) and they spider your page, take your
content, then they host a copy of your website on their proxy
site, nominally so that when their users request your page, they
can serve up their local copy quickly rather than having to
retrieve it from your server. The big issue is that Google can
sometimes decide that the proxy copy of your web page is the
original, and yours is not.
Worse again, there's some evidence that people are deliberately
and maliciously using proxy servers to cache copies of web
pages, then using normal (white and black hat) Search Engine
Optimization (SEO) techniques to make those proxy pages rank in
the search engines, increasing the likelihood that your legitimate
page will be the one dumped by the search engines' duplicate
content filters. Danger, Will Robinson!
Even worse still, some of the proxy spiders actively spoof
their origins so that you don't realise that it's a spider from
a proxy: they pretend to be Googlebot, for example, or a spider
from Yahoo. This is why the major search engines actively publish
guidelines on how to identify and validate their own spiders.
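The check those guidelines generally describe is a two-way DNS
lookup: resolve the visiting IP address to a hostname, confirm
the hostname belongs to the search engine, then resolve that
hostname back and confirm it returns the same IP. Here is a
minimal sketch in Python. The function names are made up for
illustration, and the list of hostname suffixes is an assumption
that you should check against each search engine's current
documentation.

```python
import socket

# Assumed hostname suffixes; confirm them in each engine's own documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com",
                    ".crawl.yahoo.net", ".search.msn.com")


def reverse_lookup(ip):
    """IP address -> hostname, or None when there is no reverse DNS record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Hostname -> list of IPv4 addresses, empty when it does not resolve."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_spider(ip, reverse=reverse_lookup, forward=forward_lookup):
    """True only if the IP maps to a trusted hostname that maps back to it."""
    hostname = reverse(ip)
    if not hostname or not hostname.lower().endswith(TRUSTED_SUFFIXES):
        return False
    # The forward lookup catches a spoofer who sets fake reverse DNS
    # on an IP range they control.
    return ip in forward(hostname)
```

Note that the User-Agent string plays no part in this check,
since anyone can fake it. The `reverse` and `forward` parameters
exist only so the logic can be tried out without touching real
DNS.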
Now for the big question: how can you defend against this?
There are several possible solutions, depending on your web
hosting technology and technical competence:
Option 1 - If you are running Apache and PHP on your
server, you can set the webhost up to check for visitors
that purport to be spiders from the main search
engines, and using PHP and the .htaccess file, you can
block proxies from other sources. However, this only works
for proxies that are playing by the rules and identifying
themselves correctly.
Option 2 - If you are using MS Windows and IIS on your
server, or if you are on a shared hosting solution that
doesn't give you the ability to do anything clever, it's an
awful lot harder and you should take the advice of a
professional on how to defend yourself from this kind of
attack.
Option 3 - This is currently the best solution available, and
applies if you are running a PHP or ASP based website: you
set ALL pages' robots meta tags to noindex and nofollow, then
you implement a PHP or ASP script on each page that checks
for valid spiders from the major search engines, and if the
visitor is one, resets the robots meta tags to index and
follow. The important distinction here is that it's easier to
validate a real spider, and to discount a spider that's trying
to spoof you, than it is to identify every proxy, because the
major search engines publish processes and procedures to do
this, including IP lookups and the like.
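The logic of Option 3 fits in a few lines. The sketch below is
in Python rather than PHP or ASP, purely for illustration.
`robots_meta_tag` is a made-up name, and `is_verified_spider` is
a hypothetical function you supply yourself that carries out the
search engines' published validation for a given IP address.

```python
def robots_meta_tag(visitor_ip, is_verified_spider):
    """Build the robots meta tag for one request; refuse indexing by default."""
    if is_verified_spider(visitor_ip):
        # Only a validated search engine spider is invited to index the page.
        content = "index, follow"
    else:
        # Everyone else, proxies and spoofers included, is told to stay out.
        content = "noindex, nofollow"
    return f'<meta name="robots" content="{content}">'
```

The point of the design is the default: a proxy that copies your
page copies the noindex tag along with it, so a search engine
that later crawls the proxy's copy is told not to index it.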
So, stay aware, stay knowledgeable, and stay protected.
And if you see that you've suddenly been dumped from the
Search Engine Ranking Pages, now you might know why, how
and what to do about it.