This company may harm your Internet
Google and its "safe browsing" database
February 3, 2009
About 1.2 percent of all scrapes of Google's results done by
Scroogle show at least one "safe browsing" interception
by Google. This is consistent with the 1.3 percent figure in
Google's own report, "All Your iFRAMEs Point to Us," dated
February 4, 2008.
Google handles these interceptions by prefacing the link in its
search results with www.google.com/interstitial, which sends the
searcher to a Google page with more information.
On Google's results page itself, it identifies such listings with
the words, "This site may harm your computer."
Scroogle has always respected this format for such links, and now we
also show "Google intercepts this link" next to these search results.
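As a rough illustration of the format described above, the following Python sketch checks whether a result link points at www.google.com/interstitial and, if so, appends the note quoted above. The text does not say how the original address is carried inside the interstitial link, so the query parameter name `url` used here is an assumption, and the function names are hypothetical. This is not Scroogle's actual code.

```python
from urllib.parse import urlsplit, parse_qs

INTERSTITIAL_HOST = "www.google.com"
INTERSTITIAL_PATH = "/interstitial"


def _split(link):
    # Links in this article are written without a scheme, so add one if missing.
    if not link.lower().startswith(("http://", "https://")):
        link = "http://" + link
    return urlsplit(link)


def is_intercepted(link):
    """Return True if the link goes through Google's interstitial page."""
    parts = _split(link)
    return parts.hostname == INTERSTITIAL_HOST and parts.path == INTERSTITIAL_PATH


def intercepted_target(link, param="url"):
    """Return the wrapped address, or None.

    ASSUMPTION: the target is carried in a query parameter named `url`;
    the article does not specify this.
    """
    if not is_intercepted(link):
        return None
    values = parse_qs(_split(link).query).get(param)
    return values[0] if values else None


def label(link):
    """Append the note quoted in the article to intercepted links only."""
    if is_intercepted(link):
        return link + "  [Google intercepts this link]"
    return link
```

Keeping detection separate from target extraction means the link can be labeled and passed through unchanged, which matches the choice described here of respecting the interstitial rather than stripping it.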
We could have ignored these links and stripped the interstitial
from the URL, on the grounds that there is not a high correlation
between Google's "safe browsing" database and other similar
databases. It really depends on how Google chooses to define the
word "unsafe," just as the ranking order of any search engine
depends on how it defines the word "relevant." There are too many
variables in how the user's browser interacts with the web, and in
the numerous techniques available to webmasters, advertisers, and
con artists.
How good is Google's quality control with this database? On January
31, 2009, Google accidentally labeled every single link as "harmful"
for nearly an hour. That raised a lot of eyebrows. Later we saw an
item at Stopbadware.org
that was posted by manager Maxim Weinstein on January 20, 2009. He
said that Google was now reporting 183,000 badware sites, whereas
it was 145,000 "a couple months ago." He had no idea why:
"Google has been known to tweak its systems, sometimes leading to a
significant increase or decrease of reported hosts without any
change in external conditions."
For anyone like us who has had close experience with Google's
search-engine rankings since 2001 or so, it's no secret that
massive fluctuations can occur even when everything external to
Google is fairly stable. Now we know that this is also true of the
"safe browsing" database.
If Google marks a site as one that "may harm your computer," at
least you can get details on the reasons behind their analysis at
www.google.com/safebrowsing/diagnostic?site=www.example.com
(replace www.example.com with the address of the target site). And
Google, in conjunction with Stopbadware.org, also has a procedure
that allows webmasters to appeal and get their site unlisted.
That's the good news.
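For checking more than a handful of sites, the diagnostic address can be generated rather than typed. The sketch below simply fills in the `site` parameter of the address given above. The `http://` scheme is an assumption (the text gives the address without one), and `diagnostic_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Address as given in the article; the http:// scheme is an assumption.
DIAGNOSTIC_BASE = "http://www.google.com/safebrowsing/diagnostic"


def diagnostic_url(site):
    """Build the diagnostic page address for one target site."""
    return DIAGNOSTIC_BASE + "?" + urlencode({"site": site})


if __name__ == "__main__":
    print(diagnostic_url("www.example.com"))
```

Using `urlencode` rather than string concatenation keeps the address valid if the target contains characters that need escaping.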
The bad news is that the quality control issue is bothersome.
Google also seems a bit stingy with its database. It's easy to find
out about a site if you already have one in mind, but as far as we
can determine, it's not easy to get a useful random sample of
Google's current listing of malware sites. This makes it difficult
for independent researchers to evaluate Google's quality control.
Our guess is that Google doesn't want competitors to grab major
chunks of its "safe browsing" database. Someday it might be
profitable to license access to this data to other companies.
On an issue as important as malware, the public interest demands
that this information be shared openly, and evaluated openly by
independent researchers. If Google disagrees with this notion, then
it might be better if they discontinue their research. Their
motives are suspicious. Why should they bother with all this
research on harmful sites, when adding the option to disable
JavaScript in the Chrome browser — an option that's always
been in all other browsers — would probably do more for safe
browsing in the long run than their entire database will ever do?
For these reasons, Scroogle is recording the first instance of an
interstitial URL on every search that produces one, and is making
this data available for download. We are doing this in small chunks
of one hour at a time. If you want to build up a list for your own
use, you can download this page once per hour around the clock. At
the same minute of
each hour our latest list overwrites the previous list, and we do
not archive old lists. This keeps our data current, and it's also
the easiest way for us to do this. Anyone who is experienced enough
to handle serious malware research will find it trivial to automate
these small downloads and combine, sort, purge duplicates, and parse
the URLs into something useful.
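The combine, sort, and purge steps might look something like the sketch below. It assumes each hourly download has already been saved locally and contains one URL per line. The actual layout of the list is not described here, so that format is an assumption, and the function names and the sample addresses are made up for illustration.

```python
from urllib.parse import urlsplit


def merge_snapshots(snapshots):
    """Combine hourly snapshots into one sorted list without duplicates.

    `snapshots` is an iterable of iterables of lines, for example open
    file objects. ASSUMPTION: one URL per line.
    """
    seen = set()
    for lines in snapshots:
        for line in lines:
            url = line.strip()
            if url:  # skip blank lines
                seen.add(url)
    return sorted(seen)


def hosts(urls):
    """Reduce a list of URLs to a sorted list of distinct host names."""
    result = set()
    for url in urls:
        if not url.lower().startswith(("http://", "https://")):
            url = "http://" + url
        host = urlsplit(url).hostname
        if host:
            result.add(host)
    return sorted(result)


if __name__ == "__main__":
    # Made-up sample data standing in for two saved hourly lists.
    hour1 = ["http://bad.example/a\n", "http://evil.example/x\n"]
    hour2 = ["http://evil.example/x\n", "\n", "http://bad.example/b\n"]
    merged = merge_snapshots([hour1, hour2])
    print(merged)
    print(hosts(merged))
```

Because each hourly list overwrites the previous one, the saved copies are the only archive, so merging works from local files rather than from the live page.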