Google: A Spammer’s Best Friend
I rarely discuss email spam on this blog. It is off-topic and already too often confused with search engine spam. Nevertheless, as an avid Google user (they pay my bills, I pay theirs), I believe that their search technologies wrongfully empower email spammers in ways that could easily be prevented. Below, I have listed the ways in which Google is a spammer’s best friend.
(1) Masking Spammer’s URLs
Google’s URL redirection method allows spammers to easily mask their URLs so that they appear to come from Google. For example, the URL below appears to be hosted on Google, although it redirects back to TheGoogleCache.
Spammers are already using this to mask redirects to pharma and adult websites. More industrious spammers will soon begin pointing the redirect at subdomains such as http://www.google.com.spammerssite.com/?q=whatever that look and feel exactly like Google. This is a phisher’s dream for gaining access to AdWords, AdSense, Gmail, or other accounts. This gaping hole could easily be closed by associating a random, expiring ticket with each redirect, so that only redirects carrying a valid ticket generated on Google.com would be followed.
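To make the ticket idea concrete, here is a minimal sketch of how it might work. It is not how Google handles redirects. The names `SECRET_KEY`, `TICKET_LIFETIME`, `issue_ticket`, and `ticket_is_valid` are all hypothetical, and instead of storing random tickets on the server it uses a signed, expiring token (an HMAC over the target URL, an expiry time, and a random nonce), which is a stateless variation on the same idea.

```python
import hashlib
import hmac
import secrets
import time

# Hypothetical server-side secret; a real service would load this from config.
SECRET_KEY = secrets.token_bytes(32)
# Arbitrary lifetime chosen for illustration (seconds).
TICKET_LIFETIME = 300


def issue_ticket(target_url, now=None):
    """Create a ticket that is only valid for target_url until it expires."""
    now = int(time.time()) if now is None else int(now)
    expires = now + TICKET_LIFETIME
    nonce = secrets.token_hex(8)  # the random part of the ticket
    message = f"{target_url}|{expires}|{nonce}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{expires}.{nonce}.{signature}"


def ticket_is_valid(target_url, ticket, now=None):
    """Return True only if the ticket was issued for target_url and is unexpired."""
    now = time.time() if now is None else now
    try:
        expires_text, nonce, signature = ticket.split(".")
        expires = int(expires_text)
    except ValueError:
        return False  # malformed ticket
    message = f"{target_url}|{expires}|{nonce}".encode()
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    # Constant-time comparison, then the expiry check.
    return hmac.compare_digest(signature.encode(), expected.encode()) and now < expires
```

Under this sketch, the results page would append a ticket from `issue_ticket` to each outgoing redirect link, and the redirect endpoint would refuse any request for which `ticket_is_valid` returns False. A link pasted into a spam email would stop working once its ticket expired, and a spammer could not mint a ticket for a different URL without the secret.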
(2) Finding Email Addresses
Currently, Google is the only one of the three major search engines (Google, Yahoo, and MSN) to support the asterisk in queries, i.e., “word*word” patterns are matched. This has made it fairly easy to find pages filled with large numbers of email addresses. For example:
“mailto*gmail.com”
“mailto*hotmail.com”
“mailto*yahoo.com”
“mailto*com” unsubscribe inurl:archive
Google, however, has taken only a few actions to reject obviously nefarious queries, such as:
“Powered by PHPBB”
There is no positive outcome, to my knowledge, that could come from queries like the mailto patterns above.
(3) Anonymous Scraping
Google’s choice to cache anything and everything makes scraping particularly easy and anonymous. Scrapers can download copies of pages directly from Google’s cache, leaving the site owner with no record of the IP address that accessed a list. And if a list was added by mistake, Google’s “assume cache” default would leave the data available to spammers as a supplemental result for months after the site had removed the page. If Google’s robots changed their default assumption from Index,Cache to Index,Nocache, scrapers would be forced to go directly to the host’s site to retrieve the data. That would leave their trace across the Internet and make them vulnerable to distributed defenses that function similarly to Akismet or LinkSleeve, or that could even operate at the DNS level.
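Until the default changes, a site owner can opt out of caching page by page with the robots meta tag, using the `noarchive` directive (the real-world counterpart of what I call Nocache above). The sketch below shows roughly how a well-behaved crawler could honor it. The class `RobotsMetaParser` and the function `allows_caching` are hypothetical names for illustration, and this is not a description of Googlebot’s actual logic.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def allows_caching(html):
    """Today's default: cache the page unless it explicitly says noarchive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

For example, `allows_caching('<meta name="robots" content="index, noarchive">')` is False, while a page with no robots meta tag at all is treated as cacheable. The proposal above amounts to flipping that last case, so that a page is cached only when it asks to be.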
(4) Spamvertising
Google’s keyword insertion tool for AdWords has already received heat for allowing child porn advertisers to find their way into Google’s search results. The same is true for spam keywords. Finding email lists (targeted, untargeted, opt-in, or otherwise) in Google’s sponsored ads section is simple. Just search. My personal favorite is:
email list for spamming
This returns a bevy of results offering lists that carry the email addresses of hundreds of thousands of individuals who, I am sure, would not like their addresses bought and sold.
Conclusions
I believe that innovation is always coupled with responsibility. Google has the greatest informational power in the world, and has given us fantastic tools to harness that information to many ends. I believe that in doing this, the search giant must take responsibility for doing its due diligence to make sure those ends are not evil. Isn’t that what “Don’t Be Evil” is about in the long run?
In relation to Google’s cache being exploited by email spammers, you suggest that “This gaping hole could be easily addressed either by associating random expiring tickets with each redirect (so that only redirects with a valid ticket generated on Google.com would be redirected).”
There is a problem with this. A lot of sites point to Google caches of information that is no longer available for whatever reason. This includes ‘gotchas’ where a blog might point to an error on a site, and refer to the Google cache to prevent coverups.
Your proposal might stand in the way of this.