Dec 20, 2009

Iframes, Please Make Way for SEO Poisoning

If a hacker managed to break into your blog or website, what could they possibly do? They could insert malicious iframes or JavaScript code into your Web pages, or even attempt to steal some data. But most likely they would "search engine optimize" your website. Can this be true? Well, let me explain.

Search engine optimization (SEO) is a collection of techniques used to achieve higher search rankings for a given website. "Black hat SEO" is the method of using unethical SEO techniques in order to obtain a higher search ranking. These techniques include things like keyword stuffing, cloaking, and link farming, which are used to "game" the search engine algorithms.

So what does a hacker gain from all this? Why would a hacker help you achieve a higher search engine ranking? Quite the contrary; he is helping himself.

What the hacker actually does is add numerous additional Web pages to your website. Let’s call each of these additional pages "fake" Web pages. Each fake page is based on a popular search topic and has content related to that topic. Most often the content is stolen from legitimate sites and feeds. The hacker also uses topic-related keywords in the URLs of these fake pages. Each of these fake Web pages is added without the website owner’s knowledge or consent. For example, if you own the site example.com, the hacker might add virtual pages such as:

• example.com/?ohio-voting-results
• example.com/?atlanta-mayoral-race-results
• example.com/?dancing-with-the-stars
• example.com/?nicole-narain

All of these fake pages would then redirect to content stolen from some reputable site related to the keyword.
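In practice the fake "pages" are usually all handled by a single server-side script that reads the keyword out of the query string and uses it to select stolen, topic-related content. A minimal sketch of that keyword extraction (the function name and behavior are illustrative, not taken from any real attack kit):

```python
from urllib.parse import urlparse

def keyword_from_fake_url(url):
    """Extract the search keyword encoded in a fake page's query string.

    The attacker's script would use this keyword to pick which stolen,
    topic-related content to serve (illustrative sketch only).
    """
    query = urlparse(url).query          # e.g. "ohio-voting-results"
    return query.replace("-", " ")       # "ohio voting results"

print(keyword_from_fake_url("http://example.com/?ohio-voting-results"))
# prints: ohio voting results
```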


Now, if a legitimate user were to search for one of these keywords, he or she would encounter a reference to this fake Web page in the search engine results. The keywords in the URL, the keywords in the title, and the relevant content would all cause some search engine algorithms to place this Web page high in the results. In other words, this fake Web page has "gamed" the search engine algorithm into believing that it has content relevant to what is being searched for.


But what does the hacker gain from getting a legitimate user to visit this fake Web page? After all, the visitor would simply be reading relevant information based on what was searched for. Well, not really. The hacker changes the Web server's configuration so that a user who arrives at the fake page from a search engine result page is redirected to a fake antivirus or misleading application Web page, which is different from what the search engine spider actually sees. This is known as cloaking.


Cloaking is a black hat SEO technique in which the content presented to the search engine spider is different from that presented to the user's browser. Search engine crawlers spider through links in order to find and index Web pages, so when the spider visits this page, it is presented with relevant information related to the search topic. In fact, the relevant keywords in the URL, title, and content often give this fake Web page a higher ranking in the search engine results. However, when the user visits this fake page from the search engine result page, he or she is redirected to a fake scan Web page.

There are many different ways to achieve cloaking. One popular method is to look at the User-Agent string in the HTTP request: search engine crawlers identify themselves with specific strings in the User-Agent header, so the Web server can serve a different page to the crawler. The Referer header, in turn, can be used to determine whether the user is coming from a search engine result page and, if so, to redirect them to a fake scan website serving misleading applications.
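The cloaking decision just described boils down to a simple classification of each incoming request. The crawler substrings and engine domains below are illustrative assumptions, not taken from any specific campaign:

```python
def classify_request(user_agent, referer):
    """Decide what to serve, mimicking the cloaking logic described above.

    Returns one of:
      "seo-content"      - keyword-stuffed page shown to search engine spiders
      "fake-av-redirect" - redirect served to visitors arriving from a
                           search engine result page
      "normal"           - anything else
    (Crawler substrings and engine domains are illustrative examples.)
    """
    crawler_tokens = ("Googlebot", "Slurp", "msnbot")
    if any(tok in user_agent for tok in crawler_tokens):
        return "seo-content"
    search_engines = ("google.", "search.yahoo.", "bing.")
    if referer and any(se in referer for se in search_engines):
        return "fake-av-redirect"
    return "normal"

print(classify_request("Mozilla/5.0 (compatible; Googlebot/2.1)", None))
# prints: seo-content
```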


So how do these websites get picked up by the search engine crawlers? There are several ways. One can manually submit a website to the search engines. Crawlers also spider through links, so a reference link on another website can get your website crawled and indexed. Additionally, many crawlers use sitemaps provided by website owners to find all the pages on a website. Search engine advertising programs can also be used to get indexed.

Search engines attribute importance to links to a website that exist on other websites. These links are called "backlinks" and indicate the popularity of a website. Backlinks will also get a website crawled and indexed, as well as increase its page rank.

A "link farm" is a group of websites that link to the other websites in that group. Apart from the other factors that contribute to a website gaining a good ranking, backlinks play a vital role, and a link farm provides a website with many backlinks. In fact, there are services such as Link Farm Evolution and SENuke available online that allow for the creation of thousands of backlinks for a website.

Recently we came across a link farm for a group of fake pages that were serving up misleading applications. The link farm allows these fake pages to be indexed and therefore increases their page rankings.

[Screenshot: the link farm in question]

Shown above is a snapshot of the link farm in question. You can see that each link is related to a recent real-world event, and each link ends with a keyword related to that event. All of these pages were created on legitimate websites that were hacked to serve these virtual pages.

Although the tag used to hide these links would prevent them from being visible to the user, they are still visible to search engines. In addition, a normal user may never see these links, even in the HTML source, because this code is only served if the request was made by a search engine crawler. Shown below is a screenshot of another similar campaign:

[Screenshot: a similar link farm campaign]

So, you have now read how black hat SEO techniques are effectively employed to redirect victims to fake antivirus websites from search engine results. The following diagram gives a good visual summary of the typical actions that occur:

[Diagram: the black hat SEO redirection chain]


1. The hacker hacks a site to serve legitimate content to a search spider and malicious content to users.
2. The hacker creates a link farm pointing to the hacked site so that it is picked up by a search spider.
3. A search spider crawls the link farm.
4. The hacked site appears in the search results.
5. A user clicks on the search result link leading to the hacked site, which redirects to the malicious page.

Canonical URL Tag

The announcement from Yahoo!, Live & Google that they will be supporting a new "canonical url tag" to help webmasters and site owners eliminate self-created duplicate content in the index is, in my opinion, the biggest change to SEO best practices since the emergence of Sitemaps. It's rare that we cover search engine announcements or "news items" here on SEOmoz, as this blog is devoted more towards tactics than breaking headlines, but this certainly demands attention and requires quick education.

To help new and experienced SEOs better understand this tag, I've created the following Q+A (please feel free to print, email & share with developers, webmasters and others who need to quickly ramp up on this issue):

How Does it Operate?

The tag is part of the HTML header on a web page, the same section where you'd find the title tag and the meta description tag. In fact, the link tag itself isn't new; like nofollow, this simply uses a new rel value. For example:

<link rel="canonical" href="http://www.seomoz.org/blog" />

This would tell Yahoo!, Live & Google that the page in question should be treated as though it were a copy of the URL www.seomoz.org/blog and that all of the link & content metrics the engines apply should technically flow back to that URL.
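A consumer of this tag, such as an SEO audit script, can read it out of a page's head with the standard library alone. This is a minimal sketch of one way to do it, not how the engines themselves parse pages:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical" and self.canonical is None:
            self.canonical = attrs.get("href")

page = '<head><link rel="canonical" href="http://www.seomoz.org/blog" /></head>'
finder = CanonicalFinder()
finder.feed(page)
print(finder.canonical)  # prints: http://www.seomoz.org/blog
```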


The Canonical URL tag is similar in many ways to a 301 redirect from an SEO perspective. In essence, you're telling the engines that multiple pages should be considered as one (which a 301 does), without actually redirecting visitors to the new URL (often saving your dev staff considerable heartache). There are some differences, though:

  • Whereas a 301 redirect re-points all traffic (bots and human visitors), the Canonical URL tag is just for engines, meaning you can still separately track visitors to the unique URL versions.
  • A 301 is a much stronger signal that multiple pages have a single, canonical source. While the engines are certainly planning to support this new tag and trust the intent of site owners, there will be limitations. Content analysis and other algorithmic metrics will be applied to ensure that a site owner hasn't mistakenly or manipulatively applied the tag, and we certainly expect to see mistaken use of the tag, resulting in the engines maintaining those separate URLs in their indices (meaning site owners would experience the same problems noted below).
  • 301s carry cross-domain functionality, meaning you can redirect a page at domain1.com to domain2.com and carry over those search engine metrics. This is NOT THE CASE with the Canonical URL tag, which operates exclusively on a single root domain (it will carry over across subfolders and subdomains).
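The single-root-domain restriction in the last bullet can be sketched as a validation check. Note the naive "last two host labels" heuristic below is an assumption for illustration; it breaks on ccTLDs like .co.uk, where a real implementation would consult a public-suffix list:

```python
from urllib.parse import urlparse

def canonical_allowed(page_url, canonical_url):
    """Roughly check whether a canonical target stays on the same root
    domain as the page (subdomains and subfolders are fine; other root
    domains are not), per the restriction described above.

    Naive heuristic: compare the last two host labels. Real code should
    use a public-suffix list to handle TLDs like .co.uk correctly.
    """
    def root(url):
        host = urlparse(url).hostname or ""
        return ".".join(host.split(".")[-2:])
    return root(page_url) == root(canonical_url)

print(canonical_allowed("http://test.example.com/page", "http://www.example.com/page"))
# prints: True
print(canonical_allowed("http://example.com/page", "http://yahoo.com/"))
# prints: False
```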

Over time, I expect we'll see more differences, but since this tag is so new, it will be several months before SEOs have amassed good evidence about how this tag's application operates. Previous rollouts like nofollow, sitemaps and webmaster tools platforms have all had modifications in their implementation after launch, and there's no reason to doubt that this will, too.

How, When & Where Should SEOs Use This Tag?

In the past, many sites have encountered issues with multiple versions of the same content on different URLs. This creates three big problems:

  1. Search engines don't know which version(s) to include/exclude from their indices
  2. Search engines don't know whether to direct the link metrics (trust, authority, anchor text, link juice, etc.) to one page, or keep it separated between multiple versions
  3. Search engines don't know which version(s) to rank for query results

When this happens, site owners suffer rankings and traffic losses and engines suffer lowered relevancy. Thus, in order to fix these problems, we, as SEOs and webmasters, can start applying the new Canonical URL tag whenever any of the following scenarios arise:

• Canonical URL issues for categories

• Canonical URLs for print versions

• Canonical URLs for session IDs

While these examples above represent some common applications, there are certainly others, and in many cases, they'll be very unique to each site. Talk with your internal SEOs or SEO consultants to help determine whether, how & where to apply this tag.
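For the session-ID scenario above, for instance, the canonical URL can be derived by stripping the session and tracking parameters before emitting the tag on each duplicate variant. The parameter names below are just common examples, not a definitive list:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Session/tracking parameters to strip (illustrative examples).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source"}

def canonical_url(url):
    """Return the URL with session/tracking parameters removed, suitable
    for use as the rel="canonical" target on all its duplicate variants."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(canonical_url("http://example.com/item?id=42&sessionid=abc123"))
# prints: http://example.com/item?id=42
```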

What Information Have the Engines Provided About the Canonical URL Tag?

Quite a bit, actually. Check out a few important quotes from Google:

Is rel="canonical" a hint or a directive?
It's a hint that we honor strongly. We'll take your preference into account, in conjunction with other signals, when calculating the most relevant page to display in search results.

Can I use a relative path to specify the canonical?
Yes, relative paths are recognized as expected with the tag. Also, if you include a base link in your document, relative paths will resolve according to the base URL.

Is it okay if the canonical is not an exact duplicate of the content?
We allow slight differences, e.g., in the sort order of a table of products. We also recognize that we may crawl the canonical and the duplicate pages at different points in time, so we may occasionally see different versions of your content. All of that is okay with us.

What if the rel="canonical" returns a 404?
We'll continue to index your content and use a heuristic to find a canonical, but we recommend that you specify existent URLs as canonicals.

What if the rel="canonical" hasn't yet been indexed?
Like all public content on the web, we strive to discover and crawl a designated canonical URL quickly. As soon as we index it, we'll immediately reconsider the rel="canonical" hint.

Can rel="canonical" be a redirect?
Yes, you can specify a URL that redirects as a canonical URL. Google will then process the redirect as usual and try to index it.

What if I have contradictory rel="canonical" designations?
Our algorithm is lenient: We can follow canonical chains, but we strongly recommend that you update links to point to a single canonical page to ensure optimal canonicalization results.

from Yahoo!:

• The URL paths in the tag can be absolute or relative, though we recommend using absolute paths to avoid any chance of errors.

• A tag can only point to a canonical URL form within the same domain and not across domains. For example, a tag on http://test.example.com can point to a URL on http://www.example.com but not on http://yahoo.com or any other domain.

• The tag will be treated similarly to a 301 redirect, in terms of transferring link references and other effects to the canonical form of the page.

• We will use the tag information as provided, but we’ll also use algorithmic mechanisms to avoid situations where we think the tag was not used as intended. For example, if the canonical form is non-existent, returns an error or a 404, or if the content on the source and target was substantially distinct and unique, the canonical link may be considered erroneous and deferred.

• The tag is transitive. That is, if URL A marks B as canonical, and B marks C as canonical, we’ll treat C as canonical for both A and B, though we will break infinite chains and other issues.

and from Live/MSN:

  • This tag will be interpreted as a hint by Live Search, not as a command. We'll evaluate this in the context of all the other information we know about the website and try and make the best determination of the canonical URL. This will help us handle any potential implementation errors or abuse of this tag.
  • You can use relative or absolute URLs in the “href” attribute of the link tag.
  • The page and the URL in the “href” attribute must be on the same domain. For example, if the page is found on “http://mysite.com/default.aspx”, and the ”href” attribute in the link tag points to “http://mysite2.com”, the tag will be invalid and ignored.
    • However, the “href” attribute can point to a different subdomain. For example, if the page is found on “http://mysite.com/default.aspx” and the “href” attribute in the link tag points to “http://www.mysite.com”, the tag will be considered valid.
  • Live Search expects to implement support for this feature sometime in the near future.
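Yahoo!'s transitivity rule above (A marks B, B marks C, so C is canonical for both) together with the "break infinite chains" caveat can be sketched as a small chain resolver. The hop limit is an illustrative assumption, not a documented engine behavior:

```python
def resolve_canonical(url, canonical_of, max_hops=10):
    """Follow a chain of rel="canonical" designations to its end.

    canonical_of maps each URL to its declared canonical target; URLs
    that declare none are simply absent. Cycles and over-long chains
    are broken by returning the last URL reached, mirroring the
    behavior the engines describe. The hop limit is an assumption.
    """
    seen = {url}
    for _ in range(max_hops):
        target = canonical_of.get(url)
        if target is None or target in seen:  # chain ends, or a cycle
            return url
        seen.add(target)
        url = target
    return url

print(resolve_canonical("A", {"A": "B", "B": "C"}))  # prints: C
print(resolve_canonical("A", {"A": "B", "B": "A"}))  # prints: B
```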

What Questions Still Linger?

A few things remain somewhat murky around the Canonical URL tag's features and results. These include:

  • The degree to which the tag will be trusted by the various engines - will it only work if the content is 100% duplicate 100% of the time? Is there some flexibility on the content differences? How much?
  • Will this pass 100% of the link juice from a given page to another? More or less than a 301 redirect does now? Note that Google's official representative from the web spam team, Matt Cutts, said today that it passes link juice akin to a 301 redirect but also noted (when SEOmoz's own Gillian Muessig asked specifically) that "it loses no more juice than a 301," which suggests that there is some fractional loss when either of these are applied.
  • The extent of the tag's application on non-English language versions of the engines. Will different levels of content/duplicate analysis and country/language-specific issues apply?
  • Will the engines all treat this in precisely the same fashion? This seems unlikely, as they'd need to share content/link analysis algorithms to do that. Expect anecdotal (and possibly statistical) data in the future suggesting that there are disparities in interpretation between the engines.
  • Yahoo! strongly recommends using absolute paths for this (and, although we've yet to implement it, SEOmoz does as well, based on potential pitfalls with relative URLs), but the other engines are more agnostic - we'll see what the standard recommendations become.
  • Yahoo! also mentions the properties are transitive (which is great news for anyone who's had to do multiple URL re-architectures over time), but it's not clear whether the other engines support this.
  • Live/MSN appears to have not yet implemented support for the tag, so we'll see when they formally begin adoption.
  • Are the engines OK with SEOs applying this for affiliate links to help re-route link juice? We'd heard at SMX East from a panel of engineers that using 301s for this was OK, so I'm assuming it is, but many SEOs are still skeptical as to whether the engines consider affiliate links as natural or not.
