1. Crawling the Web
Search engines run automated programs, called "bots" or "spiders", that use the
hyperlink structure of the web to "crawl" the pages and documents that make up
the World Wide Web. Estimates are that of the approximately 20 billion existing
pages, search engines have crawled between 8 and 10 billion.
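As a rough illustration of crawling by following hyperlinks, the sketch below walks a tiny in-memory "web" breadth-first. The PAGES dictionary, its URLs, and the function names are hypothetical; a real spider would fetch pages over HTTP and handle far more edge cases.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start_url):
    """Breadth-first crawl over `pages`, a dict of URL -> HTML source.

    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # broken link: there is nothing to read
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site: "/orphan" exists but no page links to it.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/about">About</a> <a href="/missing">Gone</a>',
    "/orphan": "<p>No page links here.</p>",
}

print(crawl(PAGES, "/"))
```

In this sketch the page with no inbound links is never visited, because the crawler only learns about pages through links it has already followed.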
2. Indexing Documents
Once a page has been crawled, its contents can be "indexed" - stored in a giant
database of documents that makes up a search engine's "index". This index needs
to be tightly managed so that requests which must search and sort billions of
documents can be completed in fractions of a second.
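One common way to make such lookups fast is an inverted index, which maps each term to the documents (and positions) where it occurs. This guide does not say how any particular engine stores its index, so the following is only a minimal sketch with hypothetical documents and function names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def documents_with_all(index, terms):
    """Return the sorted IDs of documents that contain every term."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical documents, keyed by an arbitrary ID.
DOCS = {
    1: "Car and Driver magazine",
    2: "A magazine for the careful driver",
    3: "Used car prices",
}

index = build_index(DOCS)
print(documents_with_all(index, ["car", "driver"]))
```

Because each term already points at its documents, a query only touches the entries for its own terms instead of scanning every document.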
3. Processing Queries
When a request for information comes into the search engine (hundreds of
millions do each day), the engine retrieves from its index all the documents
that match the query. A match is determined if the terms or phrase are found on
the page in the manner specified by the user. For example, a search for car and
driver magazine at Google returns 8.25 million results, but a search for the
same phrase in quotes ("car and driver magazine") returns only 166 thousand
results. In the first search, commonly called "Findall" mode, Google returned
all documents which had the terms "car", "driver", and "magazine" (it ignores
the term "and" because it's not useful for narrowing the results), while in the
second search, only those pages with the exact phrase "car and driver magazine"
were returned. Other advanced operators (Google has a list of 11) can change
which results a search engine will consider a match for a given query.
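The difference between the two searches can be sketched in a few lines. The stopword list, sample texts, and function names below are assumptions made for illustration; they are not Google's actual matching rules.

```python
import re

# Assumption: "and" is the only ignored term, as in the example above.
STOPWORDS = {"and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_terms(query, text):
    """'Findall'-style match: every non-ignored query term appears somewhere."""
    terms = set(tokenize(query)) - STOPWORDS
    return terms <= set(tokenize(text))


def matches_phrase(phrase, text):
    """Exact-phrase match: the words appear consecutively and in order."""
    words = tokenize(phrase)
    tokens = tokenize(text)
    if not words:
        return False
    width = len(words)
    return any(
        tokens[start:start + width] == words
        for start in range(len(tokens) - width + 1)
    )


QUERY = "car and driver magazine"
REVIEW = "Car and Driver magazine reviews the new sedan"
AD = "A magazine for every driver who loves a fast car"

print(matches_all_terms(QUERY, AD), matches_phrase(QUERY, AD))
```

The second sample text contains all three terms but not the phrase, which is why the quoted search returns far fewer results.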
4. Ranking Results
Once the search engine has determined which results are a match for the query,
the engine's algorithm (a mathematical equation commonly used for sorting) runs
calculations on each of the results to determine which is most relevant to the
given query. The engine then lists these on the results pages in order from most
relevant to least so that users can choose which to select.
Although this list of a search engine's operations is not particularly lengthy,
systems like Google, Yahoo!, AskJeeves, and MSN are among the most complex,
processing-intensive computing systems in the world, managing millions of
calculations each second and funneling demands for information to an enormous
group of users.
Speed Bumps & Walls
Certain types of navigation may hinder or entirely prevent search engines from
reaching your website's content. As search engine spiders crawl the web, they
rely on the architecture of hyperlinks to find new documents and revisit those
that may have changed. In the analogy of speed bumps and walls, complex links
and deep site structures with little unique content may serve as "bumps." Data
that cannot be accessed by spiderable links qualifies as "walls."
Possible "Speed Bumps" for SE Spiders:
URLs with 2+ dynamic parameters; e.g. http://www.url.com/page.php?id=4&CK=34rr&User=%Tom%
(spiders may be reluctant to crawl complex URLs like this because they often
result in errors with non-human visitors)
Pages with more than 100 unique links to other pages on the site (spiders may
not follow each one)
Pages buried more than 3 clicks/links from the home page of a website (unless
there are many other external links pointing to the site, spiders will often
ignore deep pages)
Pages requiring a "Session ID" or Cookie to enable navigation (spiders may not
be able to retain these elements as a browser user can)
Pages that are split into "frames" can hinder crawling and cause confusion about
which pages to rank in the results.
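Several of these speed bumps can be checked mechanically. The sketch below applies the thresholds from the list above to a single page; the function name, the warning labels, and the set of session parameter names are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical parameter names that often carry a session ID.
SESSION_PARAM_NAMES = {"sid", "sessionid", "phpsessid", "jsessionid"}


def speed_bumps(url, links_on_page, clicks_from_home):
    """Return the 'speed bump' warnings that apply to one page."""
    warnings = []
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if len(params) >= 2:
        warnings.append("2+ dynamic parameters")
    if any(name.lower() in SESSION_PARAM_NAMES for name, _ in params):
        warnings.append("session ID in URL")
    if links_on_page > 100:
        warnings.append("more than 100 links")
    if clicks_from_home > 3:
        warnings.append("more than 3 clicks from home")
    return warnings


print(speed_bumps(
    "http://www.url.com/page.php?id=4&CK=34rr&User=%Tom%",
    links_on_page=20,
    clicks_from_home=1,
))
```

Frames and cookie requirements are not covered here, since detecting them needs the page source and server behavior rather than the URL alone.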
Possible "Walls" for SE Spiders:
Pages accessible only via a select form and submit button
Pages requiring a drop-down menu (an HTML form element) to access them
Documents accessible only via a search box
Documents blocked purposefully (via a robots meta tag or robots.txt file)
Pages requiring a login
Pages that re-direct before showing content (search engines call this cloaking
or bait-and-switch and may actually ban sites that use this tactic)
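Of these walls, the robots.txt block is the easiest to test for yourself. The sketch below uses Python's standard urllib.robotparser module against a hypothetical robots.txt; the rules, the example.com URLs, and the "ExampleBot" user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every spider from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def is_crawlable(rules, url, user_agent="ExampleBot"):
    """Return True if the rules allow `user_agent` to fetch `url`."""
    return rules.can_fetch(user_agent, url)


rules = load_rules(ROBOTS_TXT)
print(is_crawlable(rules, "http://www.example.com/index.html"))
print(is_crawlable(rules, "http://www.example.com/private/report.html"))
```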
The key to ensuring that a site's contents are fully crawlable is to provide
direct, HTML links to each page you want the search engine spiders to index.
Remember that if a page cannot be accessed from the home page (where most
spiders are likely to start their crawl), it is likely that it will not be
indexed by the search engines. A sitemap (which is discussed later in this
guide) can be of tremendous help for this purpose.
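To check whether every page is reachable from the home page, and how many clicks away it sits, you can walk your own link structure. The sketch below assumes you already have a map of page to outgoing links; the SITE data and function names are hypothetical.

```python
from collections import deque


def click_depths(links, home):
    """Return {page: clicks from home} for every page reachable from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def unreachable_pages(links, home):
    """Return the pages that no chain of links from home leads to."""
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_pages - set(click_depths(links, home)))


# Hypothetical site map: page -> pages it links to.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/widget"],
    "/about": [],
    "/products/widget": [],
    "/landing-2005": ["/"],
}

print(click_depths(SITE, "/"))
print(unreachable_pages(SITE, "/"))
```

Pages listed as unreachable are candidates for a direct HTML link or a sitemap entry.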
Measuring Relevance and Popularity
Modern commercial search engines rely on the science of information retrieval (IR).
That science has existed since the middle of the 20th century, when retrieval
systems powered computers in libraries, research facilities, and government
labs. Early in the development of search systems, IR scientists realized that
two critical components made up the majority of search functionality:
Relevance - the degree to which the content of the documents returned in a
search matched the user's query intention and terms. The relevance of a document
increases if the terms or phrase queried by the user occur multiple times and
show up in the title of the work or in important headlines or subheaders.
Popularity - the relative importance, measured via citation (the act of one work
referencing another, as often occurs in academic and business documents) of a
given document that matches the user's query. The popularity of a given document
increases with every other document that references it.
These two items were translated to web search 40 years later and manifest
themselves in the form of document analysis and link analysis.
In document analysis, search engines look at whether the search terms are found
in important areas of the document - the title, the metadata, the heading tags,
and the body of text content. They also attempt to automatically measure the
quality of the document (through complex systems beyond the scope of this
guide).
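A toy version of document analysis counts query terms in each area of a page and weights the more important areas more heavily. The field names and weights below are assumptions chosen for illustration; real engines do not publish theirs.

```python
import re

# Hypothetical weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_score(query, document):
    """Sum weighted counts of query terms across the document's fields."""
    terms = set(tokenize(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(document.get(field, ""))
        score += weight * sum(1 for token in tokens if token in terms)
    return score


# Hypothetical page, already split into its fields.
PAGE = {
    "title": "Used Car Prices",
    "headings": "Car buying tips",
    "body": "Compare car prices before you buy a car.",
}

print(document_score("car prices", PAGE))
```

Here the title contributes 2 hits at weight 3, the headings 1 hit at weight 2, and the body 3 hits at weight 1, for a score of 11.0.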
In link analysis, search engines measure not only who is linking to a site or
page, but what they are saying about that page/site. They also have a good grasp
on who is affiliated with whom (through historical link data, the site's
registration records, and other sources), who is worthy of being trusted (links
from .edu and .gov pages are generally more valuable for this reason), and
contextual data about the site the page is hosted on (who links to that site,
what they say about the site, etc.).
Link and document analysis combine and overlap hundreds of factors that can be
individually measured and filtered through the search engine algorithms (the set
of instructions that tells the engines what importance to assign to each
factor). The algorithm then determines scoring for the documents and (ideally)
lists results in decreasing order of importance (rankings).
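In its simplest form, combining factors means giving each one a weight, summing the weighted values, and sorting by the total. The sketch below uses just two factors with made-up weights and scores; actual engines combine hundreds of factors in ways they do not disclose.

```python
def rank(candidates, relevance_weight=0.6, popularity_weight=0.4):
    """Sort candidates by a weighted sum of two factor scores, best first.

    Each candidate is a dict with "relevance" and "popularity" in [0, 1].
    The default weights are hypothetical.
    """
    def combined(candidate):
        return (relevance_weight * candidate["relevance"]
                + popularity_weight * candidate["popularity"])

    return sorted(candidates, key=combined, reverse=True)


# Hypothetical documents that all matched the same query.
CANDIDATES = [
    {"url": "a.example", "relevance": 0.9, "popularity": 0.2},
    {"url": "b.example", "relevance": 0.6, "popularity": 0.9},
    {"url": "c.example", "relevance": 0.3, "popularity": 0.4},
]

print([candidate["url"] for candidate in rank(CANDIDATES)])
```

With these weights the well-linked page outranks the page with the better on-page match; shifting weight toward relevance reverses the top two.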
Information Search Engines Can Trust
As search engines index the web's link structure and page contents, they find
two distinct kinds of information about a given site or page - attributes of the
page/site itself and descriptives about that site/page from other pages. Since
the web is such a commercial place, with so many parties interested in ranking
well for particular searches, the engines have learned that they cannot always
rely on websites to be honest about their importance. Thus, the days when
artificially stuffed meta tags and keyword-rich pages dominated search results
(pre-1998) have vanished and given way to search engines that measure trust via
links and content.
The theory goes that if hundreds or thousands of other websites link to you,
your site must be popular, and thus, have value. If those links come from very
popular and important (and thus, trustworthy) websites, their power is
multiplied to even greater degrees. Links from sites like NYTimes.com, Yale.edu,
Whitehouse.gov, and others carry with them inherent trust that search engines
then use to boost your ranking position. If, on the other hand, the links that
point to you are from low-quality, interlinked sites or automated garbage
domains (aka link farms), search engines have systems in place to discount the
value of those links.
The most well-known system for ranking sites based on link data is the
simplistic formula developed by Google's founders - PageRank. PageRank, which
relies on a mathematical formula (based on the likelihood of reaching a given
document by randomly clicking on links), is described by Google in their
technology section:
“PageRank relies on the uniquely democratic nature of the web by using its vast
link structure as an indicator of an individual page's value. In essence, Google
interprets a link from page A to page B as a vote, by page A, for page B. But,
Google looks at more than the sheer volume of votes, or links a page receives;
it also analyzes the page that casts the vote. Votes cast by pages that are
themselves "important" weigh more heavily and help to make other pages
"important."”
Google publishes a PageRank “proxy” value, which logarithmically translates the
actual PageRank of a document to a value between 0 and 10. It uses this value to
order the Web sites listed in its directory (which offers PageRank order or
alphabetical order for listings) and displays it in its toolbar.
Google's toolbar includes an icon that shows a PageRank value from 0 to 10.
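To see what a logarithmic scale implies, consider the sketch below, in which each additional toolbar point requires a fixed multiple of the underlying score. Google has not published the actual mapping, so the base of 8, the raw score units, and the function name are purely hypothetical.

```python
def toolbar_value(raw_score, base=8, max_value=10):
    """Map a raw score onto a 0-10 scale where each step needs `base` times more.

    The base is an assumption; the real conversion is not public.
    """
    value = 0
    threshold = base
    while value < max_value and raw_score >= threshold:
        value += 1
        threshold *= base
    return value


for raw in (1, 8, 63, 64, 8 ** 10):
    print(raw, toolbar_value(raw))
```

Under this assumption, moving from 3 to 4 takes eight times the raw score that moving from 2 to 3 did, which is one reason small differences in the displayed number say little about ranking.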
PageRank is, in essence, a rough system for estimating the value of a given link
based on the links that point to the host page. Since PageRank's inception in
the late '90s, more subtle and sophisticated link analysis systems have taken
the place of PageRank. Thus, in the modern era of SEO, the PageRank measurement
in Google's toolbar, directory, or through sites that query the service is of
limited value. Pages with PR8 can be found ranked 20-30 positions below pages
with a PR3 or PR4. In addition, the toolbar numbers are updated only every 3-6
months by Google, making the values even less useful. Rather than focusing on
PageRank, it's important to think holistically about a link's worth.
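For readers who want to see the voting idea concretely, the sketch below runs a simplified PageRank calculation on a four-page graph. The damping factor of 0.85, the iteration count, and the graph itself are assumptions not taken from this guide, and the code is not Google's implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for `links`, a dict of page -> list of linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    count = len(pages)
    ranks = {page: 1.0 / count for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_ranks = {page: (1.0 - damping) / count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its vote everywhere.
                share = damping * ranks[page] / count
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: C receives links from A, B, and D; nothing links to D.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because it collects the most votes, and page A ranks second on the strength of a single link from the highly ranked C, which shows how a vote from an "important" page weighs more.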
Here's a small list of the most important factors search engines look at when
attempting to value a link:
The Anchor Text of Link - Anchor text describes the visible characters and words
that hyperlink to another document or location on the web. For example, in the
phrase "CNN is a good source of news, but I actually prefer the BBC's take on
events," two unique pieces of anchor text exist - "CNN" is the anchor text
pointing to http://www.cnn.com, while "the BBC's take on events" points to
http://news.bbc.co.uk. Search engines use this text to help them determine the
subject matter of the linked-to document. In the example above, the links would
tell the search engine that when users search for "CNN", SEOmoz.org thinks that
http://www.cnn.com is a relevant site for the term "CNN" and that http://news.bbc.co.uk
is relevant to "the BBC's take on events". If hundreds or thousands of sites
think that a particular page is relevant for a given set of terms, that page can
manage to rank well even if the terms NEVER appear in the text itself (for
example, see the BBC's explanation of why Google ranks certain pages for the
term "Miserable Failure").
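The way anchor text accumulates across many linking pages can be sketched as follows. The sample pages and the class and function names are hypothetical; the first page reuses the CNN and BBC sentence from the example above.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def anchor_text_profile(pages):
    """Map each link target to a Counter of the anchor texts pointing at it."""
    profile = defaultdict(Counter)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.anchors:
            profile[href][text.lower()] += 1
    return profile


# Hypothetical linking pages.
LINKING_PAGES = [
    """<a href="http://www.cnn.com">CNN</a> is a good source of news, but I
    actually prefer <a href="http://news.bbc.co.uk">the BBC's take on events</a>.""",
    '<p>Breaking stories at <a href="http://www.cnn.com">CNN</a>.</p>',
    '<p>Watch <a href="http://www.cnn.com">cable news</a> online.</p>',
]

profile = anchor_text_profile(LINKING_PAGES)
print(profile["http://www.cnn.com"].most_common())
```

The more pages that use the same words to link to a target, the stronger the signal that the target is relevant for those words, whether or not the words appear on the target page itself.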
Global Popularity of the Site - More popular sites, as denoted by the number and
power of the links pointing to them, provide more powerful links. Thus, while a
link from SEOmoz may be a valuable vote for a site, a link from bbc.co.uk or
cnn.com carries far more weight. This is one area where PageRank (assuming it
were accurate) could be a good measure, as it's designed to calculate global
popularity.
Popularity of Site in Relevant Communities - In the example above, the weight or
power of a site's vote is based on its raw popularity across the web. As search
engines became more sophisticated and granular in their approach to link data,
they acknowledged the existence of "topical communities": sites on the same
subject that often interlink with one another, referencing documents and
providing unique data on a particular topic. Sites in these communities provide
more value when they link to a site/page on a relevant subject rather than a
site that is largely irrelevant to their topic.
Text Directly Surrounding the Link - Search engines have been noted to weight
the text directly surrounding a link with greater importance and relevance than
the other text on the page. Thus, a link from inside an on-topic paragraph may
carry greater weight than a link in the sidebar or footer.
Subject Matter of the Linking Page - The topical relationship between the
subject of a given page and the sites/pages linked to on it may also factor into
the value a search engine assigns to that link. Thus, it will be more valuable
to have links from pages that are related to the site/page's subject matter than
those that have little to do with the topic.
These are only a few of the many factors search engines measure and weigh when
evaluating links. For a more complete list, see SEOmoz's search engine ranking
factors article.
Link metrics are in place so that search engines can find information to trust.
In the academic world, greater citation meant greater importance, but in a
commercial environment, manipulation and conflicting interests interfere with
the purity of citation-based measurements. Thus, on the modern WWW, the source,
style, and context of those citations are vital to ensuring high quality results.