Google scans over one trillion indexed pages

August 4, 2008

by Mark Ollig

Whenever a numerical milestone is reached – especially when it is a large number – we tend to stop and look at it.

Today’s milestone number is one trillion (as in 1,000,000,000,000).

Engineers at the popular online search engine Google made the announcement last week.

The number of unique URLs Google has found and recorded across the Internet has passed 1 trillion.

I wonder what our old friend Paul Otlet would have thought about this!

As you know, a URL (Uniform Resource Locator) is an address that specifies the location of a page file on the Internet.

For example, “http://writ.umn.edu/courses/” is a unique URL that points to the page listing writing courses available at the University of Minnesota.
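To make the parts of that address concrete, here is a small sketch using the `urllib.parse` module from Python's standard library. The variable names are my own choices for illustration; only the URL comes from the example above.

```python
from urllib.parse import urlparse

# The example address from the column.
url = "http://writ.umn.edu/courses/"

parts = urlparse(url)

scheme = parts.scheme   # how to fetch the page: "http"
host = parts.netloc     # which computer to ask: "writ.umn.edu"
path = parts.path       # which page on that computer: "/courses/"

print(scheme, host, path)
```

Two addresses count as different URLs if any of these pieces differ, even when they end up showing the same page.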

In 1998 the first Google index found 26 million unique pages – by 2000 it reached a remarkable one billion pages.

As far as the total number of web page “links” goes, it is practically endless, since most web pages contain many links.

This milestone focuses on the number of “unique” web page links out there on the Internet.

As you know, the huge amount of information Google provides can sometimes be quite overwhelming.

I heard one humorous explanation that compares using Google to find information to taking a sip at a drinking fountain and having the water spray at you with the full pressure of a fire hose.

I visited the “Official” Google blog, and read the online post by Jesse Alpert and Nissan Hajaj, who are software engineers for the Google Web Search Infrastructure Team.

In their post they wrote, “To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google’s index for a fixed period of time.”

Alpert and Hajaj went on to say, “Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day.”

“This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it’d be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections,” they both said in the blog.
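The arithmetic behind that comparison is easy to check. The short sketch below simply divides one trillion by 50,000. The resulting figure of roughly 20 million U.S. intersections is what the engineers' analogy implies, not a number they stated, and the variable names are mine.

```python
# Figures taken from the quote above.
total_urls = 1_000_000_000_000   # one trillion URLs, one per "intersection"
scale_factor = 50_000            # the map is "about 50,000 times as big as the U.S."

# Implied number of intersections on a U.S.-sized map.
implied_us_intersections = total_urls // scale_factor

print(f"{implied_us_intersections:,}")  # 20,000,000
```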

How do they find all these linked pages?

Here is what Alpert and Hajaj said in the Google blog: “We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages.

“Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.”
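The process the engineers describe is, in essence, a breadth-first walk over links. Here is a minimal sketch of the idea, assuming a tiny made-up web held in a Python dictionary rather than real pages. The page names, the `crawl` function, and the trick of comparing page content to spot duplicates are my own illustration, not Google's actual software.

```python
from collections import deque

# A hypothetical miniature web: URL -> (page content, links found on that page).
TOY_WEB = {
    "a.example/": ("home", ["a.example/news", "a.example/index.html"]),
    # Same content as "a.example/", reached through a different URL.
    "a.example/index.html": ("home", ["a.example/news"]),
    "a.example/news": ("news", ["b.example/", "a.example/"]),
    "b.example/": ("other", []),
}

def crawl(seeds):
    """Follow links outward from the seed pages, breadth-first."""
    queue = deque(seeds)       # pages waiting to be visited
    seen_urls = set(seeds)     # every URL discovered so far
    seen_content = set()       # content already recorded, to drop exact duplicates
    unique_pages = []          # first URL found for each distinct page
    links_followed = 0         # every individual link encountered

    while queue:
        url = queue.popleft()
        content, links = TOY_WEB.get(url, ("", []))
        if content not in seen_content:
            seen_content.add(content)
            unique_pages.append(url)
        for link in links:
            links_followed += 1
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)

    return unique_pages, len(seen_urls), links_followed

pages, urls_found, links_followed = crawl(["a.example/"])
print(links_followed, urls_found, len(pages))  # 5 4 3
```

Even in this toy example the pattern from the quote shows up: five links lead to four URLs, but only three of them are truly different pages.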

I feel a headache coming on . . . Where’s my bottle of Advil?

Google was also in another news story recently.

A new search engine called “Cuil” (pronounced “cool”) recently went online.

Cuil is the creation of former Google engineer Anna Patterson, who created Google’s “TeraGoogle” search index software in 2006.

She worked for Google from 2004 through 2006.

Patterson teamed up with her husband, Tom Costello, who built a search engine called “Xift” in the late 1990s. He had worked for IBM and designed an “analytic engine” called “WebFountain.”

Costello’s Irish heritage inspired the name “Cuil” for their new web site.

I learned Cuil is named after the Celtic folklore character Finn McCuill.

Patterson said she enjoyed her time at Google but “. . . became disillusioned with the company’s approach to search.”

In one quote she seems to be disgruntled with her former employer. “Google has looked pretty much the same for 10 years now,” Patterson said, “and I can guarantee it will look the same a year from now.”

In using the Cuil search engine (which currently has no advertisements), I was able to find fairly good results for the basic search terms I entered.

I also liked the three-column display of the search results and the photos placed next to each result. Granted, it just went online, so it will no doubt be going through many updates and improvements.

Cuil will definitely need some enhancements before it can become the “Google killer of search engines” that some articles have touted it as being.

You can visit this new search engine on the Web at: http://www.cuil.com.

The “Official Google Blog” is located at http://googleblog.blogspot.com.