All About GOOGLE PAGERANK:- The story of Google’s algorithm - TopicsExpress

The story of Google’s algorithm begins with PageRank, the system invented in 1997 by cofounder Larry Page while he was a grad student at Stanford. Page’s now legendary insight was to rate pages based on the number and importance of links that pointed to them, using the collective intelligence of the Web itself to determine which sites were most relevant. It was a simple and powerful concept, and as Google quickly became the most successful search engine on the Web, Page and cofounder Sergey Brin credited PageRank as their company’s fundamental innovation.
But that wasn’t the whole story. “People hold on to PageRank because it’s recognizable,” says Google’s Udi Manber. “But there were many other things that improved the relevancy.” These involve the exploitation of certain signals, contextual clues that help the search engine rank the millions of possible results to any query, ensuring that the most useful ones float to the top.
Web search is a multipart process. First, Google crawls the Web to collect the contents of every accessible site. This data is broken down into an index (organized by word, just like the index of a textbook), a way of finding any page based on its content. Every time a user types a query, the index is combed for relevant pages, returning a list that commonly numbers in the hundreds of thousands, or millions. The trickiest part, though, is the ranking process: determining which of those pages belong at the top of the list.
That’s where the contextual signals come in. All search engines incorporate them, but none has added as many or made use of them as skillfully as Google has. PageRank itself is a signal, an attribute of a Web page (in this case, its importance relative to the rest of the Web) that can be used to help determine relevance. Some of the signals now seem obvious. Early on, Google’s algorithm gave special consideration to the title on a Web page, clearly an important signal for determining relevance.
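To make the link-counting idea behind PageRank concrete, here is a minimal sketch in Python. It is only an illustration of rating pages by the number and importance of their incoming links, not Google’s production code. The function name `pagerank`, the damping value of 0.85, the iteration count, and the four-page toy web are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rate pages by the number and importance of their incoming links.

    `links` maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "hub".
toy_web = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}
scores = pagerank(toy_web)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy web, "hub" collects links from three pages, one of which is itself well linked, so it ends up with the highest score. A link from an important page is worth more than a link from an obscure one, which is the point of weighing importance as well as count.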
Another key technique exploited anchor text, the words that make up the actual hyperlink connecting one page to another. As a result, “when you did a search, the right page would come up, even if the page didn’t include the actual words you were searching for,” says Scott Hassan, an early Google architect who worked with Page and Brin at Stanford. “That was pretty cool.” Later signals included attributes like freshness (for certain queries, pages created more recently may be more valuable than older ones) and location (Google knows the rough geographic coordinates of searchers and favors local results). The search engine currently uses more than 200 signals to help rank its results.
Google’s engineers have discovered that some of the most important signals can come from Google itself. PageRank has been celebrated as instituting a measure of populism into search engines: the democracy of millions of people deciding what to link to on the Web. But Google’s Amit Singhal notes that the engineers in Building 43 are exploiting another democracy: the hundreds of millions who search on Google. The data people generate when they search (what results they click on, what words they replace in the query when they’re unsatisfied, how their queries match with their physical locations) turns out to be an invaluable resource in discovering new signals and improving the relevance of results.
The most direct example of this process is what Google calls personalized search, a feature that uses someone’s search history and location as signals to determine what kind of results they’ll find useful. But more generally, Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games,” not poached pooches. That helped the algorithm understand what “hot dog” (and millions of other terms) meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
Throughout its history, Google has devised ways of adding more signals, all without disrupting its users’ core experience. Every couple of years there’s a major change in the system (sort of equivalent to a new version of Windows) that’s a big deal in Mountain View but not discussed publicly. “Our job is to basically change the engines on a plane that is flying at 1,000 kilometers an hour, 30,000 feet above Earth,” Singhal says. In 2001, to accommodate the rapid growth of the Web, Singhal essentially revised Page and Brin’s original algorithm completely, enabling the system to incorporate new signals quickly. (One of the first signals on the new system distinguished between commercial and noncommercial pages, providing better results for searchers who want to shop.)
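The context idea behind the “hot dog” fix described earlier (judging a term by the words found near it) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not Google’s method: the helper names `context_counts` and `cosine`, the three-word window, and the five sample sentences are all invented for the example.

```python
import math
from collections import Counter


def context_counts(phrase, documents, window=3):
    """Count the words that appear within `window` words of `phrase`."""
    target = phrase.lower().split()
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for i in range(len(words) - len(target) + 1):
            if words[i:i + len(target)] == target:
                before = words[max(0, i - window):i]
                end = i + len(target)
                after = words[end:end + window]
                counts.update(before + after)
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical snippets standing in for crawled text.
documents = [
    "grab a hot dog with mustard at the baseball game",
    "eat a hot dog on bread with mustard",
    "eat a frankfurter on bread with mustard",
    "our puppy chased a ball in the park",
    "our dog chased a ball in the park",
]

hot_dog = context_counts("hot dog", documents)
hot_dog_vs_frankfurter = cosine(hot_dog, context_counts("frankfurter", documents))
hot_dog_vs_puppy = cosine(hot_dog, context_counts("puppy", documents))
print(hot_dog_vs_frankfurter > hot_dog_vs_puppy)
```

Because “hot dog” is treated as a single unit, its neighbors are words like “bread” and “mustard,” which it shares with “frankfurter” far more than with “puppy.” A real system would work over billions of documents and would discount common words such as “a,” which this sketch does not attempt.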
That same year, an engineer named Krishna Bharat, figuring that links from recognized authorities should carry more weight, devised a powerful signal that confers extra credibility to references from experts’ sites. (It would become Google’s first patent.) The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals.
Google is famously creative at encouraging these breakthroughs; every year, it holds an internal demo fair called CSI (Crazy Search Ideas) in an attempt to spark offbeat but productive approaches. But for the most part, the improvement process is a relentless slog, grinding through bad results to determine what isn’t working.
One unsuccessful search became a legend: Sometime in 2001, Singhal learned of poor results when people typed the name “audrey fino” into the search box. Google kept returning Italian sites praising Audrey Hepburn. (Fino means fine in Italian.) “We realized that this is actually a person’s name,” Singhal says. “But we didn’t have the smarts in the system.”
The Audrey Fino failure led Singhal on a multiyear quest to improve the way the system deals with names, which account for 8 percent of all searches. To crack it, he had to master the black art of “bi-gram breakage,” that is, separating multiple words into discrete units. For instance, “new york” represents two words that go together (a bi-gram). But so would the three words in “new york times,” which clearly indicate a different kind of search. And everything changes when the query is “new york times square.” Humans can make these distinctions instantly, but Google does not have a Brazil-like back room with hundreds of thousands of cubicle jockeys. It relies on algorithms.
The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around.
“The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”
And Google keeps improving. Recently, search engineer Maureen Heymans discovered a problem with “Cindy Louise Greenslade.” The algorithm figured out that it should look for a person (in this case a psychologist in Garden Grove, California) but it failed to place Greenslade’s homepage in the top 10 results. Heymans found that, in essence, Google had downgraded the relevance of her homepage because Greenslade used only her middle initial, not her full middle name as in the query. “We needed to be smarter than that,” Heymans says. So she added a signal that looks for middle initials. Now Greenslade’s homepage is the fifth result.
At any moment, dozens of these changes are going through a well-oiled testing process. Google employs hundreds of people around the world to sit at their home computers and judge results for various queries, marking whether the tweaks return better or worse results than before.
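The middle-initial fix that Heymans describes can be pictured with a short Python sketch. It is a guess at the general shape of such a check, not the signal Google actually added. The function names and the page-side spelling “Cindy L. Greenslade” are assumptions made for illustration, since the article says only that the homepage used a middle initial.

```python
def _parts(name):
    """Lowercase a name, drop periods, and split it into words."""
    return name.lower().replace(".", "").split()


def _middle_matches(a, b):
    """Middle names match if equal, or if one is the other's initial."""
    if a == b:
        return True
    return (len(a) == 1 or len(b) == 1) and a[0] == b[0]


def name_matches(query_name, page_name):
    """Compare two names, letting a middle initial stand in for a middle name."""
    query, page = _parts(query_name), _parts(page_name)
    if len(query) < 2 or len(query) != len(page):
        return False
    # First and last names must agree exactly.
    if query[0] != page[0] or query[-1] != page[-1]:
        return False
    return all(_middle_matches(q, p) for q, p in zip(query[1:-1], page[1:-1]))


print(name_matches("Cindy Louise Greenslade", "Cindy L. Greenslade"))
print(name_matches("Cindy Louise Greenslade", "Cindy M. Greenslade"))
```

The first call treats “L.” as compatible with “Louise,” while the second rejects a different initial. In a ranking system, a match like this would presumably feed in as one signal among many instead of acting as a yes-or-no filter.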
But Google also has a larger army of testers: its billions of users, virtually all of whom are unwittingly participating in its constant quality experiments. Every time engineers want to test a tweak, they run the new algorithm on a tiny percentage of random users, letting the rest of the site’s searchers serve as a massive control group. There are so many changes to measure that Google has discarded the traditional scientific nostrum that only one experiment should be conducted at a time. “On most Google queries, you’re actually in multiple control or experimental groups simultaneously,” says search quality engineer Patrick Riley. Then he corrects himself. “Essentially,” he says, “all the queries are involved in some test.” In other words, just about every time you search on Google, you’re a lab rat.
This flexibility (the ability to add signals, tweak the underlying code, and instantly test the results) is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months Google has made more than 200 improvements, some of which seem to mimic, even outdo, the offerings of its competitors. (Google says this is just a coincidence and points out that it has been adding features routinely for years.)
One is real-time search, eagerly awaited since Page opined some months ago that Google should be scanning the entire Web every second. When someone queries a subject of current interest, among the 10 blue links Google now puts a “latest results” box: a scrolling set of just-produced posts from news sources, blogs, or tweets. Once again, Google uses signals to ensure that only the most relevant tweets find their way into the real-time stream. “We look at what’s retweeted, how many people follow the person, and whether the tweet is organic or a bot,” Singhal says. “We know how to do this, because we’ve been doing it for a decade.”
Along with real-time search, Google has introduced other new features, including a service called Goggles, which treats images captured by users’ phones as search queries. It’s all part of the company’s relentless march toward search becoming an always-on, ubiquitous presence. With a camera and voice recognition, a smartphone becomes eyes and ears. If the right signals are found, anything can be query fodder.
Google’s massive computing power and bandwidth give the company an undeniable edge. Some observers say it’s an advantage that essentially prohibits startups from trying to compete. But Manber says it’s not infrastructure alone that makes Google the leader: “The very, very, very key ingredient in all of this is that we hired the right people.”
By all standards, Qi Lu qualifies as one of those people. “I have the highest regard for him,” says Manber, who worked with the 48-year-old computer scientist at Yahoo. But Lu joined Microsoft early last year to lead the Bing team. When asked about his mission, Lu, a diminutive man dressed in jeans and a Bing T-shirt, pauses, then softly recites a measured reply: “It’s extremely important to keep in mind that this is a long-term journey.” He has the same I’m-not-going-away look in his eye that Uma Thurman has in Kill Bill.
Indeed, the company that won last decade’s browser war has a best-served-cold approach to search, an eerie certainty that at some point, people are going to want more than what Google’s algorithm can provide. “If we don’t have a paradigm shift, it’s going to be very, very difficult to compete with the current winners,” says Harry Shum, Microsoft’s head of core search development. “But our view is that there will be a paradigm shift.”
Still, even if there is such a shift, Google’s algorithms will probably be able to incorporate that, too. That’s why Google is such a fearsome competitor; it has built a machine nimble enough to absorb almost any approach that threatens it, all while returning high-quality results that its competitors can’t match. Anyone can come up with a new way to buy plane tickets. But only Google knows how to find Mike Siwek.

Posted on: Sat, 28 Sep 2013 09:41:42 +0000
