Google Patent: Personalization of Placed Content Ordering in Search Results

This is a summary of Google’s recent patent on personalized search results. It follows the format of Rand’s Historical Data report for two reasons. First, consistency: I’m assuming everyone is familiar with his report and will therefore be familiar with this format. Second, I can’t think of a better format.

Everything here is my own interpretation of the patent.

Overview of Important Concepts

These concepts are what I believe are the most important for search engine optimizers and marketers to understand in order to benefit from this report.

Google’s Goal to “Personalize” Search

Google understands that not all search results are currently relevant for everyone. For instance, if someone searches for “blackberry”, how does Google know whether they are searching for BlackBerry devices or for blackberries to cook with? Depending on the searcher, one topic will be more relevant than the other. In order to present better search results, personalized results will take a user’s profile (more on that later) into consideration. So not only will Google rank websites on the typical factors (linkage, textual analysis, click through rate), it will now also incorporate the user’s historical information.

User Profile

The user profile is based upon user history. The patent specifically outlines: user search query history, documents returned in the search results, documents visited in the search results, anchor text of the documents, topics of the documents, outbound links of the documents, click through rate, format of documents, time spent looking at document, time spent scrolling a document, whether a document is printed/bookmarked/saved, repeat visits, browsing pattern, groups of individuals with similar profile, and user submitted information. All this data can be mined by programs Google has already released – Google Desktop Search, Personalized Search History, and Google Toolbar.

Search Query History

One of the biggest concepts of the patent is Google’s use of search query history. Google is building a profile based on past searches, but they are tracking many more things than previously thought. They are using past search queries to form user term profiles and then comparing the term profiles to profiles of placed content (advertisements). They are also performing analysis on search history documents to figure out what types of documents interest you. Linkage data will play a big role here.

Content Profile

Content profiles will be generated for advertisements. The content profiles consist of categories (sets of terms), each mapped to a specific weight. Basically a value system: for example, a sports category may consist of the terms “basketball”, “football”, and “soccer”, and each category carries a weight based on its relevance to the content. Time to read up on term-weight vectors. These content profiles are then compared to the user profile in order to generate a similarity score for ranking.
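
To make the comparison concrete, here is a minimal sketch of scoring a user profile against two ad profiles. It assumes cosine similarity over category weights, which is a common choice for term-weight vectors but is my assumption, not something this summary pins down. The category names and weights are made up for illustration.

```python
import math


def similarity(user_profile, content_profile):
    """Cosine similarity between two sparse category-weight profiles.

    Assumption: cosine similarity stands in for the comparison;
    the real scoring function may differ.
    """
    shared = set(user_profile) & set(content_profile)
    dot = sum(user_profile[c] * content_profile[c] for c in shared)
    norm_user = math.sqrt(sum(w * w for w in user_profile.values()))
    norm_content = math.sqrt(sum(w * w for w in content_profile.values()))
    if norm_user == 0 or norm_content == 0:
        return 0.0
    return dot / (norm_user * norm_content)


# Hypothetical profiles: category -> weight
user = {"sports": 0.8, "cooking": 0.1, "gadgets": 0.4}
sports_ad = {"sports": 0.9, "travel": 0.2}
cooking_ad = {"cooking": 1.0}

print(similarity(user, sports_ad), similarity(user, cooking_ad))
```

With these numbers the sports ad scores higher than the cooking ad, because the user's profile puts most of its weight on sports.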

What is Google trying to measure?

Google wants to measure, is attempting to measure, or already measures:

    Document Information

  • Textual Analysis
  • Linkage Data
  • Topical Analysis

    Personal Information

  • Demographic
  • Geographic

    User Behavior Online

  • Past Search Queries
  • Visited Documents
  • Click Through Rate
  • Search Pattern
  • Browse Pattern
  • Document visit pattern
  • Time spent viewing document
  • Amount of document viewed
  • Similar users
  • Repeat visits/searches
  • Favorite hosts/sites

    User Behavior Offline

  • Bookmarked Sites
  • Saved sites
  • Document format preference
  • Document language preference
  • Text saved (copy/paste)
  • Printed documents

Impact

This is where search engines are heading. Personalized search means better search results, so anyone planning to continue with search engine marketing needs to understand how personalization is going to affect the future of search.

Analysis/Interpretation of 52 Patent Components

Personalization of Advertisements

1. Ads will be personalized based upon interests of user and user profile. Assuming this will lead to more effective ad campaigns, this means more money for Google and better leads for businesses.
2. Bid is a factor in ranking order (as related to user profile). This allows for more specific bidding in the future. If you only want to target a specific user profile, you could place higher or lower bids based on the profile instead of the search query.
3. Click through rate is a factor in ranking order. Again this is related to user profile.

Search Query

4. Ads will be personalized based upon old search queries and user profile. Past search queries are already being saved by Google Search History.
5. Bid is a factor in ranking order.
6. Click through rate is a factor in ranking order.

Scoring Factors for Ad Placement

7. Ads will be personalized based upon search query, user profile, matching content set, bid and click through rate.

Search History

8. User profile is in part based upon previous search query terms.
9. User profile also based upon search results returned. For each of the following documents, Google is monitoring user browsing pattern (visits, time spent, links followed), document analysis (text, anchor text, linkage), and offline pattern (printing, saving, partial save).
a. Documents listed in search results
b. Documents linked to by these documents (a)
c. Documents browsed by the user
d. Documents linked to by the documents browsed by the user

Similarity Score

10. A similarity score is calculated to determine how related the content is to the user profile.
11. User profile category weights are compared to content profile category weights to determine the score. Quick and dirty explanation: terms in the advertisement are matched against terms in the user profile. I’d recommend reading up on term-weight vectors.
12. A scaling factor is applied.
13. Associate scaling factor to a bunch of other things (more on this later).
14. Score consists of scaling factor, bid, and click through rate.
15. The bigger the score, the lower the scaling factor.
16. Scaling factor is determined by click through rate and similarity score.
17. Typo? Refers to claim 71, but there is no claim 71.

Advertisement Server (Google Mini)

18. A server that assigns scores to different advertisements based upon relevancy to a user profile. It identifies ads that will be of interest to a user. This sounds like a Google Mini for serving personalized ads.
19-34. Same stuff as 1-17.

Software (Google Desktop/Toolbar)

35. Computer program that identifies user interests and serves personalized content. Sounds like Google Desktop and personalized ads.
36-51. Same stuff as 1-17.

System

52. Google first takes a search query, then it finds the user profile, grabs the ad content that matches the search query, compares the ad content to user profile, then ranks the ad content based on interest score.

Patent Description

The invention relates to search engines in a network environment (internet or intranet) and the creation/use of user profiles to rank content (advertisements and search results).

Background

Before, there was PageRank; now there is personalization.

Every user has his own preferences when he submits a query to a search engine. The quality of the search results returned by the engine has to be evaluated by its users’ satisfaction. When a user’s preferences can be well defined by the query itself, or when the user’s preference is similar to the random surfer’s preference with respect to a specific query, the user is more likely to be satisfied with the search results. However, if the user’s preference is significantly biased by some personal factors that are not clearly reflected in a search query itself, or if the user’s preference is quite different from the random user’s preference, the search results from the same search engine may be less useful to the user, if not useless.

Summary

Google is going to use historical information to generate a user profile. This user profile will be used to rank search engine results and advertisements.

Drawings

Illustrates everything that I went through before.

Description of Embodiments

This is the explanation of all the points presented earlier. I’d recommend that everyone actually spend time reading through this part of the patent.

User Profile

Google is generating the user profile based upon data from a user’s past search activity. Google describes some of the various information sources they feel are important enough to track.

Google uses previously submitted search queries to help determine a user’s interests. If a user consistently searches for documents related to a specific topic, Google infers that the user is interested in that topic. URLs and anchor text are also used to determine interests. What I find really interesting is that Google specifically mentions the URLs and anchor text of the search results. Most of the time the anchor text of a search result is the page title, so here’s a prime example of the importance of the page title for search engine marketing.

Identified Documents

A document that Google finds important for a user profile is called an “identified document”.

For purposes of forming a user profile, the identified documents from which information is derived for inclusion in the user profile may include: documents identified by search results from the search engine, documents accessed (e.g., viewed or downloaded, for example using a browser application) by the user (including documents not identified in prior search results), documents linked to the documents identified by search results from the search engine, and documents linked to the documents accessed by the user, or any subset of such documents.

For each identified document, Google determines: the format of the document (HTML/text/PDF/Word/etc.), what language the document is in, the topic of the document, and how the user responds to the document (time spent viewing, scrolling activity, and whether it was printed, saved, or bookmarked).

Key terms and their frequencies are analyzed for each document. (Term-weight)

Browsing Patterns

Another source of information that Google tracks is user browsing patterns. This may be the number of URLs visited in a certain time frame or how a user moves from one URL to another.

Age Weight

Profile information is weighted by age, so the more recent the information, the more important it is. (Refer back to Rand’s analysis.)
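
As a rough illustration of age weighting, the sketch below discounts each past search by how old it is. The exponential decay and the 30-day half-life are my assumptions; this summary only says that newer information counts for more. The search history is invented.

```python
def age_weight(age_days, half_life_days=30.0):
    """Weight that halves every `half_life_days` (assumed decay shape)."""
    return 0.5 ** (age_days / half_life_days)


def weighted_term_counts(events):
    """Sum age-discounted counts for (term, age_days) pairs."""
    totals = {}
    for term, age_days in events:
        totals[term] = totals.get(term, 0.0) + age_weight(age_days)
    return totals


# Hypothetical search history: (query term, age of the search in days)
history = [
    ("blackberry", 1),
    ("blackberry", 2),
    ("jam recipe", 90),
    ("jam recipe", 120),
    ("jam recipe", 150),
]
profile = weighted_term_counts(history)
print(profile)
```

Even though “jam recipe” was searched three times and “blackberry” only twice, the recent searches dominate the profile because the old ones have decayed.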

Personal Information

Optional personal information may be used to rank content.

Demographic and geographic information associated with the user, such as the user’s age or age range, educational level or range, income level or range, language preferences, marital status, geographic location (e.g., the city, state and country in which the user resides, and possibly also including additional information such as street address, zip code, and telephone area code), cultural background or preferences, or any subset of these.

User Profile Consists of Smaller Profiles

A user profile can be broken down into three smaller profiles: term-based, category-based, link based.

A term-based profile represents interest based on specific terms. These profiles show how important a specific term is to a user. If a document matches a term in a user’s term-based profile (same term in both the document and profile), then the document is assigned that term’s weight. Notice that I am using “term” and not “word”; this is because a term can contain more than one word. The weight of a term can be positive or negative. A positive weight means the user is interested in seeing that term in the results; a negative weight means the opposite.
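
Here is a small sketch of how a term-based profile with positive and negative weights could score a document. Summing the weights of every matched term is my assumption about how multiple matches combine, and the profile values are invented.

```python
import re


def term_score(document_text, term_profile):
    """Add up the weights of all profile terms found in the document.

    Terms may span several words; matching is case-insensitive and
    respects word boundaries. Summation is an assumed combination rule.
    """
    text = document_text.lower()
    total = 0.0
    for term, weight in term_profile.items():
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            total += weight
    return total


# Hypothetical profile of someone who wants phones, not desserts
term_profile = {"blackberry": 0.5, "mobile phone": 0.7, "pie recipe": -0.6}

doc_phone = "Review: the new BlackBerry mobile phone"
doc_pie = "Grandma's blackberry pie recipe"

print(term_score(doc_phone, term_profile), term_score(doc_pie, term_profile))
```

Both documents match “blackberry”, but the negative weight on “pie recipe” pushes the dessert page below the phone review.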

Since a term-based approach has some flaws, a category-based profile is needed also.

Category-based profiles can be generated from category maps like DMOZ, which groups documents under specific topics. These categories are then weighted to represent user interest. The categories are determined by search history, URLs identified by previous search queries, general information about the identified documents, sampled content, category information, and the user’s personal information. Category-based profiles do not have to be topically organized. They can be organized by format, location, origin, language, etc. Google specifically states that the type of document may have a different weight:

In one embodiment, a user’s preference can be categorized based on the formats of the documents identified by the user, such as HTML, plain text, PDF, Microsoft Word, etc. Different formats may have different weights. In another embodiment, a user’s preference can be categorized according to the types of the identified documents, e.g., an organization’s homepage, a person’s homepage, a research paper, or a news group posting, each type having an associated weight.

In addition to category-based and term-based profiles, link-based profiles are also used. Link-based profiles are determined by PageRank, the list of URLs frequently accessed, time spent at a URL, and preferred hosts. Subdomains pass value to the parent domain. Weights are determined by how far the document is from the identifying document. So for instance, if a search result returns the homepage for Search Engine Watch and from that page you can reach SEOmoz in two links and my site in one link, my site will have a larger weight.
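
The distance idea can be sketched with a breadth-first walk over a link graph. The halving of weight per link and the two-link cutoff are my assumptions, and the hostnames are placeholders rather than real sites.

```python
from collections import deque


def link_weights(graph, start, decay=0.5, max_depth=2):
    """Weight each reachable URL by its link distance from `start`.

    Assumption: weight = decay ** distance, so closer pages weigh more.
    """
    weights = {start: 1.0}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue
        for target in graph.get(url, []):
            if target not in weights:  # keep the shortest distance only
                weights[target] = decay ** (depth + 1)
                queue.append((target, depth + 1))
    return weights


# Hypothetical link graph: page -> pages it links to
graph = {
    "searchresult.example": ["mysite.example"],
    "mysite.example": ["twolinksaway.example"],
}
weights = link_weights(graph, "searchresult.example")
print(weights)
```

The page one link away from the search result ends up with a larger weight than the page two links away, which matches the example above.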

Term Analysis

Here Google talks a little bit about how they perform document analysis.

Given a particular document, Google determines the value of specific terms by location and importance. For instance they state that the document’s title may be very important while navigation/copyright statements/disclaimers are not as important.

Paragraph Sampling

Assuming less relevant content is usually short segments of text, Google finds the most important areas of content. Paragraph sampling looks for the longest paragraphs in a document. Paragraphs are then processed in order of decreasing length. If there are not enough paragraphs to analyze, Google then pulls text from anchor text and alt tags. Paragraphs are determined by appearance in browser.
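
A minimal sketch of paragraph sampling, assuming a simple character budget decides when “enough” text has been collected (the budget and its size are my invention). Paragraphs are taken longest first, and anchor or alt text is used only if the paragraphs run out.

```python
def sample_paragraphs(paragraphs, fallback_text=(), max_chars=500):
    """Collect text longest-paragraph-first until the budget is reached.

    `max_chars` is a hypothetical stopping rule. `fallback_text` stands
    in for anchor text and alt text, used only if paragraphs fall short.
    """
    sampled = []
    used = 0
    for paragraph in sorted(paragraphs, key=len, reverse=True):
        if used >= max_chars:
            break
        sampled.append(paragraph)
        used += len(paragraph)
    for extra in fallback_text:
        if used >= max_chars:
            break
        sampled.append(extra)
        used += len(extra)
    return sampled


page = ["Copyright 2005", "A" * 300, "B" * 250, "Home | About"]
print([len(p) for p in sample_paragraphs(page)])
```

The two long paragraphs fill the budget, so the copyright line and the navigation text are never looked at.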

Context Analysis

The content is then scanned for patterns of words (prefix/postfix). Google looks for the words before and after a specific term. They then give a weight to these prefix/postfix words. Specific prefix/postfix words may weigh more than others. So not only is Google looking for patterns in specific search terms, they are looking for patterns in the words surrounding the search term.

Formulas/Calculations

The rest of the document walks through some generalized formulas and calculations. I’m urging everyone to read through this part, because I’m not going to summarize it.

Placed Content (ads):
Each placed content has a profile associated with it. Compare the content profile to the user profile, then obtain a similarity score.

Score = scaling factor x CTR x bid
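
To see how this score reorders ads, here is a small sketch. The scaling factor is passed in as a plain number because this summary does not spell out how it is derived from the similarity score; the ad names and values are invented.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float             # what the advertiser pays per click
    ctr: float             # click through rate
    scaling_factor: float  # assumed to be precomputed from the similarity score


def score(ad):
    """Score = scaling factor x CTR x bid."""
    return ad.scaling_factor * ad.ctr * ad.bid


def rank_ads(ads):
    """Order ads from highest to lowest score."""
    return sorted(ads, key=score, reverse=True)


ads = [
    Ad("pie-recipes", bid=1.50, ctr=0.04, scaling_factor=0.5),
    Ad("phone-deal", bid=1.00, ctr=0.04, scaling_factor=1.5),
]
print([ad.name for ad in rank_ads(ads)])
```

The phone ad outranks the pie ad despite the lower bid, because its larger scaling factor (a better match to this user's profile) outweighs the bid difference.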

Training:
{
  For each important term in the document {
    For m = 0 to MaxPrefix {
      For n = 0 to MaxPostfix {
        Extract the m words before the important term and the n words after the important term as s;
        Add 1 to ImportantContext(m,n,s);
      }
    }
  }
  For each unimportant term in the document {
    For m = 0 to MaxPrefix {
      For n = 0 to MaxPostfix {
        Extract the m words before the unimportant term and the n words after the unimportant term as s;
        Add 1 to UnimportantContext(m,n,s);
      }
    }
  }
}
For m = 0 to MaxPrefix {
  For n = 0 to MaxPostfix {
    For each value of s {
      Set the weight for s to a function of ImportantContext(m,n,s), and UnimportantContext(m,n,s);
    }
  }
}

Context Pattern:
Weight(m, n, s)=Log(ImportantContext(m, n, s)+1)-Log(UnimportantContext(m, n, s)+1).
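
The training loop and the weight formula can be sketched together in Python. This is my reading of the pseudocode, not Google's implementation: documents are assumed to be pre-tokenized word lists, and which terms count as important or unimportant is assumed to be given as two sets.

```python
import math
from collections import Counter


def context(tokens, i, m, n):
    """The m words before and the n words after position i, as the pattern s."""
    return (tuple(tokens[max(0, i - m):i]), tuple(tokens[i + 1:i + 1 + n]))


def train(documents, important, unimportant, max_prefix=2, max_postfix=2):
    """Count contexts around important/unimportant terms, then weight them."""
    imp, unimp = Counter(), Counter()
    for tokens in documents:
        for i, token in enumerate(tokens):
            if token in important:
                counts = imp
            elif token in unimportant:
                counts = unimp
            else:
                continue
            for m in range(max_prefix + 1):
                for n in range(max_postfix + 1):
                    counts[(m, n, context(tokens, i, m, n))] += 1
    # Weight(m, n, s) = Log(ImportantContext + 1) - Log(UnimportantContext + 1)
    return {
        key: math.log(imp[key] + 1) - math.log(unimp[key] + 1)
        for key in set(imp) | set(unimp)
    }


# Hypothetical training data
documents = [
    ["buy", "cheap", "blackberry", "phone", "today"],
    ["all", "rights", "reserved", "copyright", "notice"],
]
weights = train(documents, important={"blackberry"}, unimportant={"copyright"})
print(weights[(1, 1, (("cheap",), ("phone",)))])
```

Contexts seen only around important terms get a positive weight, contexts seen only around unimportant terms get a negative one, and a context seen equally often around both (such as the empty context with m = 0, n = 0 here) ends up at zero.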

Generic Score of Document:
GenericScore=QueryScore*PageRank.

Personalized Score:
PersonalizedScore=GenericScore*(TermScore+CategoryScore+LinkScore).

Final Score:
FinalScore=ProfileScore*ProfileConfidence+GenericScore*(1-ProfileConfidence).
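
Chaining the three formulas together gives a quick worked example. I am assuming that ProfileScore in the final formula is the PersonalizedScore from the previous one, and every input number below is invented.

```python
def generic_score(query_score, pagerank):
    """GenericScore = QueryScore * PageRank."""
    return query_score * pagerank


def personalized_score(generic, term_score, category_score, link_score):
    """PersonalizedScore = GenericScore * (TermScore + CategoryScore + LinkScore)."""
    return generic * (term_score + category_score + link_score)


def final_score(profile_score, generic, profile_confidence):
    """Blend the two scores; confidence is expected to be between 0 and 1."""
    return profile_score * profile_confidence + generic * (1 - profile_confidence)


# Hypothetical inputs
generic = generic_score(query_score=0.8, pagerank=0.5)
personalized = personalized_score(generic, term_score=1.0, category_score=0.5, link_score=0.5)

for confidence in (0.0, 0.25, 1.0):
    print(confidence, final_score(personalized, generic, confidence))
```

With a confidence of 0 the final score is just the generic score, and with a confidence of 1 it is fully personalized, so a thin or unreliable profile falls back to ordinary ranking.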

Conclusion

Google talks about other uses for this personalization system outside of search engine systems.

For instance, in an email system or in virtually any other system for providing services via the Internet or other wide area network that displays a document or other content to a user or subscriber, placed content may be also be selected and displayed to the user. The placed content may be selected based on the keywords associated with the placed content matching the content of a displayed document or set of documents, or it may be based on the other selection criteria. The selected placed content items are then ordered based on similarity of the user profile and profiles of the selected placed content items, as described above.
