I came across this statement on the Web earlier this week, wondered about it, and decided to investigate more: "If there are multiple instances of the same document on the web, the highest authority URL becomes the canonical version. The rest are considered duplicates." ~ Link inversion, the least known major ranking factor. I read that article from Dejan SEO and thought it was worth exploring further. While looking through Google patents that include the word "Authority," I found this patent, which doesn't quite say the same thing that Dejan does, but is interesting in that it finds ways to distinguish between duplicate pages on different domains based upon priority rules. That is useful for determining which duplicate page might be the highest authority URL for a document. The patent is:
Identifying a primary version of a document
Since the claims of a patent are what patent examiners at the USPTO look at when they are prosecuting a patent and deciding whether or not it should be granted, I thought it would be worth looking at the claims contained within the patent to see if they help encapsulate what it covers. The first one captures some aspects worth thinking about when talking about different versions of particular documents, and how the metadata associated with a document might be examined to determine which is the primary version of a document:
This doesn't advance the claim that the primary version of a document is considered the canonical version of that document, with all links pointed at the duplicates redirected to the primary version. There is another patent, sharing an inventor with this one, that refers to one of the duplicate-content URLs being chosen as a representative page, though it doesn't use the word "canonical." From that patent: Duplicate documents, sharing the same content, are identified by a web crawler system. Upon receiving a newly crawled document, a set of previously crawled documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions. In some embodiments, a method for selecting a representative document from a set of duplicate documents includes: selecting a first document in a plurality of documents on the basis that the first document is associated with a query-independent score, where each respective document in the plurality of documents has a fingerprint that identifies the content of the respective document, the fingerprint of each respective document in the plurality of documents indicating that each respective document in the plurality of documents has substantially identical content to every other document in the plurality of documents, and a first document in the plurality of documents is associated with the query-independent score.
The method further includes indexing, in accordance with the query-independent score, the first document, thereby producing an indexed first document; and with respect to the plurality of documents, including only the indexed first document in a document index. This other patent is: Representative document selection for a set of duplicate documents
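The selection process that second patent describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's implementation: the keys `url`, `content`, and `score` are hypothetical, SHA-256 stands in for whatever content fingerprint the crawler actually uses, and `score` stands in for an unspecified query-independent metric (something like PageRank). Documents are grouped by fingerprint, and only the highest-scoring member of each duplicate set is kept for the index.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Content fingerprint: documents with identical content get identical fingerprints."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def select_representatives(documents):
    """Group documents into duplicate sets by fingerprint, then keep only the
    document with the highest query-independent score from each set."""
    duplicate_sets = {}
    for doc in documents:
        duplicate_sets.setdefault(fingerprint(doc["content"]), []).append(doc)

    index = []
    for dup_set in duplicate_sets.values():
        # The representative is the member with the best query-independent score.
        representative = max(dup_set, key=lambda d: d["score"])
        index.append(representative)
    return index
```

So if two pages on different domains carry substantially identical content, only the one with the better query-independent score would make it into the index under this scheme.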
Regardless of whether the primary version of a set of duplicate documents is treated as the representative document as suggested in this second patent (whatever that may mean exactly), I think it's important to get a better understanding of what a primary version of a document might be. The primary version patent provides some reasons why one of the versions might be considered the primary version, among them: (1) including different versions of the same document does not provide additional useful information, and it does not benefit users. For reasons like these, this duplicate-document patent says it is ideal to identify a primary version from the different versions of a document that appear on the Web. The search engine also wants to furnish "the most appropriate and reliable search result."

How does it work?

The patent tells us that one method of identifying a primary version works as follows. The different versions of a document are identified from a number of different sources, such as online databases, websites, and library data systems. For each document version, a priority of authority is selected based on: (1) the metadata information associated with the document version.
(2) As a second step, the document versions are then checked for length qualification using a length measure. The version with a high priority of authority and a qualified length is deemed the primary version of the document. If none of the document versions has both a high priority and a qualified length, then the primary version is selected based on the totality of information associated with each document version. The patent tells us that scholarly works are the kind of documents that tend to go through the process in this patent.
The patent lists several kinds of metadata that might be looked at during this process.
The patent goes into more depth about the methodology behind determining the primary version of a document.
The patent includes a table illustrating the source-priority list, and it describes some alternative approaches as well. It tells us that "the priority measure for determining whether a document version has a qualified priority can be based on a qualified priority value."
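The two-step method described above can also be sketched in Python. Everything concrete here is an assumption made for illustration: the source names and priority numbers in `SOURCE_PRIORITY` are a made-up stand-in for the patent's source-priority list, and `MIN_QUALIFIED_LENGTH` and `QUALIFIED_PRIORITY` are invented thresholds, since the patent does not publish its actual values.

```python
# Hypothetical source-priority list: lower number = higher priority of authority.
SOURCE_PRIORITY = {
    "publisher": 1,
    "aggregator": 2,
    "library": 3,
    "unknown": 9,
}
MIN_QUALIFIED_LENGTH = 500  # assumed length threshold, in characters
QUALIFIED_PRIORITY = 2      # assumed cutoff for a "qualified" priority


def select_primary(versions):
    """Sketch of the patent's two-step selection:
    1. each version gets a priority of authority based on its source;
    2. versions are checked for length qualification, and a version with both
       a high (qualified) priority and a qualified length is deemed primary.
    If no version qualifies on both counts, fall back to ranking on the
    totality of information (here simply priority, then length).

    `versions` is a list of dicts with hypothetical keys 'url', 'source', 'length'.
    """
    def priority(v):
        return SOURCE_PRIORITY.get(v["source"], SOURCE_PRIORITY["unknown"])

    qualified = [
        v for v in versions
        if priority(v) <= QUALIFIED_PRIORITY and v["length"] >= MIN_QUALIFIED_LENGTH
    ]
    if qualified:
        return min(qualified, key=priority)
    # Fallback: best combination of priority and length among all versions.
    return min(versions, key=lambda v: (priority(v), -v["length"]))
```

In this sketch a publisher's full-length copy beats a longer library copy because its source priority is higher, which matches the patent's emphasis on authority first and length second.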
Takeaways

Within the last couple of years I was in a Google Hangout on Air where a number of other SEOs (Ammon Johns, Eric Enge, Jennifer Slegg) and I asked John Mueller and Andrey Lipattsev some questions about duplicate content. It seems to be something that still raises questions among SEOs. The patent goes into more detail about determining which duplicate document might be the primary document. We can't tell whether that primary document might be treated as if it is at the canonical URL for all of the duplicate documents, as suggested in the Dejan SEO article I linked to at the start of this post, but it is interesting to see that Google has a way of deciding which version of a document might be the primary version. I didn't go into much depth about qualified lengths being used to help identify the primary document, but the patent does spend some time going over that. Is this a little-known ranking factor? The Google patent on identifying a primary version of duplicate documents does seem to find some importance in identifying what it believes to be the most important version among many duplicate documents. I'm not sure there is anything here that most site owners can use to help their pages rank higher in search results, but it's good to see that Google may have explored this topic in more depth.

Copyright © 2018 SEO by the Sea. The post How Google might Identify Primary Versions of Duplicate Pages appeared first on SEO by the Sea.