Sequence Searches: How to interpret claim language?

In the previous blog post in this series, we examined the key challenges and best practices for a successful sequence search. These best practices include selecting the right database and search tools, determining the correct algorithm and setting its parameters, reinforcing the search strategy with keyword-based searches, conducting a protein search instead of a nucleotide search, and having access to up-to-date data. It’s essential that searchers working with biological sequence intellectual property clearly understand the subtleties of sequence claims and effectively analyze various algorithms while conducting sequence searches. The question we must address now is: “How do we interpret claim language?”

Analyzing genus claims for biological sequences

To ensure that genetic inventions are properly protected, the claim must cover the genetic sequence literally described in the patent description (the formal requirement), as well as a set of sequences (the genus) that are related to it, even remotely, in terms of structure and/or biological activity (Dufresne & Duval, 2004). The main types of genus claims for a set of sequences are the Markush formula, the percentage of identity, the percentage of similarity, and specific positions and types of substitutions.

Identity and Similarity Percentage Analysis 

Identity means that two sequences have the same base or amino acid (an exact match, with no substitution or mutation) at an equivalent position in an optimal alignment. It can be quantified and normalized, and it is reported as a percentage.

Similarity is also expressed as a percentage and is computed by counting all identical positions plus all favorable substitutions, for example a K-to-R replacement (both are basic amino acids) or a purine-purine substitution. The similarity between amino acids can be defined either by their chemical properties or by a substitution matrix such as PAM. The Basic Local Alignment Search Tool (BLAST) is the most widely used similarity searching program, while FASTA, SSEARCH, and other commonly used programs also produce accurate statistical estimates that can be used to calculate the amount of similarity between two sequences.
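As a rough illustration of how the two percentages differ, the Python sketch below counts identical and similar columns in a pair of sequences that have already been aligned. The residue groupings, the function name, and the use of the alignment length as the denominator are our assumptions for illustration only; real tools score similarity with a substitution matrix and may normalize differently.

```python
# Illustrative groups of chemically similar amino acids (an assumption for
# this sketch, not a PAM or BLOSUM matrix).
SIMILAR_GROUPS = [set("KRH"), set("DE"), set("ST"), set("NQ"),
                  set("AVLIM"), set("FYW")]


def identity_and_similarity(aligned_a: str, aligned_b: str) -> tuple[float, float]:
    """Return (% identity, % similarity) for two pre-aligned sequences.

    '-' marks a gap. Both percentages are normalized by the number of
    alignment columns, gaps included (one of several possible conventions).
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
            similar += 1  # identical positions also count toward similarity
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1  # favorable substitution, e.g. K <-> R
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * similar / columns


# Toy alignment: 7 identical columns, 2 favorable substitutions (K/R, I/L), 1 gap.
print(identity_and_similarity("MKTAYIAKQR", "MRTAYLAKQ-"))  # (70.0, 90.0)
```

The same pair of sequences scores 70% identity but 90% similarity, which is why a claim threshold must always be read together with the measure it names.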

In our example (Table 1), Claim 1 encompasses nucleic acids that encode the polypeptide of SEQ ID NO: 1, as well as those that encode any polypeptide having at least 85% structural identity to SEQ ID NO: 1. There are no guidelines regarding which 15% of the amino acids may vary from SEQ ID NO: 1, and there is no functional limitation on the nucleic acids of Claim 1 other than that they encode the polypeptide of SEQ ID NO: 1 or any polypeptide having at least 85% structural identity to it.

Therefore, hits with 85% or greater identity should be relevant. However, what if the query sequence comprises SEQ ID NO: 1 as part of a much longer sequence? The alignment and the subject percentage identity show that this hit is significant, but the long query sequence gives a low query percentage identity (Fig. 1).

For a hit of this nature, the subject percentage identity is of prime importance: it identifies a full-length hit to SEQ ID NO: 1 as significant, regardless of the length of the query sequence.

Figure 1: Percentage identity calculation
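To make the difference between the normalizations concrete, here is a minimal Python sketch. The function name, the dictionary keys, and the sequence lengths are hypothetical, and search platforms may define and label these percentages differently, so treat the arithmetic as an illustration rather than any tool's exact formula.

```python
def percent_identities(matches: int, query_len: int,
                       subject_len: int, alignment_len: int) -> dict[str, float]:
    """Normalize one count of identical positions in three different ways."""
    return {
        "query": 100.0 * matches / query_len,          # relative to the full query
        "subject": 100.0 * matches / subject_len,      # relative to the full subject (hit)
        "alignment": 100.0 * matches / alignment_len,  # relative to the aligned region
    }


# Hypothetical case: a 1,000-residue query contains a 200-residue stretch that
# matches a 200-residue subject exactly.
result = percent_identities(matches=200, query_len=1000,
                            subject_len=200, alignment_len=200)
print(result)  # {'query': 20.0, 'subject': 100.0, 'alignment': 100.0}
```

A filter applied to the query percentage alone would discard this hit at an 85% threshold, even though the subject is matched over its full length.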

Markush formula

Sequence claims, just like chemical structure claims, may be written as Markush structures, in which the variable positions are represented by Xaa and their permitted residues are described in words. It’s impractical for a searcher to write out every combination as an explicit query sequence. To address this problem, tools like GenomeQuest and STN have algorithms that allow a Markush sequence to be written a single time with its variations included. The wild-type hits can then be removed, so that the remaining results comprise all possible variations within these constraints. However, what if the sequence of interest is included in a long list of variations that was not indexed? For this reason, it’s important to include keyword-based searches, such as Asp76Glu or D76Q, in any variant sequence IP analysis.
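The idea of writing a Markush sequence once, instead of listing every combination, can be sketched in Python with a regular expression. This is only an illustration using general formula (I) from Table 1; it is not how GenomeQuest or STN implement their searches, the helper names are ours, and D-Cpa is omitted because it has no standard one-letter code.

```python
import re
from itertools import product

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# General formula (I) from Table 1: one list of allowed residues per position.
# Assumption: D-Cpa is left out of Xaa2 (no standard one-letter code).
FORMULA_I = [
    ["His", "Thr"], ["Ala", "Gly"], ["His"], ["Gln", "Asn", "Pro"],
    ["Pro"], ["Gly"], ["Ser"], ["Phe"], ["Ser"], ["Asp"], ["Glu"],
    ["Gly"], ["Asp"], ["Trp"], ["Leu"],
]


def markush_regex(formula: list[list[str]]) -> re.Pattern:
    """Compile a single pattern that covers every permitted combination."""
    parts = []
    for options in formula:
        letters = "".join(THREE_TO_ONE[name] for name in options)
        parts.append(letters if len(letters) == 1 else f"[{letters}]")
    return re.compile("".join(parts))


def expand(formula: list[list[str]]) -> list[str]:
    """Enumerate every explicit sequence, for comparison with the pattern."""
    return ["".join(THREE_TO_ONE[name] for name in combo)
            for combo in product(*formula)]


pattern = markush_regex(FORMULA_I)
print(pattern.pattern)         # [HT][AG]H[QNP]PGSFSDEGDWL
print(len(expand(FORMULA_I)))  # 12
print(bool(pattern.fullmatch("TGHNPGSFSDEGDWL")))  # True
```

Even this short formula expands to 12 explicit sequences; with more variable positions the count multiplies quickly, which is why a single pattern is the practical way to express the query.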

Table 1. Strategies used for Genus Claims

| Type | Biological sequences genus claim | Example |
| --- | --- | --- |
| Percentage of identity | A threshold value for a percentage of identity with the specified sequence | An isolated nucleic acid that encodes a polypeptide with at least 85% amino acid sequence identity to SEQ ID NO: 1. |
| Markush formulae | A list of alternatively functioning biological sequences | A peptide according to any of the preceding claims, characterized in that it corresponds to the general formula (I): Xaa1 Xaa2 His Xaa4 Pro Gly Ser Phe Ser Asp Glu Gly Asp Trp Leu; wherein Xaa1 is His or Thr; Xaa2 is Ala, Gly or D-Cpa (4-chloro-Phe); and Xaa4 is Gln, Asn or Pro. |
| Percentage of similarity | A threshold value for a percentage of similarity with the specified sequence | An isolated variant of a protein comprising the amino acid sequence shown in SEQ ID NO:1, wherein the variant comprises an amino acid sequence that is at least 95% similar to SEQ ID NO:1. |
| Variation in specified position | A multiple position variation relative to the specified sequence | A phytase characterized in that it comprises at least one alteration and no more than 4 alterations as compared to SEQ ID NO:2 wherein at least one of said one to four alterations is selected from the following: N4P, N31C, W46E, K107G, Q111P, E119K, S162C, D202N, Q223E, E241Q, M273L, T276K, N286Q, I362K,R, I379K, N385D, G52C/A99C, G59C/F100C, Q111P/E241Q, K141C/V199C, S162C/S247C, N31C/T177C and W46C/Q91C, and wherein the phytase has an improved thermostability compared to SEQ ID NO:2. |
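The "variation in specified position" claim type in Table 1 limits both the number of alterations and which ones count. The Python sketch below shows one way a hit could be screened against that kind of language. The toy sequences and function names are hypothetical, only a few single substitutions from the example claim are listed, and insertions, deletions, paired alterations, and the thermostability limitation are not handled.

```python
def list_substitutions(reference: str, variant: str) -> list[str]:
    """Describe each differing position as e.g. 'N4P' (1-based numbering)."""
    if len(reference) != len(variant):
        raise ValueError("this sketch only handles equal-length sequences")
    return [f"{ref}{pos}{var}"
            for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
            if ref != var]


def meets_position_claim(found: list[str], listed: set[str],
                         max_alterations: int = 4) -> bool:
    """True if there are 1..max_alterations changes and at least one is listed."""
    return (1 <= len(found) <= max_alterations
            and any(sub in listed for sub in found))


# A few of the single substitutions named in the example claim.
LISTED = {"N4P", "N31C", "W46E"}

reference = "MKLNAWTG"  # toy stand-in for the claimed reference sequence
variant = "MKLPAWTG"    # differs only at position 4 (N -> P)

found = list_substitutions(reference, variant)
print(found)                                # ['N4P']
print(meets_position_claim(found, LISTED))  # True
```

Reducing each hit to a list of substitutions in this notation also produces the keywords (such as N4P) that can be fed into a text search alongside the sequence search.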

 

Optimizing search platforms and algorithms

Different algorithms handle matching differently, so the percent identity reported for the same pair of sequences depends on the algorithm chosen and on its parameters. When searching for a biological sequence, the searcher must choose the algorithm, the substitution matrix, and exact versus similarity searching, and then optimize the parameters for that choice. These settings affect not only the percent identity but also the type of sequence the search may find (or miss). Alignments created with the same algorithm, but different parameters, often give different results.
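The effect of a single parameter can be seen in a small Python sketch of global alignment. This is a simplified illustration under our own assumptions: it uses one linear gap penalty and a plain match/mismatch score rather than separate gap-open and gap-extend penalties or a substitution matrix, and it is not the algorithm of any particular search platform.

```python
def global_alignment_identity(a: str, b: str, match: float = 1.0,
                              mismatch: float = -1.0,
                              gap: float = -10.0) -> tuple[float, float]:
    """Needleman-Wunsch style global alignment with a linear gap penalty.

    Returns (best score, % identity over alignment columns). Each cell stores
    (score, identical columns, total columns); ties in score are broken in
    favor of the alignment with more identical columns.
    """
    prev = [(j * gap, 0, j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [(i * gap, 0, i)]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            s, m, n = prev[j - 1]
            diagonal = (s + (match if same else mismatch), m + int(same), n + 1)
            s, m, n = prev[j]
            gap_in_b = (s + gap, m, n + 1)  # residue of a aligned to a gap
            s, m, n = curr[j - 1]
            gap_in_a = (s + gap, m, n + 1)  # residue of b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    score, matches, columns = prev[-1]
    return score, 100.0 * matches / columns


query, subject = "ACGTTGCA", "CGTTGCAA"  # toy sequences, offset by one base

_, strict = global_alignment_identity(query, subject, gap=-10.0)
_, relaxed = global_alignment_identity(query, subject, gap=-0.5)
print(round(strict, 1), round(relaxed, 1))  # 25.0 77.8
```

With the heavy gap penalty the two sequences are forced into an ungapped alignment and share only 2 of 8 positions; with the light penalty a gap at each end lets 7 of 9 columns match. Nothing about the sequences changed, only the parameter.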

Example: US 13/382,953

During a prior art search, a standard USPTO search was conducted with the gap penalty set at 10 and the extend penalty set at 1.0. At these settings, no prior art was identified. However, when the parameters were changed to a gap penalty of 0.1 and an extend penalty of 0.1, a new prior art reference, Samulski et al. (U.S. Patent No. 8,198,421), was identified, and this led to a new ground of rejection. Samulski discloses at least a nucleotide sequence for a modified and optimized FVIII gene comprising the sequence of SEQ ID NO: 12, which has 77.2% sequence identity to SEQ ID NO: 1 of the current application. As a result, the examiner issued a non-final rejection in which claims 1, 7-8, 11 and 14 were rejected as allegedly being anticipated by Samulski et al. In response, claim 1 was amended to be directed to an isolated nucleic acid molecule comprising a nucleotide sequence having at least 80% homology to the nucleotide sequence of SEQ ID NO: 1.

Table 2. Sequence Search Platforms & Algorithms

| Platform | Algorithm | Key Features |
| --- | --- | --- |
| GenomeQuest | GenePast; Motifs; Fragments; Multiple sequence searches; BLAST | Adapted algorithm (optimized BLAST); CDR antibody search using venn diagram; Proprietary algorithm for short seq (< 25 residues) |
| STN | Motif; Subsequence/exact match; BLAST | Uncommon amino acid; Combination with CAS no and controlled terms; Low complexity region filter |
| NCBI | BLAST | Open access; Cross linked with Pubmed |
| Patentlens | | |

 

Conclusion

Understanding the subtleties of various algorithms and their parameters, as well as the different percent identities they report, is critical in assessing biological sequence IP. In addition, searchers should carefully assess claim construction and interpretation. In our next blog, we’ll compare how patent offices in the European Union, the United States and Japan handle the prosecution of genus claims for biological sequences.


Priyanka Paul

Priyanka Paul is an associate vice president with the IP and R&D Solutions team at Evalueserve, leading a team of Life Sciences and Healthcare professionals. Her expertise spans the entire lifecycle of research: understanding client business needs; gathering, analyzing, and presenting quantitative and qualitative data to deliver strategic insights; and collaboratively developing innovative solutions to business challenges that improve life sciences and healthcare outcomes.
An avid traveller, Priyanka loves going on long drives, visiting new places, and enjoying local delicacies. Equally enthusiastic about exploring the world of IP research, Priyanka looks forward to using the Information Adventurers blog to share ideas and to invite readers to join the conversation with their own experiences, learnings, and perspectives on this topic.

Sudhanshu Sekhar Das
Manager, IP and R&D Solutions

Sudhanshu Das is a manager with the IP and R&D Solutions team at Evalueserve, where he supervises IP projects in various technical domains and provides end-to-end IP support to global leaders, law firms, and innovative companies. With a PhD degree, Sudhanshu has always been captivated by research, as it involves the spirit of innovation and, most importantly, the freedom to think independently. Sudhanshu's areas of expertise include gene therapy, next-generation sequencing, sequence searching, antibodies, biochemistry, vaccines, fermentation and more.
