Best Practices for Sequence Searches: Designing a high quality and reliable search process

Priyanka Paul

In her previous blog posts about Freedom to Operate (FTO) Searches, Priyanka Paul described how an experienced searcher blends the right skills, tools, and techniques to avoid potential FTO search pitfalls. These blog posts explained the key parameters that searchers need to consider to ensure an effective FTO search. We all work under various constraints for timelines, budget, databases and skills. Therefore, the balancing act of these constraints becomes very important.

This balancing act is especially difficult in sequence searches. In biotech research, searchers must overcome specific challenges, roadblocks and complexities when conducting a biological sequence search. This blog post describes these challenges and provides some best practices for a successful sequence search.

Sequence Search – How is it different from a regular search?

It’s different. It’s complex. It’s difficult.

Conducting a sequence-based search is like a treasure hunt. We know the treasure (i.e., relevant hits) lies somewhere out there, but the depth and breadth of the search universe can make the search overwhelming and often lead to disappointment. Therefore, it’s important to have the right experience, expertise and tools to get this right.

A sequence search uses a specific sequence (nucleic acid or amino acid) as the input query to a database in order to identify the records that match that query. A query may range from about 10 bases for primers or probes to more than 1,000 bases (for large genes) or amino acid residues (for large proteins).

Due to the complexities of this search process and the retrieved dataset, the key question is: “How can we gather reliable and high-quality data in sequence searches?”

A new sequence search request – Let’s get into the searcher’s mind

As we put ourselves in the searcher’s mindset, there are a number of questions swirling around. To begin, how should we design the search query: search protein or DNA or both? Should the scope be wide or narrow? Which database and tool should we use? What parameters do we select? What algorithms do we need to consider? It can be overwhelming to consider all of the potential questions and factors.

The advent of genomics has created a logistical challenge for patent searches. From creating search queries, through retrieval and mapping/alignment of sequences, to data presentation, sequence searching is complex from start to finish, because an element of uncertainty is present at every step. Moving forward with a research project without properly searching, evaluating, and analyzing sequences can prove costly in the long run.

At Evalueserve, we follow a robust search process which starts with data collection, followed by data processing. By using various tools and algorithms, we can optimize the data of search results to increase precision. The final stage of this search process, and the most critical, is to present the findings in an intuitive reporting format (Figure 1).

This is where experience and understanding of the domain, selection of databases and fine-tuning the algorithms come in handy.

Figure 1: Sequence search workflow depicting the expansion of coverage and precision achieved through optimization of multiple parameters

Five Best Practices for Sequence Searches

Most experienced searchers use at least some of these five best practices to obtain targeted and accurate results:

1. Selecting the right database and search tool

A patent search is all about utilizing patent databases or search tools (patent office sites, free providers, or paid services) to achieve the maximum potential. For bio-sequence searches, this selection is even more critical, as getting a comprehensive base file is difficult due to (a) availability of data – wherein the source could be patent text, drawings, tables, or attachments, and (b) lack of data standardization and indexing – especially in case of sequence variants. The more familiar the searcher is with the database, the better the output will be.

Most of these databases have integrated specialized patent search tools. Their coverage, purpose, and price vary widely. Searchers must evaluate several factors prior to executing a sequence search. Recommending one engine over another would be incorrect, as the outcome depends upon the objective of the search at hand. For a user, the key selection criteria might be data coverage, algorithm, delivery format (document retrieval), import and export functions, and ease of use.

2. Selecting the algorithm and setting the parameters

When conducting a sequence search, the search universe is enormous. By using various databases, the searcher can be confident of coverage. However, the enormity of data to be analyzed can make the analysis activity a daunting task.

A good search requires selecting the right parameters. Each database has its own syntax for creating the search query, which often makes this a complicated step. The query length and the extent of coverage (closest or divergent matches) play a critical role in selecting parameters (e.g. matrices, gap penalties, e-value). For example, we would look for the following parameters in each of these searches:

Identity search: Exact, Sub-sequence, Motif, Uncommon sequence
Similarity search (percent of sequence identity): BLAST
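
The identity search modes listed above can be illustrated with a minimal Python sketch. It is not the syntax of any particular database; the function names and the convention that "X" stands for any residue in a motif are assumptions made for this example only.

```python
import re


def exact_match(query: str, subject: str) -> bool:
    """Exact search: the subject is identical to the query (case-insensitive)."""
    return query.upper() == subject.upper()


def subsequence_positions(query: str, subject: str) -> list:
    """Sub-sequence search: 0-based start positions where the query occurs
    inside the subject. Overlapping occurrences are reported."""
    query, subject = query.upper(), subject.upper()
    positions = []
    start = subject.find(query)
    while start != -1:
        positions.append(start)
        start = subject.find(query, start + 1)
    return positions


def motif_positions(pattern: str, subject: str) -> list:
    """Motif search: 'X' in the pattern matches any single residue
    (an assumed convention). The pattern is expected to contain letters only."""
    regex = re.compile("(?=(" + pattern.upper().replace("X", ".") + "))")
    # The lookahead lets overlapping motif occurrences be found as well.
    return [m.start() for m in regex.finditer(subject.upper())]


if __name__ == "__main__":
    print(exact_match("acgt", "ACGT"))
    print(subsequence_positions("ATGC", "ATGCGATGCA"))
    print(motif_positions("CXXC", "ACDECAACFGC"))
```

Commercial and public sequence databases implement these modes with their own indexing and query syntax, so this sketch only shows what each mode is asking for, not how a given tool answers it.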

It is very important for a searcher to carefully tackle sequences with low-complexity regions (e.g. poly-A tails or proline-rich regions), as they can lead to incorrect or biased output. Unlike other types of searches, a progressive search strategy is useful in sequence searches as it aims to address the objective in a structured manner. In this case, the searcher would start with 100% identity and drop this percentage for shorter sequences. As we scope the project, the searcher must consider assembling multiple search concepts. For example, for an antibody-based sequence search, these concepts might include the variable region (VH and VL) peptide sequences, the DNA sequence, and the CDR regions.
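
Two of the ideas above, flagging low-complexity regions and relaxing the identity cutoff step by step, can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used by BLAST or by any specific database: the window size, the entropy cutoff, the threshold ladder and the ungapped comparison are all choices made for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(window).values())


def low_complexity_windows(seq, window=12, min_entropy=1.0):
    """Start positions of windows whose entropy is below min_entropy.
    Window size and cutoff are assumed values for illustration."""
    seq = seq.upper()
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) < min_entropy]


def best_ungapped_identity(query, subject):
    """Highest percent identity of the query against any window of the
    subject with the same length (no gaps, a deliberate simplification)."""
    query, subject = query.upper(), subject.upper()
    best = 0.0
    for i in range(len(subject) - len(query) + 1):
        matches = sum(a == b for a, b in zip(query, subject[i:i + len(query)]))
        best = max(best, 100.0 * matches / len(query))
    return best


def progressive_search(query, subjects, thresholds=(100, 95, 90, 80)):
    """Start at 100% identity and relax the cutoff until hits appear.
    Returns (cutoff used, sorted hit names), or (None, []) if nothing matches."""
    scored = {name: best_ungapped_identity(query, s)
              for name, s in subjects.items()}
    for cutoff in thresholds:
        hits = sorted(name for name, pid in scored.items() if pid >= cutoff)
        if hits:
            return cutoff, hits
    return None, []


if __name__ == "__main__":
    # Made-up sequences, used only to exercise the functions.
    subjects = {"hitA": "TTACGTACGTTCGG", "hitB": "GGGGGGGGGGGG"}
    print(progressive_search("ACGTACGTAC", subjects))
    print(low_complexity_windows("AAAAAAAAAAAA"))
```

In practice a searcher would rely on the alignment and masking options of the chosen tool. The point of the sketch is the order of operations: check the query for low-complexity stretches first, then widen the identity cutoff only as far as the objective requires.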

3. Reinforcing the search strategy by including keyword-based searches

Often, in the process of evaluating the search results, the searcher must streamline the evaluation by running supplementary keyword-based searches. The keywords can be drawn from:

features of the inventive concept
names of proteins and genes or their biological targets, etc.
the functional aspects of the biomolecule
application areas

This step is also extremely important when the sequence search yields a huge data set that is difficult to evaluate. At this point, the searcher must implement a logical process to filter the data set down to a focused set of records for evaluation.
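
As a rough illustration of this filtering step, the Python sketch below narrows a set of sequence hits to those whose text mentions at least one keyword. The `Hit` record, its fields and the sample records are hypothetical and invented for the example; real database exports will have their own structure.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical sequence-search result record (fields are assumptions)."""
    publication: str   # placeholder identifier, not a real patent number
    identity: float    # percent identity reported for the hit
    text: str          # title, abstract or claims text to scan for keywords


def keyword_filter(hits, keywords, min_identity=0.0):
    """Keep hits that meet the identity cutoff and mention any keyword."""
    terms = [k.lower() for k in keywords]
    return [h for h in hits
            if h.identity >= min_identity
            and any(t in h.text.lower() for t in terms)]


if __name__ == "__main__":
    # Invented records, used only to show the filter at work.
    hits = [
        Hit("DOC-1", 98.0, "Antibody binding to a tumour antigen"),
        Hit("DOC-2", 99.0, "Primer set for plant genotyping"),
        Hit("DOC-3", 72.0, "Therapeutic antibody formulation"),
    ]
    focused = keyword_filter(hits, ["antibody", "CDR"], min_identity=90.0)
    print([h.publication for h in focused])
```

Simple substring matching is the most basic option. A real workflow would more likely use the database's own keyword and classification operators, with synonyms for protein and gene names.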

4. Protein search over nucleotide search

Searching for both amino acid and nucleotide sequences helps to ensure a comprehensive search. However, an amino acid sequence search is more sensitive than a nucleotide sequence search, and is therefore considered more effective. Sensitivity for nucleotide searches decreases primarily due to the presence of non-coding regions or silent mutations, which lead to non-relevant hits in the dataset. For related sequences, the percentage sequence identity (or homology alignment score) is typically lower at the nucleotide level than at the protein level. Therefore, a protein search is recommended over a nucleotide search when looking for related sequences in various search databases.
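
The effect of silent mutations can be shown with a small Python sketch. It uses the standard genetic code and two made-up coding sequences that differ only in synonymous codons; the sequences and function names are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence in frame 1; trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a: str, b: str) -> float:
    """Position-by-position identity of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Two invented sequences that encode the same peptide with different codons.
    seq_1 = "ATGGCTCTGAGCAAA"
    seq_2 = "ATGGCCTTATCTAAG"
    print("nucleotide identity:", round(percent_identity(seq_1, seq_2), 1))
    print("protein identity:",
          percent_identity(translate(seq_1), translate(seq_2)))
```

Here the two DNA sequences share only 8 of 15 positions (about 53%), yet both translate to the same peptide, MALSK, so the protein-level identity is 100%. A nucleotide query with a high identity cutoff would miss such a record, while a protein query would retrieve it.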

5. Access to up-to-date data

Any search output is only as good as the input that is processed. Analyzing outdated data will result in an incomplete assessment. Genome sequencing is a dynamic subject. Every day, the data universe keeps expanding as additional data sources and new scientific and patent literature are added. The higher the number of sequences involved in the search process, the higher the risk of an incomplete assessment. Thus, it is important to have regular validations and updates to enhance the robustness of the search process.

Sequence Search – Can we do it?

It’s doable! It’s manageable! It’s reliable!

Sequence searching is a very niche segment and forms an integral part of almost all spheres of the life-sciences domain, ranging from therapeutics and diagnostics to fundamental research.

An experienced searcher can add significant value in the search process, and this acts as a big differentiator. With the analytical skills and judgement of the searcher, the huge sea of database results is crystallized into actionable insights.

By leveraging these guidelines and establishing clear search objectives from the start (understanding the “what” of the search), searchers can achieve high levels of accuracy and recall. Relying only on databases (commercial or public) and tools to manipulate sequence data may not be enough.

Understanding the claim language is critical in assessing the relevancy of results. In fact, high-quality sequence search relies on the searcher’s claim interpretation skills. In our next blog, we’ll discuss how a searcher interprets claim language and effectively analyses various algorithms while conducting sequence searches.

Click the link to learn how to interpret claim language in Priyanka’s second blog post: Sequence Searches: How to Interpret Claim Language?

Priyanka Paul is an associate vice president with the IP and R&D Solutions at Evalueserve, leading a team of Life Sciences and Healthcare professionals. Her expertise spans the entire lifecycle of research: from understanding client business needs; gathering, analyzing, and presenting quantitative and qualitative data to deliver strategic insights; and collaboratively developing innovative solutions to meet business challenges that aid life sciences & healthcare outcomes.
An avid traveller, Priyanka loves going on long drives and visiting new places, and enjoying local delicacies. Equally enthusiastic to explore the world of IP research, Priyanka looks forward to using the Information Adventurers blog to share ideas and invite readers to join the conversation to share their experiences, learnings, and perspectives on this topic.
