Best Practices for Sequence Searches: Designing a high quality and reliable search process

In her previous blog posts about Freedom to Operate (FTO) searches, Priyanka Paul described how an experienced searcher blends the right skills, tools, and techniques to avoid potential FTO search pitfalls. These blog posts explained the key parameters that searchers need to consider to ensure an effective FTO search. We all work under constraints on timelines, budget, databases and skills, so balancing these constraints becomes very important.

This balancing act is especially difficult in sequence searches. In biotech research, searchers must overcome specific challenges, roadblocks and complexities when conducting a biological sequence search. This blog post describes these challenges and provides some best practices for a successful sequence search.

Sequence Search – How is it different from a regular search? 

It’s different. It’s complex. It’s difficult.  

Conducting a sequence-based search is like a treasure hunt. We know the treasure (i.e. the relevant hits) lies somewhere out there, but the depth and breadth of the search universe can make the search overwhelming, and it often ends in disappointment. Therefore, it's important to have the right experience, expertise and tools to get this right.

A sequence search uses a specific sequence (nucleic acid or amino acid) as the input query to a database in order to identify the records that match the query sequence. A sequence search query may range from about 10 base pairs for primers/probes to more than 1,000 base pairs for large genes (or the proteins they encode).

Due to the complexities of this search process and the retrieved dataset, the key question is: “How can we gather reliable and high-quality data in sequence searches?”

A new sequence search request – Let’s get into the searcher’s mind

As we put ourselves in the searcher’s mindset, there are a number of questions swirling around. To begin, how should we design the search query: should we search protein, DNA, or both? Should the scope be wide or narrow? Which database and tool should we use? What parameters do we select? What algorithms do we need to consider? It can be overwhelming to consider all of the potential questions and factors.

The advent of genomics has created a logistical challenge for patent searches. From creating search queries, through retrieval and mapping/alignment of sequences, to data presentation, the world of sequences is complex from start to end, as an element of uncertainty is present at every step. Moving forward with a research project without properly searching, evaluating, and analyzing sequences can prove to be costly in the long run.

At Evalueserve, we follow a robust search process which starts with data collection, followed by data processing. By using various tools and algorithms, we refine the search results to increase precision. The final stage of this search process, and the most critical, is to present the findings in an intuitive reporting format (Figure 1).

This is where experience and understanding of the domain, selection of databases and fine-tuning of the algorithms come in handy.

Figure 1: Sequence search workflow depicting the expansion of coverage and precision achieved through optimization of multiple parameters 

Five Best Practices for Sequence Searches

Most experienced searchers use at least some of these five best practices to obtain targeted and accurate results:  

1. Selecting the right database and search tool 

A patent search is all about using patent databases or search tools (patent office sites, free providers, or paid services) to their maximum potential. For bio-sequence searches, this selection is even more critical, as getting a comprehensive base file is difficult due to (a) availability of data, wherein the source could be patent text, drawings, tables, or attachments, and (b) lack of data standardization and indexing, especially in the case of sequence variants. The more familiar the searcher is with the database, the better the output will be.

Most of these databases have integrated specialized patent search tools. Their coverage, purpose, and price vary widely. Searchers must evaluate several factors prior to executing a sequence search. Recommending one engine over another would be incorrect, as the outcome depends upon the objective of the search at hand. For a user, the key selection criteria might be data coverage, algorithm, delivery format (document retrieval), import and export functions, and ease of use.

2. Selecting the algorithm and setting the parameters  

When conducting a sequence search, the search universe is enormous. By using various databases, the searcher can be confident of coverage. However, the enormity of data to be analyzed can make the analysis activity a daunting task.  
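One way to see why the size of the search universe matters is the expect value (E-value) that BLAST-style tools report: the number of hits with at least a given score that would be expected by chance. The sketch below uses the standard bit-score relationship E = m × n × 2^(−S′). It is a simplified illustration only: real tools use corrected "effective" lengths, and the function name and the numbers in the loop are our own assumptions.

```python
def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value from a bit score.

    Uses E = m * n * 2**(-S'), where m is the query length, n is the total
    database length and S' is the bit score. Real tools substitute
    'effective' lengths for m and n; that correction is omitted here.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same query and same score against two hypothetical database sizes:
    # the larger the database, the more chance hits are expected.
    for db_len in (10**6, 10**9):
        print(db_len, expected_chance_hits(bit_score=40.0, query_length=20, database_length=db_len))
```

The same alignment score is therefore less convincing in a larger database, which is one reason thresholds that work in a small collection may need tightening in a large one.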

A good search requires selecting the right parameters. Different databases use different syntaxes for creating the search query, which often makes this a complicated step. The query length and extent of coverage (closest or divergent) play a critical role in the parameter selection process (e.g. matrices, gap penalties, e-value). For example, we would look for the following parameters in each of these searches:

  • Identity search: Exact, Sub-sequence, Motif, Uncommon sequence 
  • Similarity search (percent of sequence identity): BLAST 
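To make the identity-search modes listed above concrete, here is a minimal Python sketch on plain strings. It is not how any particular patent sequence database implements these modes; the function names are ours, and the motif in the usage lines is the well-known N-glycosylation pattern, used purely as an illustration.

```python
import re


def is_exact_match(query: str, subject: str) -> bool:
    """Exact search: the whole subject equals the whole query (case-insensitive)."""
    return query.upper() == subject.upper()


def find_subsequence(query: str, subject: str) -> list[int]:
    """Sub-sequence search: 0-based start positions of the query inside the
    subject. Overlapping occurrences are reported."""
    q, s = query.upper(), subject.upper()
    positions = []
    start = s.find(q)
    while start != -1:
        positions.append(start)
        start = s.find(q, start + 1)
    return positions


def find_motif(pattern: str, subject: str) -> list[int]:
    """Motif search: 0-based start positions where a regular-expression
    pattern matches. The lookahead allows overlapping matches."""
    return [m.start() for m in re.finditer(f"(?=({pattern}))", subject.upper())]


if __name__ == "__main__":
    print(is_exact_match("atgc", "ATGC"))
    print(find_subsequence("ATG", "CCATGATGA"))
    # N, then any residue except P, then S or T, then any residue except P
    print(find_motif("N[^P][ST][^P]", "MNGTAANPSA"))
```

A similarity search, by contrast, tolerates mismatches and gaps and ranks hits by score, which is why it needs the scoring parameters (matrices, gap penalties, e-value) mentioned above.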

It is very important for a searcher to carefully handle sequences with low-complexity regions (e.g. a poly-A tail or proline-rich regions), as they can lead to incorrect or biased output. Unlike in other types of searches, a progressive search strategy is useful in sequence searches, as it addresses the objective in a structured manner. In this case, the searcher would start with 100% identity and drop this percentage for shorter sequences. While scoping the project, the searcher must also consider assembling multiple search concepts. For example, for an antibody-based sequence search, these concepts might include the variable region (VH and VL) peptide sequences, the DNA sequence, and the CDR regions.
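The progressive strategy described above can be sketched as a loop that starts at 100% identity and relaxes the threshold only when the stricter level returns too few hits. This is a minimal sketch under strong assumptions: it scores ungapped windows only (a real tool would use a proper alignment algorithm), and the threshold ladder, the function names and the toy database are all hypothetical.

```python
def percent_identity(query: str, window: str) -> float:
    """Percentage of positions at which two equal-length strings agree."""
    matches = sum(a == b for a, b in zip(query, window))
    return 100.0 * matches / len(query)


def best_ungapped_identity(query: str, subject: str) -> float:
    """Best identity of the query against any same-length window of the
    subject. No gaps are allowed, which is a deliberate simplification."""
    q, s = query.upper(), subject.upper()
    if not q or len(s) < len(q):
        return 0.0
    return max(
        percent_identity(q, s[i:i + len(q)])
        for i in range(len(s) - len(q) + 1)
    )


def progressive_search(query, database, thresholds=(100, 95, 90, 80), min_hits=1):
    """Try each identity threshold in turn, strictest first, and stop at the
    first level that yields at least `min_hits` records.

    Returns (threshold_used, sorted_hit_names), or (None, []) if no level
    produced enough hits.
    """
    scores = {name: best_ungapped_identity(query, seq) for name, seq in database.items()}
    for threshold in thresholds:
        hits = sorted(name for name, score in scores.items() if score >= threshold)
        if len(hits) >= min_hits:
            return threshold, hits
    return None, []


if __name__ == "__main__":
    # Toy database with made-up record names and sequences.
    toy_db = {"A": "GGATCGATCGTT", "B": "GGATCGTTCGTT"}
    print(progressive_search("ATCGATCG", toy_db))
    print(progressive_search("ATCGAACG", toy_db))
```

Recording the threshold at which each hit first appears keeps the review structured: exact matches are assessed first, and more divergent hits are only brought in when needed.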

3. Reinforcing the search strategy by including keyword-based searches 

Often, in the process of evaluating the search results, the searcher must streamline the evaluation by implementing supplementary keyword-based searches. The keywords for these searches can be drawn from:

  • features of the inventive concept 
  • names of proteins and genes or their biological targets, etc. 
  • the functional aspects of the biomolecule 
  • application areas 

This step is also extremely important when the search sequence yields a huge data set that is difficult to evaluate. At this point, the searcher must implement a logical process to filter the data set down to a focused set of records for evaluation.
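As an illustration of this filtering step, the sketch below narrows a list of sequence hits to those whose text mentions at least one keyword of interest. The record layout (the Hit class and its fields), the identity cut-off and the example records are all hypothetical; real database exports will look different.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical sequence-search hit as exported from a database."""
    publication: str   # document identifier (made-up values in the example)
    identity: float    # percent identity reported for the alignment
    text: str          # title, abstract or claim text available for the record


def filter_hits(hits, keywords, min_identity=0.0):
    """Keep hits that reach the identity cut-off AND mention at least one
    keyword (case-insensitive substring match)."""
    wanted = [k.lower() for k in keywords]
    kept = []
    for hit in hits:
        text = hit.text.lower()
        if hit.identity >= min_identity and any(k in text for k in wanted):
            kept.append(hit)
    return kept


if __name__ == "__main__":
    hits = [
        Hit("DOC-1", 98.0, "Monoclonal antibody binding PD-1 for use in oncology"),
        Hit("DOC-2", 99.0, "Plant promoter for drought tolerance"),
        Hit("DOC-3", 85.0, "Anti-PD-1 antibody fragments"),
    ]
    focused = filter_hits(hits, keywords=["antibody", "PD-1"], min_identity=90.0)
    print([h.publication for h in focused])
```

Simple substring matching will miss synonyms and spelling variants, so in practice the keyword list would cover the alternative names of the protein, gene or target.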

4. Protein search over nucleotide search  

Searching with both amino acid and nucleotide sequences helps to ensure a comprehensive search. However, an amino acid sequence search is more sensitive than a nucleotide sequence search, and is hence considered more effective. Sensitivity of a nucleotide search decreases primarily because of non-coding regions and silent mutations, which also lead to non-relevant hits in the dataset. For related sequences, the percentage sequence identity (or homology alignment score) at the nucleotide level is typically lower than at the protein level. Therefore, a protein search is recommended over a nucleotide search when looking for related sequences in various search databases.
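The effect of silent mutations can be seen in a small worked example: two DNA sequences that differ at several synonymous codon positions still translate to the same peptide. The sketch below uses the standard genetic code; the two sequences and the function names are made up for illustration.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence in frame 1; '*' marks a stop codon.
    Trailing bases that do not form a full codon are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Position-by-position identity of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Two made-up coding sequences that use different codons for the same residues.
    dna_a = "ATGGCTCTGAGC"
    dna_b = "ATGGCATTATCT"
    print("nucleotide identity:", percent_identity(dna_a, dna_b))
    print("protein identity:", percent_identity(translate(dna_a), translate(dna_b)))
```

A nucleotide query built from one of these sequences could miss the other at a strict identity threshold, while a protein query would retrieve both.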

5. Access to up-to-date data 

Any search output is only as good as the input that is processed. Analyzing outdated data will result in an incomplete assessment. Genome sequencing is a dynamic field: every day, the data universe keeps expanding as additional data sources and new scientific and patent literature are added. The higher the number of sequences involved in the search process, the higher the risk of working from incomplete data. Thus, it is important to have regular validations and updates to enhance the robustness of the search process.

Sequence Search – Can we do it? 

It’s doable! It’s manageable! It’s reliable!  

Sequence searching is a very niche segment, yet it forms an integral part of almost all spheres of the life-sciences domain, ranging from therapeutics and diagnostics to fundamental research.

An experienced searcher can add significant value in the search process, and this acts as a big differentiator. With the analytical skills and judgement of the searcher, the huge sea of database results is crystallized into actionable insights.  

By leveraging these guidelines and establishing clear search objectives from the start (understanding the “what” of the search), searchers can achieve high levels of accuracy and recall. Relying only on databases (commercial or public) and tools to manipulate sequence data may not be enough.

Understanding the claim language is critical in assessing the relevancy of results. In fact, a high-quality sequence search relies on the searcher’s claim interpretation skills. In our next blog post, we’ll discuss how a searcher interprets claim language and effectively analyses various algorithms while conducting sequence searches.

Click the link to learn how to interpret claim language in Priyanka’s second blog post: Sequence Searches: How to Interpret Claim Language?

Priyanka Paul

Priyanka Paul is an associate vice president with the IP and R&D Solutions at Evalueserve, leading a team of Life Sciences and Healthcare professionals. Her expertise spans the entire lifecycle of research: from understanding client business needs; gathering, analyzing, and presenting quantitative and qualitative data to deliver strategic insights; and collaboratively developing innovative solutions to meet business challenges that aid life sciences & healthcare outcomes.
An avid traveller, Priyanka loves going on long drives and visiting new places, and enjoying local delicacies. Equally enthusiastic to explore the world of IP research, Priyanka looks forward to using the Information Adventurers blog to share ideas and invite readers to join the conversation to share their experiences, learnings, and perspectives on this topic.
