2015-12-18

What BLAST's max-target-sequences doesn't do


This is a short post to highlight a scary BLAST+ -max_target_seqs bug found and reported by Sujai Kumar, which he discovered in the course of working on some puzzling Blobtools output while analysing the tardigrade genome.

In essence we, and probably most BLAST+ users, had assumed the -max_target_seqs command line option was applied after the search was finished prior to output, and likewise for the plain text output's -num_descriptions and -num_alignments (see also #5 on my Christmas 2014 BLAST wish list).

$ blastp -help
...
-max_target_seqs <Integer, >=1>
Maximum number of aligned sequences to keep
Not applicable for outfmt <= 4
Default = `500'
* Incompatible with: num_descriptions, num_alignments
...

Then Sujai found a reproducible counter example contradicting this mental model, where the top 20 hits switched from all being Eukaryotes to all Bacteria depending on -max_target_seqs.

The NCBI reply by email was:
Hello,
Thank you for the report. We don't consider this a bug, but I agree that we should document this possibility better. This can happen because limits, including max target sequences, are applied in an early ungapped phase of the algorithm, as well as later. In some cases a final HSP will improve enough in the later gapped phase to rise to the top hits. In your case, relaxing the limit to 200 appears to have allowed hits that would have been excluded in the ungapped phase at 100 max target sequences to rise.

So basically while most of the time -max_target_seqs does seem to be a final filter before the output is written, this setting actually happens much earlier in the search and can (unexpectedly) cull what turns out to be the best hit.

I agree with the NCBI BLAST team that this needs better documentation, but believe this is should be treated as bug not a feature.

Many thanks to Sujai for openly reporting this via Twitter and posting his BLAST+ bug report as a gist on GitHub where we could see and comment on the issue. Sadly I'm still waiting for an official NCBI public BLAST+ bug tracker.

Update 26 September 2018

This post was cited in Shah et al. (2018) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. I also fixed a typo.

Update 2 November 2018

I've posted an initial followup, BLAST max alignment limits repartee - part one, which introduces a small self-contained test case, and emphasises that this issue is not specific to -max_target_seqs but also affects the limits -num_descriptions and -num_alignments used with the human readable plain text or HTML output.

In BLAST max alignment limits repartee - part two, I focus on the question of if database order is important (as claimed in Shar et al. 2018), and how exactly the internal alignment number limit works.

Update 13 November 2018

I've published BLAST max alignment limits - part three looking at the test case from Nidhi Shah, and BLAST max alignment limits - part four looking at the internal alignment number limit in the context of nucleotide databases (where composition based statistics are not used).

Update (7 December 2018)

I've published a fifth follow-up post, BLAST tie breaking by database order, looking at how the BLAST database order is defined in comparison to the FASTA file used to build a database.

Update (8 January 2019)

I've published a sixth follow-up post, "An overly aggressive optimization in BLASTN and MegaBLAST", which comes after the BLAST team's formal reply to the Shah et al. (2018) letter and BLAST+ 2.8.1 were published in late December. This update fixed oddities reported in Shah et al. (2018), which turn out to have been a complex interaction of multiple issues.

10 comments:

  1. Was it ever resolved as to whether the -num_descriptions and -num_alignments options were applied after the search as a simple filter? Or are they subject to the same bug as the -max_target_seqs command?

    ReplyDelete
    Replies
    1. Its not a bug, its a "feature". But yes, internally both are mapped to a limit on the number of alignments, and this same surprising behaviour described here can be demonstrated with plain text output and -num_descriptions and -num_alignments in exactly the same as the computer readable formats like tabular with -max_target_seqs. I am working on a followup post of some sort...

      Delete
  2. Oh dear, so this "feature" is demonstrated in the text output? I routinely use the text as a quick visual scan and limit both to 1 to the see "best" match.

    ReplyDelete
  3. I'm confused by the recent Shah et al paper. It describes a problem in which results depend on the order in which sequences appear in the database, and cites the Sujai Kumar post and the blog above as the earliest reports of it.

    But the problem in the paper is simply not the same as the problem blogged here - which is about the stage at which filtering occurs. For any given alignment limit, you should get the same results irrespective of the order in which the sequences occur in the database - and as far as I can see, you do.

    That is - I acknolwedge the existence of the problem you've blogged, and I've had some results which look compatible with this (and I totally agree, it's annoying - and worthy of Bug status). Also, just running blast with a very large -max_target_seqs makes it run many times longer than normal, which indicates that the filtering is unlikely to be occurring right at the end.

    But I can't reproduce the other (Shah et al) problem - i.e. first N hits which pass, fill the N slots and later hits lose out, even if they are better. Maybe it's a verstion thing, but my results refute it (current version 2.7.1+ and a much older one, 2.2.9+; maybe these post-date/pre-date the fixing/introduction of the Shah-reported bug?). That is, if I limit the hit-count to 10, I still get the same top 10 - even though the 11th hit (as shown by alternative searches with larger hit-counts) occurs before all those 10 in the database. That's also true of hits lower down. So if I limit to 10000, 250 or 10 hits, I still get the same top 10, in the same order; and in each case the top hit is the same as if I use -max_target_seqs 1.

    Which is reassuring, but I'm now wondering what's going on - did the 'first N hits' bug exist but is now fixed? I've only been testing using a very small DB (relatively speaking; but still much larger than the largest hit-count limit I imposed)- but the described issues shouldn't be affected by DB size.

    Definitely looking forward to your update! :-)

    ReplyDelete
    Replies
    1. I'm pretty sure that some of what Shah et al (2018) report is at best poorly worded, and has confused the issue. It is possible that they have stumbled on a separate issue, but they didn't include a test case. In more positive news, I've been able to recreate a small test case based on Sujai's original test case (which relied on an older version of the NR database which is no longer available), and that is going to be a large part of my followup.

      Delete
    2. I did wonder whether it was a wording issue - i.e. by their "the first N hits" perhaps they meant "the best N hits of the first round"; but the paper is unequivocal that the results depend on the order of sequences in the database. By the way, I've realised that the current version of NCBI Blast is only about a week old - so I wondered if it were possible that the 'order matters' bug might have been fixed there, but there's no mention of it in the changelog.

      Delete
    3. Sorry, to avoid creating more confusion - I should scrub my comment about the age of the current Blast release...seems I forgot what year it is! It's just over a year old.

      Delete
    4. The next post is up at https://blastedbio.blogspot.com/2015/12/blast-max-target-sequences-bug.html along side a self-contained small test case at https://github.com/peterjc/blast_max_target_seqs

      Delete
    5. And part two which https://blastedbio.blogspot.com/2018/11/blast-max-alignment-limits-repartee-two.html is up, asking if database order is important (as claimed in Shar et al. 2018), and how exactly the internal alignment number limit works.

      Delete
  4. Hi,

    Thank you for this post!

    So, if max_target_seqs doesn't give you the best hit, how can I do it to get the best hit?

    Thank you!

    ReplyDelete