2015-07-08

BLAST XML 2 - does the sequel live up to my hopes?

Last year I wrote a blog post "BLAST XML output needs more love from NCBI", and in the numerous updates to this, tracked the NCBI outreach and then release of BLAST XML 2.

The new output format was included in BLAST+ 2.2.31 as output format 15, without any kind of beta release for user feedback. Later than planned, I was able to give this a try during the Galaxy Community Conference 2015 Hackathon. Sadly the worries voiced on the OBF Bio* mailing lists were well founded.

In part because XML is so verbose, it is nice to be able to parse it as a stream - meaning capturing the output via stdout and Unix pipes. That appears to be "broken". In fact, producing a bundle of XML files using XInclude seems a recipe for trouble.

Setting up the example

Here I am using the sample files from this repository, first an example using -outfmt 6, the concise tabular output:

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 6
gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 96.55 348 12 0 1 348 1 348 0.0 701
gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 83.33 354 53 2 1 354 1 348 0.0 605
...

Now, using classic BLAST XML,

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 5
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>BLASTP 2.2.31+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>four_human_proteins.fasta</BlastOutput_db>
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</BlastOutput_query-def>
  <BlastOutput_query-len>348</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>0.0001</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>
  <Iteration_query-def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Iteration_query-def>
  <Iteration_query-len>348</Iteration_query-len>
  ...
</Iteration>
<Iteration>
  <Iteration_iter-num>2</Iteration_iter-num>
  <Iteration_query-ID>Query_2</Iteration_query-ID>
  <Iteration_query-def>gi|3024260|sp|P56514.1|OPSD_BUFBU RecName: Full=Rhodopsin</Iteration_query-def>
  <Iteration_query-len>354</Iteration_query-len>
  ...
</Iteration>
...
</BlastOutput_iterations>
</BlastOutput>

There is one <Iteration> block per query.

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 5 | grep "<Iteration>"
<Iteration>
<Iteration>
<Iteration>
<Iteration>
<Iteration>
<Iteration>

Invalid XML to stdout

Finally, using the new BLAST XML v2 as implemented in BLAST+ 2.2.31,

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14
<?xml version="1.0"?>
<BlastOutput2
    xmlns="http://www.ncbi.nlm.nih.gov"
    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
    xs:schemaLocation="http://www.ncbi.nlm.nih.gov http://www.ncbi.nlm.nih.gov/data_specs/schema_alt/NCBI_BlastOutput2.xsd"
>
  <report>
    <Report>
    ...
    </Report>
  </report>
</BlastOutput2>
<?xml version="1.0"?>
<BlastOutput2
    xmlns="http://www.ncbi.nlm.nih.gov"
    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
    xs:schemaLocation="http://www.ncbi.nlm.nih.gov http://www.ncbi.nlm.nih.gov/data_specs/schema_alt/NCBI_BlastOutput2.xsd"
>
  <report>
    <Report>
    ...
    </Report>
  </report>
</BlastOutput2>
...

Anyone see the problem right away? There are multiple <?xml version="1.0"?> lines though-out the output - this is not valid XML but the concatenation of XML files.

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 | grep "xml version"
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>
<?xml version="1.0"?>

Does anyone find this familiar? Early versions of the XML BLAST output from "legacy" BLAST had the same problem (before the <Iteration> element was repurposed as a per-query block).

The reason behind this is perhaps somewhat explained if we ask BLAST+ to write to a file instead of defaulting to stdout:

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out example.xml
$ cat example.xml 
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
 <xi:include href="example_1.xml"/>
 <xi:include href="example_2.xml"/>
 <xi:include href="example_3.xml"/>
 <xi:include href="example_4.xml"/>
 <xi:include href="example_5.xml"/>
 <xi:include href="example_6.xml"/>
</BlastXML>

It appears that what is being written to stdout by default is the concatenation of these six child files (at least in this example).

Problems with paths

I also found a clear bug in the new BLAST XML v2 output, consider this example:

$ mkdir output
$ ls output/
$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out output/example.xml
$ ls output/
example.xml example_1.xml example_2.xml example_3.xml example_4.xml example_5.xml example_6.xml
$ cat output/example.xml 
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
 <xi:include href="output/example_1.xml"/>
 <xi:include href="output/example_2.xml"/>
 <xi:include href="output/example_3.xml"/>
 <xi:include href="output/example_4.xml"/>
 <xi:include href="output/example_5.xml"/>
 <xi:include href="output/example_6.xml"/>
</BlastXML>

The problem here is that the master XML file contains include links with paths relative to where BLAST was run - not relative to the master XML file itself. This would break parsing the output

Similarly, if I use an absolute path to the output parameter, then the Include lines also use absolute paths. Again, since they are output in the same folder as the master XML file, the include paths should simply be relative paths (otherwise the XML file set cannot be moved without breaking the links):

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out /tmp/example.xml
$ cat /tmp/example.xml 
<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
 <xi:include href="/tmp/example_1.xml"/>
 <xi:include href="/tmp/example_2.xml"/>
 <xi:include href="/tmp/example_3.xml"/>
 <xi:include href="/tmp/example_4.xml"/>
 <xi:include href="/tmp/example_5.xml"/>
 <xi:include href="/tmp/example_6.xml"/>
</BlastXML>

Problems with permissions

BLAST+ 2.2.31 will auto-name the child XML files for each query based on the requested output file - but there is no guarantee that those files won't already exist, or that they can even be created. Here's a pathological example using a special "filename":

$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out /dev/stdout
Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

<?xml version="1.0"?>
<BlastXML
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xi="http://www.w3.org/2003/XInclude">
 <xi:include href="/dev/stdout_1.xml"/>
 <xi:include href="/dev/stdout_2.xml"/>
 <xi:include href="/dev/stdout_3.xml"/>
 <xi:include href="/dev/stdout_4.xml"/>
 <xi:include href="/dev/stdout_5.xml"/>
 <xi:include href="/dev/stdout_6.xml"/>
</BlastXML>

In this example, /dev/stdout is a Linux convention for writing to stdout (useful with command line tools which otherwise would only output to a file). I've highlighted in red the errors (on stderr) from BLAST+ 2.2.31 naively generating child XML filenames from the -out parameter.

There's another bug here in that BLAST did not abort, indeed it gave a zero return code meaning success. You can trigger this in other ways:

$ ls example*.xml
ls: example*.xml: No such file or directory
$ touch example_2.xml
$ chmod a-w example_2.xml 
$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 14 -out example.xml && echo "Returned $?"
Error: [blastp] Cannot open output fileNCBI C++ Exception:
    T0 "/Users/coremake/release_build/build/PrepareRelease_IntelMAC_JSID_01_80346_130.14.18.6_9008__PrepareRelease_IntelMAC_1433256305/c++/compilers/unix/../../src/algo/blast/format/blastxml2_format.cpp", line 751: Error: BLASTFORMAT::ncbi::BlastXML2_FormatReport() - Cannot open output file

Returned 0
$ ls -l example*.xml
-rw-r--r--  1 peterjc  staff   340  5 Jul 17:03 example.xml
-rw-r--r--  1 peterjc  staff  4079  5 Jul 17:03 example_1.xml
-r--r--r--  1 peterjc  staff     0  5 Jul 17:02 example_2.xml
-rw-r--r--  1 peterjc  staff  4026  5 Jul 17:03 example_3.xml
-rw-r--r--  1 peterjc  staff  4019  5 Jul 17:03 example_4.xml
-rw-r--r--  1 peterjc  staff  4070  5 Jul 17:03 example_5.xml
-rw-r--r--  1 peterjc  staff  4099  5 Jul 17:03 example_6.xml

In this case BLAST+ 2.2.31 was unable to create one of the child files, but again despite printing an error message to stderr, lies and returns a success return code.

Why use XML includes?

Since I first read the BLAST XML v2 proposal, I have yet to come up with a compelling reason for the NCBI to be using XML includes and all the overheads of multi-file output. Would it make sense to get rid of this in the next BLAST+ release and simply produce a single (large) XML file even for multiple-query BLAST searches?

Update (12 November 2015)


The NCBI are going to offer the new BLAST+ XML and JSON output in a single-file mode with the next release.

Update (31 December 2015)

BLAST+ 2.3.0 (released 21 December 2015) appears to fix some/all of bugs logged here, and offers a single-file mode. I have not yet tested this.

No comments:

Post a Comment