I am watching ESPN, chatting with March Hare, petting Dormouse, blogging,
and running a program to analyze sets of data all at the same time. How's
that for multi-tasking?!
As has happened often recently, I am amazed tonight by what computers and software can do for biologists. Seriously, how else would I be able to align two DNA sequences, each of which is over 120,000 bases long, with each other without going completely blind?
Given the trend toward doing experiments that generate ginormous datasets--microarrays, mass sequencing, etc.--it seems a shame to me that so many biologists (including myself) don't really know much about the programs we use to analyze these data. We know just enough to get the programs to do what we need them to do, but not enough to understand how the algorithms work or how to tweak the parameters to generate better analyses. In this age of increasing reliance on computational biology, perhaps biology graduate programs ought to require some kind of remedial bioinformatics course, no?
This point was driven home earlier today by a paragraph in a paper I was reading. The paper was about a comparison of the efficacy and sensitivity of several different computer programs for detecting repeat elements either in sequencing reads or in assembled contigs. In the discussion section, the authors addressed the fact that they had compared the various programs using the default parameters instead of trying to optimize the settings:
Of note, each tool was evaluated using its default parameters. We chose not to conduct tool optimization because, in our experience, it is very common for biologists to operate bioinformatics tools using default parameters. Some likely reasons why optimization is often avoided are as follows:
- Many biologists have little or no understanding of the algorithms and programming behind computational tools and thus do not feel comfortable changing program parameters.
- Public domain tools rarely come with documentation that can be easily understood by those lacking experience in computational biology.
- Because software developers cannot anticipate every dataset and/or application on which their tool may be used, they often provide only vague suggestions as to how optimization might be conducted.
- The act of optimizing a tool for a particular dataset or application can be very difficult and time consuming.
- Program default settings often become ‘standards’ to which researchers adhere so that they can directly compare their results with those of researchers who have used the same program in default mode in the past.
- What constitutes an ‘optimal result’ differs from user to user prompting some scientists to use the default parameters as a way to limit introduction of their own biases into results/conclusions.
So...numbers 1, 2, 5 and 6 basically translate to (1) biologists don't know shit about computer programs and have absolutely no clue how changing various parameters will alter the result, (2) we could try reading the manual but we wouldn't understand it anyway, (3) um...we'll just do what that other guy did cuz we don't know any better, and (4) if we knew how to tweak the program, we could make it give us any result we wanted...just like with Photoshop!
I might consider being offended on behalf of biologists if all of these points weren't so true. I mean, I did just spend almost an entire afternoon annotating a genome by copying and pasting each ORF one-by-one into the query box at BLAST, right? Sigh.
So instead, I laughed my ass off at seeing our shortcomings in this area so...um...tactfully described in an actual journal paper! Kudos to the authors for having the guts to be so candid. And the program they recommended for finding repeats had better damn well be idiot-proof and work like a charm or I'm going to need some serious help.