Delete intermediate files

A brief introduction to best practices in deleting intermediate files.

Which files can be deleted?

Many bioinformatics workflows involve multiple steps of processing on certain files (e.g., filtering, sorting, annotation).

Those steps generate and store multiple copies of those files at the various steps of processing.

In such cases, you should consider deleting all but the final copy of that file.

For instance:

# Run HISAT2 to map reads
hisat2 -f \
  -x $HISAT2_HOME/example/index/22_20-21M_snp \
  -U $HISAT2_HOME/example/reads/reads_1.fa \
  -S example1.sam
# Run samtools view to convert from SAM to BAM
samtools view -bS example1.sam > example1.bam
# Run samtools sort to sort mapped reads
samtools sort example1.bam -o example1.sorted.bam
# Remove intermediate files
rm example1.sam example1.bam

In particular:

  • The rm command is used to delete all files except for the final example1.sorted.bam.

Which files should not be deleted?

Small files

It is usually unecessary to delete small intermediate files.

The benefit of disk space recovered is usually negligible against the time and effort necessary to regenerate those files if necessary.