Concatenating Multiple .fasta Files
I'm trying to concatenate hundreds of .fasta files into a single, large fasta file containing all of the sequences. I haven't found a specific method to accomplish this in the forums. I did come across this code from http://zientzilaria.heroku.com/blog/2007/10/29/merging-single-or-multiple-sequence-fasta-files, which I have adapted a bit.
fasta.py contains the following code:
class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index += 1
            name = line[:-1]
            seq = ''
            aninstance = fasta(name, seq)
        else:
            seq += line[:-1]
            aninstance = fasta(name, seq)
    items.append(aninstance)
    return items
And here is the adapted script to concatenate .fasta files:
import sys
import glob
import fasta

#obtain directory containing single fasta files for query
filepattern = input('Filename pattern to match: ')

#obtain output directory
outfile = input('Filename of output file: ')

#create new output file
output = open(outfile, 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):

    print ("file: " + file)

    #we read the contents and an instance of the class is returned
    contents = fasta.read_fasta(open(file).readlines())

    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()
print("done")
The script finds the fasta files, but the newly created output file contains no sequences. The error I receive comes from fasta.py, which is beyond my ability to debug:
Traceback (most recent call last):
  File "C:\Python32\myfiles\test\3\Fasta_Concatenate.py", line 28, in <module>
    contents = fasta.read_fasta(open(file).readlines())
  File "C:\Python32\lib\fasta.py", line 18, in read_fasta
    seq += line[:-1]
UnboundLocalError: local variable 'seq' referenced before assignment
Any suggestions? Thanks!
I think using Python for this job is overkill. On the command line, a quick way to concatenate single or multiple fasta files with the .fasta or .fa extension is simply:
cat *.fa* > newfile.txt
The problem is in fasta.py:
else:
    seq += line[:-1]
    aninstance = fasta(name, seq)
Try initializing seq at the start of read_fasta(file), before the loop.
EDIT: Further explanation
In read_fasta, seq is only assigned inside the branch that handles a line starting with >. The traceback shows that the first line of at least one file does not start with > (it may be a blank line, for example), so the else branch runs first and tries to append that line to seq before seq has ever been assigned. The name does not exist yet, so Python has nothing to append to. The error in the traceback says exactly that:
UnboundLocalError: local variable 'seq' referenced before assignment
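One way to apply that fix, as a minimal sketch rather than the original author's code: initialize name and seq before the loop and ignore anything that appears before the first header. The class and function names follow the question's fasta.py; skipping leading non-header lines and stripping line endings with rstrip (instead of line[:-1]) are assumptions made here.

```python
class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def read_fasta(lines):
    """Parse an iterable of lines into a list of fasta records."""
    items = []
    name = None  # no header seen yet
    seq = ''     # assigned up front, so "seq +=" can never hit an unbound name
    for line in lines:
        # rstrip only removes line endings, so a last line without a
        # trailing newline does not lose its final character
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            if name is not None:
                items.append(fasta(name, seq))
            name = line
            seq = ''
        elif name is not None:
            seq += line
        # assumption: lines before the first header are silently ignored
    if name is not None:
        items.append(fasta(name, seq))
    return items
```

With this version, a file that begins with a blank line no longer raises UnboundLocalError, and the concatenation script in the question can call fasta.read_fasta unchanged.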
I'm not a Python programmer, but it seems that the code in the question tries to condense the data for each sequence onto a single line and also to separate sequences with a blank line, turning this:
>seq1
00000000
11111111
>seq2
22222222
33333333
into this:
>seq1
0000000011111111

>seq2
2222222233333333
If that is in fact what is needed, the cat-based solution above will not work. Otherwise, cat is the simplest and most effective solution.
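If the one-line-per-sequence layout is wanted, the transformation could be sketched in Python roughly as follows. The function name unwrap_fasta is hypothetical, and the blank line written between records simply mirrors what the question's script does.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


wrapped = ['>seq1', '00000000', '11111111', '>seq2', '22222222', '33333333']
# header, sequence on one line, then a blank line, as in the question's script
text = ''.join(h + '\n' + s + '\n\n' for h, s in unwrap_fasta(wrapped))
print(text)
```

Because it is a generator, the same function can be fed an open file handle directly, so large inputs do not have to be read into memory first.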
On Windows, via the command prompt (note: the folder should contain only the required files):
copy *.fasta final.fasta
The following ensures that new files always start on a new line:
$ awk 1 *.fasta > largefile.fasta
The solution using cat can fail when a file does not end with a newline:
$ echo -n foo > f1
$ echo bar > f2
$ cat f1 f2
foobar
$ awk 1 f1 f2
foo
bar
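For anyone who does want to stay in Python, the same safeguard can be sketched as below. This is an illustration under assumptions, not a replacement for awk 1: the function name concatenate is hypothetical, files are read as text, and the sorted order is a choice made here so the output is reproducible.

```python
import glob
import os


def concatenate(pattern, outfile):
    """Concatenate files matching pattern into outfile.

    A newline is added after any file that does not already end with one,
    so the next file's first header always starts on its own line.
    """
    out_abs = os.path.abspath(outfile)
    # Collect inputs before opening the output, and never read the output itself.
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]
    with open(outfile, 'w') as out:
        for path in paths:
            last = ''
            with open(path) as handle:
                for line in handle:
                    out.write(line)
                    last = line
            if last and not last.endswith('\n'):
                out.write('\n')
    return paths
```

Calling concatenate('*.fasta', 'largefile.fasta') in the directory from the example would behave like the awk command for files such as f1 and f2, writing foo and bar on separate lines.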