perl basic fasta header and sequence separation

I have a fasta file formatted like the below

>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF

From the header I have separated ID as 4th position(DU991381.1) separated by a pipeline (|) and info (E.g. KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence) by spliting the header.

However I cannot make the sequences in a line and push them to an array.

This is my code

#!/usr/bin/perl

use warnings;
use strict;

my $stData = "C:\\data.txt";

open (DATA, $stData);

my @stID   = "";
my @stInfo = "";
my @stSeq  = "";

while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
    }
    else #This is the part I cannot figure out
    {
        my $stSeq = $stLine.$stLine;
        print $stSeq,"\n";
    }
}


close (DATA);

So the outcome that I want is like this, an array with sequences in a line.

@stSeq  = [ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC, GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF]

Please, help me out! Cheers.

Answers


...

my @stID   = (); # empty arrays
my @stInfo = ();
my @stSeq  = ();
my $tmpStSeq = ""; # temporary empty string


while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);

        if ($tmpStSeq ne "") { #previous seq is pushed when a next ">" is found
          push @stSeq, $tmpStSeq;
          $tmpStSeq = ""; # and previous accumulations are flushed
        }
        next; # adding this you avoid `else block`
    }
    # else {
    $tmpStSeq .= $stLine; # accumulates into temporary (of course, no `my` needed)
    # }        
}

if ($tmpStSeq ne "") { # pushes last seq into stSeq
  push @stSeq, $tmpStSeq;
  $tmpStSeq = "";
}

# here you can use your arrays.

...

Maybe using an array of hashes would be better.


I'd do:

my %data;
my ($id, $info);
while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        ($id, $info) = (split/\|/,$stLine)[3,4];
    }
    else {
        $data{$id}{$info} .= $stLine;
    }
}
dump%data;

__DATA__
>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF

output:

(
  "DU991381.1",
  {
    " KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence" => "ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC",
  },
  "DU991380.1",
  {
    " KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence" => "GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF",
  },
)

while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
        print "$stSeq\n";           # You could print here to get the whole last
                                    # seq before you start a new one.
                                    # Of course this would be the one from before
                                    # the header parsed here.

        $stSeq = "";                # Start a new sequence when you hit a header
    }
    else #This is the part I cannot figure out
    {
        $stSeq = $stSeq.$stLine;    # No my here and append to the sequence so far.
        print $stSeq,"\n";          # This will of course only print the seq so far.
                                    # Probably the above print is better.
    }
}

Need Your Help

Is the data structure union sockunion only for FreeBSD or also for Linux?

c linux sockets freebsd

Is the data structure union sockunion only for FreeBSD or is it also valid for Linux? If its only for FreeBSD then what is the equivalent in Linux?

JDBC BigDecimal with SQLite

sqlite jdbc bigdecimal

I am trying to recieve a Big Decimal from an SQL table containing DECIMAL(13,4) using the following code: