fasta - Perl Program error -
i wrote perl program takes excel sheet (coverted text file changing extension .xls .txt) , sequence file input. excel sheet contains start point , end point of area in sequence file (along 70 flanking values on either side of match area) needs cut , extracted third output file. there 300 values. program reads in start point , end point of sequence needs cut each time repeatedly tells me value outside length on input file when isn't. cant seem fixed
this program
use strict; use warnings; $blast; $i; $idline; $sequence; print "enter blast result file name:\t"; chomp( $blast = <stdin> ); # blast result file name print "\n"; $database; print "enter gene list file name:\t"; chomp( $database = <stdin> ); # sequence file print "\n"; open in, "$blast" or die "can not open file $blast: $!"; @ids = (); @seq_start = (); @seq_end = (); while (<in>) { #spliting result file based on each tab @feilds = split( "\t", $_ ); push( @ids, $feilds[0] ); #copying name of sequence #coping 6th tab value of result start point of value should cut. push( @seq_start, $feilds[6] ); #coping 7th tab value of result file end point of value should cut. push( @seq_end, $feilds[7] ); } close in; open out, ">result.fasta" or die "can not open file $database: $!"; ( $i = 0; $i <= $#ids; $i++ ) { ($sequence) = &block( $ids[$i] ); ( $idline, $sequence ) = split( "\n", $sequence ); #extracting sequence start point end point $seqlen = $seq_end[$i] - $seq_start[$i] - 1; $nucleotides = substr( $sequence, $seq_start[$i], $seqlen ); #storing extracted substring $sequence $nucleotides =~ s/(.{1,60})/$1\n/gs; print out "$idline\n"; print out "$nucleotides\n"; } print "\nextraction completed..."; sub block { #block id storage first tab in blast output file. $id1 = shift; print "$id1\n"; $start = (); open in3, "$database" or die "can not open file $database: $!"; $blockseq = ""; while (<in3>) { if ( ( $_ =~ /^>/ ) && ($start) ) { last; } if ( ( $_ !~ /^>/ ) && ($start) ) { chomp; $blockseq .= $_; } if (/^>$id1/) { $start = $. - 1; $blockseq .= $_; } } close in3; return ($blockseq); }
blast result file: http://www.fileswap.com/dl/ws7ehftejp/
sequence file: http://www.fileswap.com/dl/lpwugh2okm/
error
substr outside of string @ nucleotide_extractor.pl line 39.
use of uninitialized value $nucleotides in substitution (s///) @ nucleotide_extractor.pl line 41.
use of uninitialized value $nucleotides in concatenation (.) or string @ nucleotide_extractor.pl line 44.
any appreciated , queries invited
there several problems existing code, , ended rewriting script while fixing errors. implementation isn't efficient opens, reads, , closes sequence file every id in excel sheet. better approach either read , store data sequence file, or, if memory limited, go through each entry in sequence file , pick out corresponding data excel file. better off using hashes, instead of arrays; hashes store data in key -- value pairs, easier find you're looking for. have used references throughout, make easy pass data , out of subroutines.
if not familiar perl data structures, check out perlfaq4 , perldsc, , perlreftut has information on using references.
the main problem existing code subroutine sequence fasta file not returning anything. idea put plenty of debugging statements in code ensure doing think doing. i've left in debugging statements commented them out. i've copiously commented code changed.
#!/usr/bin/perl use strict; use warnings; # enables 'say', prints out text , adds carriage return use feature ':5.10'; # useful module dumping out data structures use data::dumper; #my $blast = 'infesmall.txt'; print "enter blast result file name:\t"; chomp($blast = <stdin>); # blast result file name print "\n"; #my $database = 'infe.fasta'; print "enter gene list file name:\t"; chomp($database = <stdin>); # sequence file print "\n"; open in,"$blast" or die "can not open file $blast: $!"; # instead of using 3 arrays, let's use hash reference! # each id, want store start , end point. that, # we'll use hash of hashes. start , end information in 1 # hash reference: # { start => $fields[6], end => $fields[7] } # , use hashref value in hash, key # id, $fields[0]. means can access start or end data using # code this: # $info->{$id}{start} # $info->{$id}{end} $info; while(<in>){ #splitting result file based on each tab @fields = split("\t",$_); # add data our $info hashref id key: $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] }; } close in; #say "info: " . dumper($info); # read sequence info fasta file $sequence = read_sequences($database); #say "data read_sequences:\n" . dumper($sequence); $out = 'result.fasta'; open(out, ">" . $out) or die "can not open file $out: $!"; foreach $id (keys %$info) { # check whether sequence exists if ($sequence->{$id}) { #extracting sequence start point end point $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1; #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end}; #storing extracted substring $sequence $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen); $nucleotides =~ s/(.{1,60})/$1\n/gs; #say "nucleotides: $nucleotides"; print out $sequence->{$id}{header} . "\n"; print out "$nucleotides\n"; } } print "\nextraction completed..."; sub read_sequences { # fasta file $fasta_file = shift; open in3, "$fasta_file" or die "can not open file $fasta_file: $!"; # initialise 2 variables. store our sequence data in $fasta # , use $id track current sequence id # $fasta hash this: # $fasta = { # 'gi|7212472|ref|nc_002387.2' => { # header => '>gi|7212472|ref|nc_002387.2| phytophthora...', # seq => 'ataaaataatatgaataaattaaaaccaagaaataaaatatgtt...', # } #} ($fasta, $id); while(<in3>){ chomp; if (/^>/) { if (/^>(\s+) /){ # header line sequence info. $id = $1; # save data $fasta hash, keyed seq id # we're going build entry go along # set header current line $fasta->{ $id }{ header } = $_; } else { # no id found! erk. emit error , undef $id. warn "formatting error: $_"; undef $id; } } ## ensure we're getting sequence lines... elsif (/^[atgc]/) { # if $id not defined, there's weird going on, # don't save sequence. in correctly-formatted file, # should not issue. if ($id) { # if $id set, add line sequence. $fasta->{ $id }{ seq } .= $_; } } } close in3; return $fasta; }
Comments
Post a Comment