Joining two files based on two fields

I posted a question before a week and the answer was simply (use join):

join <(sort file1) <(sort file2) >output

to join files that have something common which is usually the first field.

I have the following two files:

genes.txt

ENSG001 ENSG002
ENSG002 ENSG001
ENSG003 ENSG004

features.txt

ENSG001 400
ENSG002 350
ENSG003 210
ENSG004 100

I need to join these two files to be like this:

output.txt

ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100

I know the answer is in join command but I can't figure out how to join based on two fields. I tried

join -j 1 <(sort genes.txt) <(sort features.txt) >attempt1.txt

but the result will looks like this:

attempt1.txt

ENSG001 ENSG002 400
ENSG002 ENSG001 350
ENSG003 ENSG004 210

I then tried

join -j 2 <(sort -k 2 genes.txt) <(sort -k 2 features.txt) >attempt2.txt

attempt2.txt is empty

Does (join) have the ability to join two files based on two fields ? If no then how can I do it ?

Answers


%features;
open $fd, '<', 'features.txt' or die $!;
while (<$fd>) {
    ($k, $v) = split;
    $features{$k} = $v;
}
close $fd or die $!;

open $fd, '<', 'genes.txt' or die $!;
while (<$fd>) {
    s/(\w+)/$1 $features{$1}/g;
    print;
}
close $fd or die $!;

Thank you all guys I have managed to answer it by tricking the problem.

First I joined the files normally, I then changed the position of first and second field, I next joined the modified output file another time with features, and finally I switched the positions of fields again.

join <(sort genes.txt) <(sort features.txt) >tmp

cat tmp | awk '{ print $2, $1, $3 }' >tmp2

join <(sort tmp2) <(sort features.txt) >tmp3

cat tmp3 | awk '{ print $2, $3, $1, $4 }' >output.txt

To the best of my knowledge, join does NOT support this. See join manpage.

However, you can accomplish this in 2 ways:

  • Turn the first space/tab in the file into a caret (or other character you will never see in the file), then use join as before which will treat the first 2 fields as 1 field:

    perl -pi -e 's/^(\S+)\s+/$1#/' file1
    perl -pi -e 's/^(\S+)\s+/$1#/' file2
    join <(sort file1) <(sort file2) >output
    tr "#" " " output > output.final
    
  • Do it in Perl. You can do

    • the blunt approach (perreal's answer: slurp in 2 files at once); this takes a lot of memory if both files are large

    • The more memory conserving approach (cdtits's answer: slurp in a smaller file, store in a hash, then apply the lookups to line-by-line read of second file)

    • For really gynormous files, do a linear approach:

      sort both files, read 1 line of each file; if they match, print the match; if not; skip 1 line in the file whose ID was smaller.


In case that "ENST" in features.txt is "ENSG", here is an awk solution that works well on given example:

awk 'BEGIN {while(getline <"features.txt") f[$1]=$2} {print $1,f[$1],$2,f[$2]}' < genes.txt

I can explain in detail if you need to.


Using perl:

use strict;
use warnings;
open GIN, "<genes.txt"    or die("genes");
open FIN, "<features.txt" or die("features");
my %relations;
my %values;
while (<GIN>) {
  my ($r1, $r2) = split;
  $relations{$r1} = $r2;
}
while (<FIN>) {
  my ($k, $v) = split;
  $values{$k} = $v;
}
for my $r1 (sort keys %relations) {
  my $r2 = $relations{$r1};
  print "$r1 $values{$r1} $r2 $values{$r2}\n"; 
}
close FIN; close GIN;

Your approach is generally right. It should be achievable by something like

join -o '1.1 2.2 1.2 1.3' <(
    join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
    sort
) <(sort features.txt)

If I place ENSG004 instead of ENST004 into features.txt I will get exactly what you are looking for:

$ join -o '1.1 2.2 1.2 1.3' <(
      join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
      sort
  ) <(sort features.txt)
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100

There is less verbose version but there is harder to keep track of fields:

join -o '1.2 2.2 1.1 1.3' -1 2 <(
    join -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
    sort -k 2
) <(sort features.txt)

If you are going process really big data it will should work pretty effective to tens of GB (and also should be better then most of RDBMS's if features.txt and genes.txt are comparable in size):

TMP=`mktemp`
sort features.txt > "$TMP"
sort -k 2 genes.txt | join -o '1.1 1.2 2.2' -1 2 - "$TMP" | sort |
    join -o '1.1 2.2 1.2 1.3' - "$TMP"
rm "$TMP"

Need Your Help

Asp.net MVC Domain and sub domain creation

asp.net-mvc

I am using Asp.net MVC and I'm working on a project that provide to visitor register our shop my site with a sub-domain of my company website, for example http://www.shopify.com/ is website. then ...

Calling methods in super class constructor or subclass constructor?

python oop constructor

1. Passing configuration to the __init__ method which calls register implicitely: