Find duplicate line starts in a text file

I found this answer (Find duplicate lines in a file and count how many time each line was duplicated?) while searching. It handles lines that are exact duplicates, but my case is slightly different.

I need to find lines that count as duplicates because they share the same beginning.

For example:

2501,3,0,1,0,1457695800
2501,3,0,1,0,1457789340
2502,3,0,0,0,1457695800
2502,3,0,0,0,1457789340
2503,3,0,0,0,1457789340
2504,3,0,0,0,1457789340
2505,3,0,0,0,1457789340

In the CSV data above, the two 2501 lines would be duplicates of each other, as would the two 2502 lines, if the timestamp were not there.

Is there a way to detect them as duplicates by considering only the first 5 fields, i.e. excluding the timestamp?

Answers


I ended up finding the answer by tacking a bunch of commands together:

cat my_file.csv | perl -lpe 's/^(.*),[0-9]{10}\s*$/$1/' | sort | uniq -d

So basically, the steps are:

  1. use cat to get the contents of the file
  2. pipe it to perl and use a regular expression to get only the capturing group (in this case, everything before the timestamp)
  3. pipe the output to sort which will sort the content
  4. use uniq with -d switch to find line duplicates
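
The same steps can also be sketched with cut instead of a regular expression. This is only an illustration, assuming the timestamp is always the sixth comma-separated field and that no field contains a quoted comma. The function name dup_prefixes and the temporary sample file are hypothetical, not part of the original answer.

```shell
# Hypothetical helper: print every 5-field prefix that occurs on more than one line.
# Assumption: the timestamp is always field 6 and no field contains a quoted comma.
dup_prefixes() {
  cut -d, -f1-5 "$1" | sort | uniq -d
}

# Recreate the sample data from the question in a temporary file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
2501,3,0,1,0,1457695800
2501,3,0,1,0,1457789340
2502,3,0,0,0,1457695800
2502,3,0,0,0,1457789340
2503,3,0,0,0,1457789340
2504,3,0,0,0,1457789340
2505,3,0,0,0,1457789340
EOF

dup_prefixes "$sample"
```

On the sample data this prints 2501,3,0,1,0 and 2502,3,0,0,0, one per line. Because cut selects fields by position, trailing whitespace or a carriage return after the timestamp does not affect the result.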

If you like you can also output the result to file:

cat my_file.csv | perl -lpe 's/^(.*),[0-9]{10}\s*$/$1/' | sort | uniq -d > line_duplicates.txt
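
If the number of occurrences matters too, as in the linked question, a variation along these lines may work. It is a sketch under the assumption that the timestamp is always the sixth comma-separated field. The function name count_dup_prefixes and the temporary sample file are hypothetical.

```shell
# Hypothetical helper: print "<count> <prefix>" for every prefix seen more than once.
# Assumption: the timestamp is always field 6 and no field contains a quoted comma.
count_dup_prefixes() {
  # uniq -c prefixes each distinct line with its count; awk keeps counts above 1.
  cut -d, -f1-5 "$1" | sort | uniq -c | awk '$1 > 1'
}

# Recreate the sample data from the question in a temporary file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
2501,3,0,1,0,1457695800
2501,3,0,1,0,1457789340
2502,3,0,0,0,1457695800
2502,3,0,0,0,1457789340
2503,3,0,0,0,1457789340
2504,3,0,0,0,1457789340
2505,3,0,0,0,1457789340
EOF

count_dup_prefixes "$sample"
```

Each output line holds a count followed by the repeated prefix. The amount of leading padding before the count differs between uniq implementations. The output can be redirected to a file in the same way, for example count_dup_prefixes my_file.csv > line_duplicates.txt.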

Hope this helps.

