an important and common task in bioinformatics is renaming files. renaming a single file in linux is simple, but renaming multiple files is often more arduous -- though it doesn't have to be. here, we will review four methods for renaming files, each one more powerful than the preceding one.
suppose, for example, that we have a set of files from the illumina platform in the fastq format. the latest version of illumina's software, casava v1.8.2, converts per-cycle basecall files into fastq files with the following naming scheme:
<sample name>_<barcode sequence>_L<lane[0-7]{3}>_R<read number>_<set number[0-9]{3}>.fastq.gz
so, for example, the following is an illumina-valid fastq file name:
SA1_ATCACG_L002_R1_001.fastq.gz
note that a single illumina fastq file is, by default, divided into a set of files, each of which contains no more than 4 M reads. the different files are distinguished by a 0-padded 3-digit set number ([0-9]{3}). accordingly, a single sample may have numerous fastq files belonging to it, like so:
sample(SA1) = { X : SA1_ATCACG_L002_R1_001.fastq.gz, SA1_ATCACG_L002_R1_002.fastq.gz,
SA1_ATCACG_L002_R1_003.fastq.gz, SA1_ATCACG_L002_R1_004.fastq.gz,
SA1_ATCACG_L002_R1_005.fastq.gz, SA1_ATCACG_L002_R1_006.fastq.gz,
SA1_ATCACG_L002_R1_007.fastq.gz, SA1_ATCACG_L002_R1_008.fastq.gz,
SA1_ATCACG_L002_R1_009.fastq.gz, SA1_ATCACG_L002_R1_010.fastq.gz,
SA1_ATCACG_L002_R1_011.fastq.gz, SA1_ATCACG_L002_R1_012.fastq.gz,
SA1_ATCACG_L002_R1_013.fastq.gz, SA1_ATCACG_L002_R1_014.fastq.gz,
SA1_ATCACG_L002_R1_015.fastq.gz, SA1_ATCACG_L002_R1_016.fastq.gz,
SA1_ATCACG_L002_R1_017.fastq.gz, SA1_ATCACG_L002_R1_018.fastq.gz,
SA1_ATCACG_L002_R1_019.fastq.gz, SA1_ATCACG_L002_R1_020.fastq.gz,
SA1_ATCACG_L002_R1_021.fastq.gz, SA1_ATCACG_L002_R1_022.fastq.gz,
SA1_ATCACG_L002_R1_023.fastq.gz, SA1_ATCACG_L002_R1_024.fastq.gz,
SA1_ATCACG_L002_R1_025.fastq.gz, SA1_ATCACG_L002_R1_026.fastq.gz,
SA1_ATCACG_L002_R1_027.fastq.gz, SA1_ATCACG_L002_R1_028.fastq.gz,
SA1_ATCACG_L002_R1_029.fastq.gz, SA1_ATCACG_L002_R1_030.fastq.gz }
now, further suppose that we would like to replace every instance of 'SA1' with 'NA10831' (say, because when we submitted the samples for sequencing, we provided sample names in code to mask the true identity of the sample, for privacy reasons).
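as an aside, the naming scheme described above can be checked mechanically. the following is a minimal sketch, not part of any illumina tool: 'parse_fastq_name' is a hypothetical helper, and the regular expression is an assumption based only on the scheme shown in this post (it expects a barcode made of the letters A, C, G, T or N and a 3-digit lane number).

```shell
#!/usr/bin/env bash
# parse_fastq_name: hypothetical helper that splits a casava-style
# file name into its five fields and prints them on one line.
parse_fastq_name() {
  local name=$1
  # assumption: barcode is [ACGTN]+, lane and set number are 3 digits.
  # the sample name is matched greedily, so it may contain underscores.
  local re='^(.+)_([ACGTN]+)_L([0-9]{3})_R([0-9]+)_([0-9]{3})\.fastq\.gz$'
  if [[ $name =~ $re ]]; then
    printf 'sample=%s barcode=%s lane=%s read=%s set=%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" \
      "${BASH_REMATCH[4]}" "${BASH_REMATCH[5]}"
  else
    return 1   # name does not follow the scheme
  fi
}

parse_fastq_name SA1_ATCACG_L002_R1_001.fastq.gz
# prints: sample=SA1 barcode=ATCACG lane=002 read=1 set=001
```

knowing which field is which makes it clearer what each of the renaming methods is expected to change: only the first field, the sample name.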
before we delve into the four methods of renaming files: if you would like to try them on the hypothetical set of fastq files above, you can create mock files with the 'touch' command by executing the following code in a bash shell:
mkdir renaming_files ; cd renaming_files
for l in $(seq 1 9) ; do touch SA1_ATCACG_L002_R1_00$l.fastq.gz ; done
for l in $(seq 10 30) ; do touch SA1_ATCACG_L002_R1_0$l.fastq.gz ; done
don't worry -- the files will be empty!
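the two loops above are needed only because the set number is 0-padded. as a hedged alternative, the padding can be left to 'printf'. this is a sketch under assumptions: 'make_mock_fastq' is a hypothetical helper name, and the files are created in a temporary directory rather than in 'renaming_files'.

```shell
#!/usr/bin/env bash
# make_mock_fastq DIR COUNT: hypothetical helper that creates COUNT
# empty mock fastq files in DIR, numbered 001, 002, ...
make_mock_fastq() {
  local dir=$1 count=$2 i
  mkdir -p "$dir"
  for ((i = 1; i <= count; i++)); do
    # %03d pads the set number with zeros to three digits
    touch "$dir/$(printf 'SA1_ATCACG_L002_R1_%03d.fastq.gz' "$i")"
  done
}

workdir=$(mktemp -d)          # scratch directory, safe to delete afterwards
make_mock_fastq "$workdir" 30
ls "$workdir" | wc -l         # should report 30 files
```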
method 1: use the 'mv' command
the 'mv' command can be used to rename files.
syntax: mv [options] oldname newname
to rename the first item in our sample set, we would do the following:
mv SA1_ATCACG_L002_R1_001.fastq.gz NA10831_ATCACG_L002_R1_001.fastq.gz
in order to rename all of the files in our sample set, we would have to type the 'mv' command 30 (!) times:
mv SA1_ATCACG_L002_R1_001.fastq.gz NA10831_ATCACG_L002_R1_001.fastq.gz
mv SA1_ATCACG_L002_R1_002.fastq.gz NA10831_ATCACG_L002_R1_002.fastq.gz
mv SA1_ATCACG_L002_R1_003.fastq.gz NA10831_ATCACG_L002_R1_003.fastq.gz
mv SA1_ATCACG_L002_R1_004.fastq.gz NA10831_ATCACG_L002_R1_004.fastq.gz
mv SA1_ATCACG_L002_R1_005.fastq.gz NA10831_ATCACG_L002_R1_005.fastq.gz
mv SA1_ATCACG_L002_R1_006.fastq.gz NA10831_ATCACG_L002_R1_006.fastq.gz
mv SA1_ATCACG_L002_R1_007.fastq.gz NA10831_ATCACG_L002_R1_007.fastq.gz
mv SA1_ATCACG_L002_R1_008.fastq.gz NA10831_ATCACG_L002_R1_008.fastq.gz
mv SA1_ATCACG_L002_R1_009.fastq.gz NA10831_ATCACG_L002_R1_009.fastq.gz
mv SA1_ATCACG_L002_R1_010.fastq.gz NA10831_ATCACG_L002_R1_010.fastq.gz
mv SA1_ATCACG_L002_R1_011.fastq.gz NA10831_ATCACG_L002_R1_011.fastq.gz
mv SA1_ATCACG_L002_R1_012.fastq.gz NA10831_ATCACG_L002_R1_012.fastq.gz
mv SA1_ATCACG_L002_R1_013.fastq.gz NA10831_ATCACG_L002_R1_013.fastq.gz
mv SA1_ATCACG_L002_R1_014.fastq.gz NA10831_ATCACG_L002_R1_014.fastq.gz
mv SA1_ATCACG_L002_R1_015.fastq.gz NA10831_ATCACG_L002_R1_015.fastq.gz
mv SA1_ATCACG_L002_R1_016.fastq.gz NA10831_ATCACG_L002_R1_016.fastq.gz
mv SA1_ATCACG_L002_R1_017.fastq.gz NA10831_ATCACG_L002_R1_017.fastq.gz
mv SA1_ATCACG_L002_R1_018.fastq.gz NA10831_ATCACG_L002_R1_018.fastq.gz
mv SA1_ATCACG_L002_R1_019.fastq.gz NA10831_ATCACG_L002_R1_019.fastq.gz
mv SA1_ATCACG_L002_R1_020.fastq.gz NA10831_ATCACG_L002_R1_020.fastq.gz
mv SA1_ATCACG_L002_R1_021.fastq.gz NA10831_ATCACG_L002_R1_021.fastq.gz
mv SA1_ATCACG_L002_R1_022.fastq.gz NA10831_ATCACG_L002_R1_022.fastq.gz
mv SA1_ATCACG_L002_R1_023.fastq.gz NA10831_ATCACG_L002_R1_023.fastq.gz
mv SA1_ATCACG_L002_R1_024.fastq.gz NA10831_ATCACG_L002_R1_024.fastq.gz
mv SA1_ATCACG_L002_R1_025.fastq.gz NA10831_ATCACG_L002_R1_025.fastq.gz
mv SA1_ATCACG_L002_R1_026.fastq.gz NA10831_ATCACG_L002_R1_026.fastq.gz
mv SA1_ATCACG_L002_R1_027.fastq.gz NA10831_ATCACG_L002_R1_027.fastq.gz
mv SA1_ATCACG_L002_R1_028.fastq.gz NA10831_ATCACG_L002_R1_028.fastq.gz
mv SA1_ATCACG_L002_R1_029.fastq.gz NA10831_ATCACG_L002_R1_029.fastq.gz
mv SA1_ATCACG_L002_R1_030.fastq.gz NA10831_ATCACG_L002_R1_030.fastq.gz
not only is this method tedious, but it is also more error-prone than alternative methods.
method 2: write a bash script
we can wrap the 'mv' command inside of a bash script:
for file in *.fastq.gz ; do
  # keep everything after the first underscore, i.e. drop the sample name
  suffix=$(echo "$file" | cut -d "_" -f2-)
  # quote the names so that unusual characters in them cannot break 'mv'
  mv "$file" "NA10831_$suffix"
done
the above script first finds every file ending in the '.fastq.gz' extension, and then uses a for loop to iterate through each item. it extracts the name of the file without the 'SA1' prefix by using the 'cut' command, and stores that result in a variable named 'suffix'. it finishes by using the 'mv' command to prepend the new name 'NA10831' to the stored suffix. note two things: 1) there is more than one bash script that would have accomplished this same task, and 2) this script splits the file name by underscore ("_") delimiters, which means that had the sample name 'SA1' contained an underscore, this particular script would not have worked properly. although this method is much less tedious than the previous method, there are easier methods yet.
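to illustrate the point that more than one bash script would do the job, here is one possible variant that does not split on underscores. it is a sketch, not the script from this post: 'rename_prefix' is a hypothetical helper, and it relies on bash parameter expansion to strip the old sample name from the front of each file name, so a sample name that itself contains an underscore is handled too. it also refuses to overwrite an existing file.

```shell
#!/usr/bin/env bash
# rename_prefix OLD NEW [DIR]: hypothetical helper that replaces the
# leading sample name OLD with NEW for every OLD_*.fastq.gz in DIR.
rename_prefix() {
  local old=$1 new=$2 dir=${3:-.} file base target
  for file in "$dir"/"$old"_*.fastq.gz; do
    [ -e "$file" ] || continue        # the glob matched nothing
    base=${file##*/}                  # strip the directory part
    target=$dir/$new${base#"$old"}    # swap only the leading prefix
    if [ -e "$target" ]; then
      echo "skipping $base: $target already exists" >&2
      continue
    fi
    mv -- "$file" "$target"
  done
}

# usage, from inside the directory that holds the files:
#   rename_prefix SA1 NA10831
```

because the old name is removed as a literal prefix rather than as "the first underscore-delimited field", a call such as 'rename_prefix SA_1 NA10831' should behave the same way.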
method 3: the rename command
the 'rename' command is a c program capable of renaming a set of files with a single command. 'rename' is installed in redhat distributions; ubuntu distributions of linux come with an alternative version of 'rename'. if your version of linux does not have this 'rename' command, you can download it and compile it from source from the util-linux package.
syntax: rename pattern replacement file...
this 'rename' command replaces the first occurrence of pattern with replacement in the name of each file given. to use the 'rename' command to rename all of the files in our sample set:
rename SA1 NA10831 *.fastq.gz
which replaces the first occurrence of 'SA1' with 'NA10831' in all files ending in '.fastq.gz'. this command is limited, however, in what it can do.
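what "first occurrence" means can be seen without touching any files. the following sketch only imitates that behaviour with bash parameter expansion; it is not the 'rename' program itself. 'replace_first' is a hypothetical helper, and the second file name is made up so that 'SA1' appears twice.

```shell
#!/usr/bin/env bash
# replace_first NAME PATTERN REPLACEMENT: hypothetical helper that
# prints NAME with only the first literal PATTERN replaced.
replace_first() {
  local name=$1 pattern=$2 replacement=$3 result
  # ${var/pat/rep} replaces the first match only;
  # ${var//pat/rep} would replace every match.
  result=${name/"$pattern"/$replacement}
  printf '%s\n' "$result"
}

replace_first SA1_ATCACG_L002_R1_001.fastq.gz SA1 NA10831
# prints: NA10831_ATCACG_L002_R1_001.fastq.gz
replace_first SA1_SA1_L002_R1_001.fastq.gz SA1 NA10831
# prints: NA10831_SA1_L002_R1_001.fastq.gz
```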
method 4: an alternative rename command
there is an alternative 'rename' command, which is a perl script, and can be seen as an extended version of the command from method 3; it's more powerful, among other reasons, in that it can rename multiple files using perl regular expressions.
to get started, download the perl script and make it executable:
wget http://plasmasturm.org/code/rename/rename
chmod +x rename
syntax: rename [switches|transforms] file...
syntax: rename [perlexpr] file...
since this script is feature rich, there is more than one way to accomplish the same task. one method of using this command on our sample set of files is with a regular expression:
/path/to/rename 's/SA1/NA10831/' *.fastq.gz
which, like the 'rename' from method 3, replaces just the first occurrence of 'SA1' with 'NA10831' in all files ending in '.fastq.gz'. a different way of accomplishing the same thing would have been by using the command's built-in switches:
/path/to/rename --subst SA1 NA10831 *.fastq.gz
this method uses the '--subst' switch, which reduces this rename command to the syntax of method 3. method 4 is extremely powerful (removing extensions, for example, is as easy as using the '--remove-extension' switch), and there are a variety of use cases where this script can be handy. the reader is encouraged to thoroughly read through the 'rename' manual. more use cases will be posted on this blog as they arise.
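since a regular expression can match more than intended, it may be worth previewing a substitution before renaming anything. the sketch below is one way to do that and is independent of the perl script: 'preview_rename' is a hypothetical helper, and it uses 'sed -E' (posix extended regular expressions, not perl ones) purely to print what would change. the expression is anchored with '^' so that 'SA1' is replaced only at the start of a name.

```shell
#!/usr/bin/env bash
# preview_rename EXPR FILE...: hypothetical helper that prints
# "old -> new" for each name EXPR would change; it renames nothing.
preview_rename() {
  local expr=$1 old new
  shift
  for old in "$@"; do
    new=$(printf '%s\n' "$old" | sed -E "$expr")
    # report only the names that the expression actually changes
    [ "$old" = "$new" ] || printf '%s -> %s\n' "$old" "$new"
  done
}

preview_rename 's/^SA1_/NA10831_/' \
  SA1_ATCACG_L002_R1_001.fastq.gz XSA1_ATCACG_L002_R1_001.fastq.gz
# prints: SA1_ATCACG_L002_R1_001.fastq.gz -> NA10831_ATCACG_L002_R1_001.fastq.gz
```

the second name is left alone because of the anchor; with the unanchored 's/SA1/NA10831/' it would have been changed as well.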