OK. No probs.
The way we can identify the word to remove is by the ">Vibrio_cholerae_" - which is common to all of them.
So we just need to come up with a regular expression that will describe that.
We want to take everything from ">Vibrio_cholerae_" up to the next space character.
So I think this regex will work:
>Vibrio_cholerae_[^ ]*
The [^ ]* part matches any run of characters that aren't spaces, so the match stops just before the next space.
And we'll replace that pattern with "> " (a > followed by a space), to keep the > separate from whatever other data is on that first line (if any!).
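To see what that substitution does before touching any real files, here's a quick sketch that runs it on a single made-up header line. The header text (strainX123 and everything after it) is purely hypothetical, so swap in something that looks like your own data.

```shell
# Hypothetical FASTA header, invented purely for illustration.
sample_header='>Vibrio_cholerae_strainX123 chromosome 1'

# Apply the same substitution to that one line and capture the result.
result=$(printf '%s\n' "$sample_header" | sed 's/>Vibrio_cholerae_[^ ]*/> /g')

# Show the edited header.
printf '%s\n' "$result"
```

Note that the match stops just before the original space, so that space stays in place after the "> " we insert. If that isn't how you want the line to look, it's a small tweak to the replacement text.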
We'll use sed to perform the edits.
I'm pretty certain it will work. But to make sure, we'll test that the regex is correct, without actually editing any files.
So pick one of your .fasta files and carefully type (or copy/paste) the following command:
Bash:
sed 's/>Vibrio_cholerae_[^ ]*/> /g' filename.fasta | head -n 2
Where filename.fasta is the actual name of the .fasta file you want to test with.
That will attempt to replace ">Vibrio_cholerae_....." in the first line with "> " and will show you the first two lines in the processed file.
If the output of the first line doesn't look right, let me know what you see and how you actually want it to appear and I'll see if I can tweak things.
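If you'd like to check every file rather than just one, something along these lines might help. It's only a sketch: preview_headers is a name I've made up, not a standard command, and it assumes the header is the first line of each file. It prints the edited first line of each .fasta file without changing anything on disk.

```shell
# preview_headers is a hypothetical helper name, not a standard command.
# It prints the edited first line of every .fasta file in a directory
# without modifying anything on disk.
preview_headers() {
    dir=${1:-.}
    for f in "$dir"/*.fasta; do
        # Skip the unexpanded pattern if no .fasta files exist.
        [ -e "$f" ] || continue
        printf '%s: ' "$f"
        sed 's/>Vibrio_cholerae_[^ ]*/> /g' "$f" | head -n 1
    done
}

# Usage: preview the .fasta files in the current directory.
preview_headers .
```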
But if you're happy with what you see, choose ONE of the following commands to edit all of the .fasta files in one go.
But before doing so, please read the rest of my post!
Bash:
sed -i.bak -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta
OR
Bash:
sed -i -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta
The -s flag tells sed to treat each file as a separate file, rather than as one massive, single stream of data.
The -i flag tells sed to edit the files in-place - so we aren't going to redirect the output to a new file, we're going to overwrite the original files themselves.
If the -i flag has a file extension specified immediately after it, then sed creates a backup of the original files with that extension appended to the end.
The first sed command edits all of the .fasta files in place AND creates backup copies of the original files, with an appended .bak extension.
The second command edits all .fasta files in place, but does NOT create a backup.
The first option is safer, because it keeps a backup of your original data files, whereas the second one does not. So the first option is probably the best one to take.
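If you want to watch the backup behaviour in action first, here's a rough sketch that runs the first command in a throwaway directory. The file names and contents are invented for the demo, and it assumes GNU sed (the usual one on Linux), since -s is a GNU extension.

```shell
# Work in a throwaway directory so no real data is touched.
scratch=$(mktemp -d)

# Two tiny made-up FASTA files (hypothetical names and contents).
printf '>Vibrio_cholerae_sampleA contig 1\nACGT\n' > "$scratch/a.fasta"
printf '>Vibrio_cholerae_sampleB contig 1\nTTGA\n' > "$scratch/b.fasta"

# Same command as the first option: edit in place, keep .bak backups.
# Assumes GNU sed; -s is a GNU extension and may not exist elsewhere.
(cd "$scratch" && sed -i.bak -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta)

# List the results: each .fasta should now have a .fasta.bak next to it.
ls "$scratch"

# Compare an edited header with the one preserved in its backup.
head -n 1 "$scratch/a.fasta" "$scratch/a.fasta.bak"
```

When you're done looking, the scratch directory can simply be deleted.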
If you don't need the backup files afterwards, you could always remove them with rm, e.g.
rm *.fasta.bak
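Before deleting the backups, it may be worth confirming that only the header lines changed. One possible way, sketched below, is to diff each backup against its edited file. check_backups is a made-up helper name, and it assumes the backups sit next to the edited files with the .bak extension.

```shell
# check_backups is a hypothetical helper name, not a standard command.
# For each backup it prints the lines that differ from the edited file,
# so you can confirm that only header lines were changed.
check_backups() {
    dir=${1:-.}
    for bak in "$dir"/*.fasta.bak; do
        # Skip the unexpanded pattern if no backups exist.
        [ -e "$bak" ] || continue
        edited=${bak%.bak}
        printf '== %s\n' "$edited"
        # diff exits non-zero when the files differ, which is expected here.
        diff "$bak" "$edited" || true
    done
}

# Usage: check the backups in the current directory.
check_backups .
```

Lines starting with < come from the backup and lines starting with > come from the edited file, so each file should show just its header pair.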
Now you've read the whole post, go ahead and pick one of the final sed commands to edit the files.
Anyway, hopefully that helps!