How to count characters in the first line of a text file

jhcuarta

Hi

I was wondering if you could help me figure out how to count the characters in the first line of several text files contained in a directory, and print the count per file.

The reason I ask this newbie question is that I'm using a tool that throws an error when the first line of a file has more than 38 characters. It would be awesome if the pipeline you help me with could print only those files that match this condition. Thanks in advance.

Best regards
 
Don't bump, you just created the topic 3 minutes ago. Be patient...
 
This one-line pipeline has sed print the first line of a file called "filename", pipes it through a second sed to remove the spaces, then pipes it through wc -m, which counts the characters. Because wc -m also counts the newline at the end of the line, the result is piped through awk to subtract 1, giving the number of non-space characters.
Code:
sed -n '1p' filename |sed -n 's/ //gp' |wc -m |awk '{print($1-1)}'
 
Moving this to Command Line

Wizard
 
Hope the above command by @osprey works for you. I would just add that some programs count spaces as characters too, so your limit of 38 would be reduced by the number of spaces as well.
 
Hi
The command line doesn't work.
It returns -1 both for several files at a time and for a single file:

sed -n '1p' *.fasta |sed -n 's/ //gp' |wc -m |awk '{print($1-1)}'
-1

sed -n '1p' 1Mo_UM.fasta |sed -n 's/ //gp' |wc -m |awk '{print($1-1)}'
-1
 
That's because there are no spaces in the line. The second sed command uses -n together with the p flag, so it prints a line only when a substitution was actually made; with no spaces to remove it prints nothing, wc -m reports 0, and awk subtracts 1 to give -1.

The explanation of what that line achieved was clear in post #4, but the nature of the text files it was being applied to wasn't defined, so the line was an example of how to proceed rather than a final answer applicable to all possible cases. If the first line of the text files contains no spaces, one option is a conditional statement that first detects whether there are spaces, deletes them if there are, and skips the relevant sed command if there are none. With such a conditional, the code could perhaps be written more clearly as a script rather than a one-liner. I'll leave it to the interested observer to make that relatively simple modification.
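The -1 can also be avoided without a conditional. A minimal sketch, assuming the goal is still to count the non-space characters of line 1: dropping -n and the p flag from the second sed makes it pass the line through whether or not a substitution happened. The function name count_nonspace_first_line is hypothetical, used here only for illustration.

```shell
# Hypothetical helper: count the non-space characters in line 1 of a file.
count_nonspace_first_line() {
    # First sed prints line 1 only.
    # Second sed deletes spaces and, without -n and p, always prints the line.
    # wc -m counts the characters plus the newline, so awk subtracts 1.
    sed -n '1p' "$1" | sed 's/ //g' | wc -m | awk '{ print $1 - 1 }'
}
```

Called as count_nonspace_first_line 1Mo_UM.fasta, it should print the length of the header line rather than -1. An empty file still yields -1, because there is no newline to subtract.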
 
I would just add that some programs count spaces as characters too, so your limit of 38 would be reduced by the number of spaces as well.

Entirely pointless:

I believe they all do, at least under the hood? They all count it as a character, though they may not display that information anywhere?

Traditionally, each character is 8 bits (one byte). A blank space needs to be enumerated in this. A space is a character - just like a new line is a character (even if you don't normally see it).

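Though, often, when you write a new line in text as the escape sequence "\n", that takes two characters (a backslash and an n), so 16 bits or 2 bytes, even though the newline it stands for is a single byte. (Even more off-topic.)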

Of course, with Unicode and UTF-8 now leading the way, one character can take more than one byte (from 1 to 4 bytes in UTF-8). This is because we want to type more than just the US alphabet, the numbers, and a few basic symbols. We need more than the 128 possible characters from back in ye olden days when ASCII ruled the planet.

But, point being, a space is a character. It's just one that's rendered as a blank space. So, under the hood, it's counting it as a character. It may then ignore the spaces so that you can get an accurate word count, for example.

And then you have the modern-ish Emoji, which I believe are 3- and 4-byte characters in UTF-8. That isn't important, it's just interesting. They really just represent a number, and your device renders them accordingly if it can recognize them. Apple's Emojis will render differently (same theme/idea, just a slightly different picture) than Microsoft's. It depends on the device you're using. Like so:

✌️
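For the curious, the byte counts mentioned above can be checked from the shell. This is a minimal sketch, assuming UTF-8 encoding; byte_count is a hypothetical helper name. It uses wc -c, which counts bytes, whereas wc -m counts characters according to the current locale.

```shell
# Hypothetical helper: number of bytes in a string, independent of locale.
byte_count() {
    # printf '%s' writes the string without a trailing newline;
    # the final tr removes the padding some wc implementations add.
    printf '%s' "$1" | wc -c | tr -d ' '
}

# The octal escapes below are the UTF-8 encodings of each character.
byte_count ' '                              # a space
byte_count "$(printf '\303\251')"           # é, U+00E9
byte_count "$(printf '\342\234\214')"       # ✌, U+270C
byte_count "$(printf '\360\237\230\200')"   # 😀, U+1F600
```

The four calls print 1, 2, 3 and 4 respectively, so a space is one byte like any other ASCII character, while the two emoji take 3 and 4 bytes.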
 
Hi
Indeed, there are no spaces in the first line.
All the first lines look like this:
file 1: >Vibrio_cholerae_strain_1Mo
file 2: >Vibrio_cholerae_strain_39Ki
file 3: >Vibrio_cholerae_strain_107V1216
and so on....

I need to determine which files contain more than 38 characters in the first line, counting the symbol ">".
 
Having specific targets such as these lines makes things much clearer and will enable you to be more precise. With this new info, my first suggestion is to delete the second sed command from the code in post #4, since there are no spaces. That should output the number of characters in the lines. Then you can add a conditional statement to show which lines contain more than 38 characters, including the symbol ">". There are a few ways of doing it. I'll have a think about it; meanwhile you may be able to get something together.
 
@osprey - @jhcuarta wants to know which files have more than 38 characters in their first line. Spaces and tabs actually count as characters. I don't know why you're going out of your way to avoid them. I didn't see anything in their posts about excluding whitespace characters.

If they're dealing with a lot of files, the best bet would be to use a for loop to iterate through the filenames and then use head and wc on each file, like this:
Bash:
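for file in ./*.fasta; do
    # Count the characters in line 1; tr strips the newline so wc -m does not count it
    numChars=$( head -n 1 "$file" | tr -d '\n' | wc -m )
    # -gt compares numerically; > inside [[ ]] would compare the values as strings
    if [[ $numChars -gt 38 ]]; then
      echo "$file has $numChars characters in line 1"
    fi
done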
The above lists all files that have more than 38 characters in the first line, assuming that all the files are in the current working directory.

The above code:
1. Sets up a for loop to iterate through all .fasta files in the current directory.
2. Counts the number of characters in the first line of each file.
3. If the character count is greater than 38, displays the filename and the number of characters in the first line.
Simple!

I've put the above into more of a script format, because it's easier to read and understand.
But as a one-liner, that would look something like this:
Bash:
for file in ./*.fasta ; do numChars=$(head -n 1 "$file" | tr -d '\n' | wc -m) ; if [[ $numChars -gt 38 ]] ; then echo "$file has $numChars characters in line 1" ; fi ; done
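If the files are not in the current working directory, the same loop can be wrapped in a function that takes the directory as an argument. This is only a sketch under stated assumptions: the name long_first_lines, its arguments, and the default limit of 38 are hypothetical choices for illustration, and the body sticks to POSIX shell so it runs in bash as well as sh.

```shell
# Hypothetical helper: list .fasta files in DIR whose first line exceeds LIMIT characters.
# Usage: long_first_lines DIR [LIMIT]   (LIMIT defaults to 38)
# The body runs in a subshell, so its variables do not leak into the caller.
long_first_lines() (
    dir=$1
    limit=${2:-38}
    for file in "$dir"/*.fasta; do
        # When nothing matches, the pattern stays literal; skip it.
        [ -e "$file" ] || continue
        # Strip the newline before counting; the last tr removes wc padding.
        numChars=$(head -n 1 "$file" | tr -d '\n' | wc -m | tr -d ' ')
        # -gt performs a numeric comparison.
        if [ "$numChars" -gt "$limit" ]; then
            echo "$file has $numChars characters in line 1"
        fi
    done
)
```

For example, long_first_lines . checks the current directory with the default limit, and long_first_lines /some/dir 50 checks another directory with a limit of 50.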
 
I've just seen the newer posts - they must have come in whilst I was typing!!
I'll make some quick edits to the above, to take the .fasta extension into account.
 
Hi
I removed the second sed part; nevertheless, this is what I got:

sed -n '1p' *.fasta |wc -m |awk '{print($1-1)}'
32

I need to do it for several files, that's why I use *.fasta, but it seems it only counted one file and didn't display which one.
 
Take a look at my proposed solution a couple of posts above.

Edit: If you only want the file-names of the files that contain more than 38 characters in their first line, I can condense the solution down a little further.

Bash:
for file in ./*.fasta; do
  # tr strips the newline; -gt compares numerically (> inside [[ ]] compares strings)
  if [[ $(head -n 1 "$file" | tr -d '\n' | wc -m) -gt 38 ]]; then
    echo "$file"
  fi
done

Which would make a slightly shorter one-liner:
Bash:
for file in ./*.fasta; do if [[ $(head -n 1 "$file" | tr -d '\n' | wc -m) -gt 38 ]]; then echo "$file"; fi; done
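As an aside on the earlier attempt, sed -n '1p' *.fasta printed a single count because sed treats all the files named on its command line as one continuous stream, so '1p' matches only the very first line overall. awk keeps a per-file line number in FNR, which allows a loop-free alternative. The following is a sketch, not a definitive solution: first_line_over is a hypothetical name, and length($0) counts characters or bytes depending on the awk implementation and locale, which makes no difference for plain ASCII headers like the ones shown in this thread.

```shell
# Hypothetical helper: print "file: length" for each file whose first line
# is longer than LIMIT characters.
# Usage: first_line_over LIMIT FILE...
first_line_over() (
    limit=$1
    shift
    # Without file arguments awk would wait on standard input, so stop here.
    [ "$#" -gt 0 ] || exit 0
    # FNR == 1 is true on the first line of every input file.
    awk -v limit="$limit" \
        'FNR == 1 && length($0) > limit + 0 { print FILENAME ": " length($0) }' "$@"
)
```

It would be called as first_line_over 38 ./*.fasta. Unlike head, awk still reads each file to the end, so for very large files the for loop above may be quicker.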
 