Other text manipulation tools at your disposal
In your day-to-day work with a Linux server, you may find yourself needing
to view, manipulate and change text data without "editing" files, per se.
In this section we will look at various tools that come with standard
Linux distributions for dealing with text data.
Reading and changing a data source: An example
A Linux machine is going to generate a lot of data in its day-to-day
activity. Most of this is going to be very important for your management
of the machine. Let's take an example. If you have the
Apache web server installed, it constantly pumps data to
two important files, access and
errors.
The access file logs every visit to your web site. Every time
somebody (or something) visits your website, Apache creates an entry in this
file with the date, hour and information about the file that was requested.
The errors file is there to record errors such as
server misconfiguration or faulty CGI scripts.
You can find good software out there for
processing your access file. It will tell you how many
hits you got on your webpage. It will even automatically look for
requests for missing or non-existent web pages, known as 404s.
Some of this software is very good at generating informative reports of
traffic, warnings of missing pages and other website statistics, but it
is aimed at website administrators in general and as such presents its
data in a general way. At times though, you may want or need to
do a bit of hands-on evaluation of the data in the file, to go beyond
what these web stats packages offer.
Let's say you wanted to know how many machines infected with the
Windows viruses NIMDA and CodeRed, which try to exploit flaws in Microsoft's
Internet Information Server (IIS), have tried to infect your Apache
server (in vain, of course). A general web statistics package will normally
show you only the data for missing pages (404s) which these attacks will generate.
Using standard tools that all Linux distributions provide, you'll
be able to get more specific information from the access
file. For example, you could use grep on the
file like so:
grep "\(root.exe\|cmd.exe\)" access
This will display all the attempts in your terminal window. That, in and of itself
might not be particularly interesting so let's redirect it to a file where we
might look at the data a little more closely.
grep "\(root.exe\|cmd.exe\|default.ida\)" access > nimda_attempts
Now I can take a look at the file for more specific information. If I were curious
to see if anybody on a certain netblock were infected with this worm, I could
'grep' the file I just generated.
grep '195.199' nimda_attempts
and we'd get something like this:
195.199.X.X - - [15/Jul/2003:13:39:00 +0200] "GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir" 404 -
195.199.X.X - - [15/Jul/2003:15:33:54 +0200] "GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir" 404 -
These are real hits on one of my webservers. I've X'd out the complete
IPs so as not to embarrass these administrators (though thoroughly
embarrassed they should be! A patch has been available for these since
2001).
Well, let's just hypothetically say that you wanted
to embarrass them. You could extract the IPs from the file and post
them on your "Shame on you for not patching your IIS" website. First, let's
see how many virus hits we have. Let's pass the output of grep to 'wc'
and count the lines.
grep "\(root.exe\|cmd.exe\|default.ida\)" access | wc -l
That's a lot of infected sites! Well, we're not going to cut and paste
every one of the IP addresses from our text editor. That would be
too tedious. This calls for a little text manipulation. That means
we can cut, but there just won't be any
paste. Enter the tool cut.
cut is a command line tool for removing text
from a file without actually having to open it. In this example, we'll
just get the IP addresses from the file
cat nimda_attempts | cut -c 1-16 > infected_hosts
We'd get just the IP addresses of infected hosts in the file. We could
even go one step further: use the tool sort to
put the IP numbers in order
cat nimda_attempts | cut -c 1-16 | sort > infected_hosts
The option we use with cut is -c 1-16, which
keeps only the first 16 characters of each line and discards the rest. That
gives us enough to get every possible IP address. I opened the
file and I saw that we still have some dashes there after some of
the shorter IP addresses. We don't have to worry too much about
removing them either. We have another great tool named
sed. We can now take the file infected_hosts
and get rid of all those dashes. We can even do it in one fell swoop.
cat nimda_attempts | cut -c 1-16 | sort > infected_hosts ; cat infected_hosts | sed 's/ -//g' > l00ser_admins
You'll see that we had to use some options with sed. The 's' option means
to substitute. So we substitute any dash with a space
in front of it (/ -/) with nothing (//). The 'g' option means
globally (i.e., every occurrence on each line, not just the first). We pass
that on to our final file, l00ser_admins.
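The fixed-width cut above leaves stray dashes because IP addresses vary in length. What follows is a minimal sketch of an alternative, assuming the usual Apache log layout in which the client address is the first space-separated field. The sample log lines, the addresses and the temporary directory are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample of an Apache access file (documentation-only addresses).
cat > access <<'EOF'
192.0.2.10 - - [15/Jul/2003:13:39:00 +0200] "GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir" 404 -
198.51.100.7 - - [15/Jul/2003:13:40:12 +0200] "GET /index.html" 200 1043
192.0.2.10 - - [15/Jul/2003:15:33:54 +0200] "GET /scripts/root.exe?/c+dir" 404 -
203.0.113.5 - - [16/Jul/2003:09:01:02 +0200] "GET /default.ida?XXXX" 404 -
EOF

# Keep only the worm probes, take field 1 (the address),
# then sort and list each host only once.
grep -E 'root\.exe|cmd\.exe|default\.ida' access \
  | cut -d' ' -f1 \
  | sort -u > infected_hosts

cat infected_hosts
```

Cutting on a delimiter rather than a character count means there are no dashes left to clean up afterwards, at the cost of depending on the log format having the address in the first field.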
The beauty of this is that we've never opened up the file in a text editor. Even
if we had chosen to open our access file in a text
editor like Emacs, we couldn't have done all of this handling of the data
at once. You'll find that the ability to do things like this will save you
lots of time, so let's look at the various tools one by one so you can
get a better feel for how they work.
I can grep that!
Any information that you might want to get from a file is at your fingertips,
thanks to grep. Despite rumors that grep is Vulcan for
"find this word", the name actually comes from the ed editor command
g/re/p: globally search for a regular expression and print the matching lines. And now that you
know that, I'm sure you feel a lot better. Actually the more proficient you
are at dealing with regular expressions, the better you will be at systems
administration. Let's look at another example from our Apache
access file.
Let's say you're interested in looking at the number of visits
to your website for the month of June. You would do a simple grep on
the access file, like so:
grep -c 'Jun/2003' access
This will give you the number of files requested during the month of June. If
you were interested in this month's requests, we could add this little twist:
grep -c `date +%b` access
What we've done basically is put a command in the way, so to
speak. Grep will take the output for that command, which is the name
of the month (%b option in date will give you the
abbreviation of the name of the month) and look for it in the file.
Commands whose output is to be used as an argument of another command must be placed
inside the backward or grave accent marks (`command`), or
else they won't work. Modern shells also accept the form $(command).
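As a minimal sketch of this kind of command substitution, the example below uses the $( ) form, which POSIX shells accept alongside the grave accents and which is easier to nest. The log file is generated on the spot and is purely illustrative.

```shell
workdir=$(mktemp -d)
log="$workdir/access"

# Build a tiny fake log: two hits stamped with today's date, one with an old, invalid month.
today=$(date +%d/%b/%Y)
{
  printf '192.0.2.1 - - [%s:10:00:00 +0000] "GET / HTTP/1.0" 200 100\n' "$today"
  printf '192.0.2.2 - - [%s:11:00:00 +0000] "GET /a HTTP/1.0" 200 100\n' "$today"
  printf '192.0.2.3 - - [01/Xyz/1999:12:00:00 +0000] "GET /b HTTP/1.0" 200 100\n'
} > "$log"

# The shell runs date first and hands its output to grep as the pattern.
hits_today=$(grep -c "$(date +%d/%b)" "$log")
echo "hits today: $hits_today"
```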
You can actually do some pretty cool stuff with your Apache
access file using grep plus the date command. You might want to look for
just today's hits:
grep -c `date +%d/%b` access
Maybe you'd like to see what the hits were like for the same day but
other months:
grep -c `date +%d/` access
Perhaps you were expecting more information? Some Linux distributions will
install special scripts that will run a cron job to compress the Apache
access files. Even if your old files are compressed, say, each month, you can
still grep them for information without having to decompress them.
Grep has a version that can look in files compressed with gzip. It's called,
appropriately, zgrep:
zgrep -c `date +%d/` access_062003.gz
This will look for the hits on the webserver for this same day, but in June in
your gzip'd access file. (unless of course, it's the 31st).
Speaking of gzip, you can actually do the reverse. You can grep files and, based
on the results, create a gzip'd file. For example, if you wanted to create a file
of the hits this month to date, you could do this:
grep `date +%b/` access | gzip -c > access_01-20jul.gz
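The round trip between grep, gzip and zgrep can be tried out safely on a small sample. This is only a sketch: the log lines and file names are invented, and it assumes zgrep is installed (it normally ships with gzip).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical access file with one June hit and two July hits.
cat > access <<'EOF'
192.0.2.1 - - [03/Jun/2003:10:00:00 +0200] "GET / HTTP/1.0" 200 100
192.0.2.2 - - [03/Jul/2003:11:00:00 +0200] "GET /a HTTP/1.0" 200 100
192.0.2.3 - - [20/Jul/2003:12:00:00 +0200] "GET /b HTTP/1.0" 200 100
EOF

# Save only the July hits, compressed on the fly (gzip -c writes to standard output).
grep '/Jul/' access | gzip -c > access_jul.gz

# Count matches inside the compressed file without unpacking it.
july_hits=$(zgrep -c 'Jul/2003' access_jul.gz)
echo "July hits in archive: $july_hits"
```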
There is really no end to what grep will furnish for you, as long as you
have some idea of what you're looking for. The truth is that we've been concentrating
on the Apache files. There are dozens of files that might interest you at
any given moment and all can be "grep'd" to give you important information.
For example, you might be interested in how many messages you have in your
in-box. Again, grep can provide you with that information.
grep -c '^From:' /var/spool/mail/penguin
If you're interested in who the mail is from, then just leave off the
-c option. You should see something like this:
From: fred@tidlywinks.con
From: webmaster@perly.ork
From: "F. Scott Free" <fsfree@freebird.ork>
From: "Sarah Doktorindahaus" <sarah@hotmael.con>
The caret (^) indicates that grep should look for every line that starts
with the expression you want. In this case, the 'From:' header in
standard email messages.
Speaking of mail, let's say that somebody sent you an email yesterday with
the name of a person that you needed to contact and his/her phone number.
You wouldn't even have to open your email client to get this information.
You could just look for numbers formatted like phone numbers in your inbox.
grep '[0-9]\{3\}-[0-9]\{4\}' inbox
Phone numbers are usually formatted as the exchange (555 - 3 numbers) and the
number itself (6677 - 4 numbers), so grep will look for any numbers grouped
in 3 and 4 respectively with a dash (-) between them.
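Here is a small sketch of the same search run against a made-up mailbox. It uses grep's -E flag so the repeat counts need no backslashes, and -o to print only the matching part of the line. The -o flag is not part of POSIX, so treat its availability as an assumption (GNU and BSD grep both have it).

```shell
workdir=$(mktemp -d)
inbox="$workdir/inbox"

# Hypothetical mailbox contents.
cat > "$inbox" <<'EOF'
From: fred@example.com
Subject: contact details
Call Sarah on 555-6677 before noon.
The invoice number is 12-345 and the total was 1200.
EOF

# Three digits, a dash, four digits; only the matching text is printed.
numbers=$(grep -Eo '[0-9]{3}-[0-9]{4}' "$inbox")
echo "$numbers"
```

Note that 12-345 and 1200 are skipped because neither has three digits, a dash and then four digits.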
We used the caret (^) to look for lines that began with certain characters.
We can also look for lines that end with specific
characters with the dollar ($) sign. For example, if we wanted to know
which users on our system use the bash shell,
we would do this:
grep bash$ /etc/passwd
In this case, the shell used is the last thing listed on each line
of the /etc/passwd file, so you'd see something like this as output:
root:x:0:0:root:/root:/bin/bash
mike:x:500:500:mike:/home/mike:/bin/bash
dave:x:501:501:dave:/home/dave:/bin/bash
laura:x:502:502:laura:/home/laura:/bin/bash
jeff:x:503:503:jeff:/home/jeff:/bin/bash
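The two anchors can also be combined in one pattern. The sketch below runs against a made-up passwd-style file rather than the real /etc/passwd, and quotes the pattern so the shell leaves the dollar sign alone.

```shell
workdir=$(mktemp -d)
passwd_sample="$workdir/passwd"

# Hypothetical passwd-style entries; the real file is /etc/passwd.
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
mike:x:500:500:mike:/home/mike:/bin/bash
bashful:x:501:501:not a bash user:/home/bashful:/bin/sh
EOF

# Count the lines that end in "bash"; "bashful" does not count, its shell is /bin/sh.
bash_users=$(grep -c 'bash$' "$passwd_sample")

# ^ and $ together: lines that start with "mike:" and end in "bash".
mike_line=$(grep '^mike:.*bash$' "$passwd_sample")

echo "$bash_users"
echo "$mike_line"
```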
Using grep with other commands
One of the more common uses of grep in administration tasks is to pipe
other commands to it. For example, you're curious as to how much of your
system's resources you're hogging .. eh hem.. using:
ps uax | grep $USER
Or how about looking for the files you created during the month of October:
ls -l | grep Oct
Again, we could go on for a long, long time, but we have other commands to
look at. Needless to say, you'll find grep to be
one of your most valuable tools.
Feeling a little awk(ward)
awk is another one of those tools that will make your
hunt for meaningful data a lot easier. awk is actually
a programming language designed particularly for text manipulation, but it
is widely used as an on-the-spot tool for administration.
For example, let's return to the example I used earlier with grep. You
want to see the processes that you're using. You can do this with awk as well,
and a plain awk pattern match does exactly the same as grep. However, the tool ps
presents its data in a tabled format so awk is
better suited to getting just the parts we want out of it than
grep is. The uax options we used above will
present the information like so:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3355 0.0 0.1 1344 120 ? S Jul27 0:00 crond
So using awk, we can get specific information out of each column. For
example, we could see what processes root is using and how much memory
they're using. That would mean telling awk to get us the information
about root's processes from columns 1,2 and 4:
ps uax | awk '/root/ {print $1,$2,$4}'
This would get us some information like this:
root 1 0.3
root 2 0.0
root 3 0.0
root 4 0.0
root 9 0.0
root 5 0.0
root 6 0.0
root 7 0.0
root 8 0.0
root 10 0.0
root 11 0.0
root 3466 0.0
root 3467 0.0
root 3468 0.0
root 3469 0.0
root 3512 0.0
root 3513 7.9
root 14066 0.0
You'll notice that PID 3513 shows more memory usage than the others, so
you could further use awk to see what's going on there. We'll add
the column 11 which will show us the program that's using that memory.
ps uax | awk '/3513/ {print $1,$2,$4,$11}'
And you'll see that X-window is the program in question:
root 3513 7.6 /usr/X11R6/bin/X
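A bare pattern such as /3513/ matches that number anywhere on the line. As a hedged sketch, awk can instead test one column exactly. The process table below is invented, laid out like the output of ps uax, and saved to a file so the result is repeatable.

```shell
workdir=$(mktemp -d)
ps_sample="$workdir/ps_output"

# Made-up snapshot in the same column layout as "ps uax".
cat > "$ps_sample" <<'EOF'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3355 0.0 0.1 1344 120 ? S Jul27 0:00 crond
root 3513 1.2 7.9 22000 16000 ? S Jul27 3:10 /usr/X11R6/bin/X
mike 4001 0.0 0.3 35130 600 pts/1 S Jul27 0:00 bash
EOF

# The loose pattern also catches mike's line, because its VSZ column contains 3513.
loose_matches=$(awk '/3513/ {n++} END {print n + 0}' "$ps_sample")

# $2 == 3513 tests the PID column only.
result=$(awk '$2 == 3513 {print $1, $2, $4, $11}' "$ps_sample")

echo "lines matched by /3513/: $loose_matches"
echo "$result"
```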
We could even take this one step further. We could use awk to get a total
of the memory that our processes are using:
ps uax | awk '/^mike/ { x += $4 } END { print "total memory: " x }'
awk is nice enough to oblige with the total.
In my case, I am using almost half of the machine's memory. Nice to see that
awk can do math too!
As it can do math, you can also use it to check out the total bytes
of certain files. For example, the total bytes taken up by jpgs
in your photo collection:
ls -l | awk '/jpg/ { x += $5 } END { print "total bytes: " x }'
Column 5 of ls -l shows the bytes of the file, and
we just have awk add it all up.
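The same summing idea can be checked against a fixed listing. In this sketch the lines imitate ls -l output (they are invented), and the pattern is tightened to names ending in .jpg, so a file that merely has jpg somewhere in its name is not counted.

```shell
workdir=$(mktemp -d)
listing="$workdir/listing"

# Made-up lines in the layout of "ls -l" (size is the fifth column).
cat > "$listing" <<'EOF'
-rw-r--r-- 1 mike mike 1000 Jul 14 08:45 beach.jpg
-rw-r--r-- 1 mike mike 2500 Jul 14 08:46 sunset.jpg
-rw-r--r-- 1 mike mike 300 Jul 14 08:47 jpg_notes.txt
EOF

# \.jpg$ only matches lines that end in .jpg, so jpg_notes.txt is left out.
# (x + 0) makes awk print 0 rather than an empty string when nothing matches.
total=$(awk '/\.jpg$/ { x += $5 } END { print "total bytes: " (x + 0) }' "$listing")
echo "$total"
```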
Tengo 'sed'!
sed, though an excellent tool, has been largely replaced by features
in Perl which duplicate it somewhat. I say somewhat, because the use
of sed in conjunction with pipes is somewhat more comfortable (at
least to this author) for substitution of text in files.
Sed is the Spanish word for thirst.
Though you may thirst (or hunger, or whatever) for an easier way to do things, its
name did not derive from there. It stands for Stream editor.
Sed, as you saw in our example, is basically
used for altering a text file without actually having to open it up
in a traditional editor. This results in an incredible time savings for a system
administrator.
One of the most common uses of sed is to alter or eliminate text in a file.
Let's say you want to see who last logged into your network. You would run
a utility called lastlog. This will show you the last
time that every user logged in. The fact is though that we're only interested
in real human users, not users that are really daemons or processes or
literally "nobody". If you just ran lastlog you'd get
a bunch of entries like this:
mail **Never logged in**
news **Never logged in**
uucp **Never logged in**
You can, of course, make this output more meaningful and succinct by bringing
sed into the picture. Just pipe the output of lastlog and have sed eliminate
all of the entries that contain the word Never.
lastlog | sed '/Never/d' > last_logins
In this example, sed will delete (hence the 'd' option) every line
that contains this word and send the output to a file called
last_logins. This will only contain the real-life
users, as you can see:
Username Port From Latest
fred pts/3 s57a.acme.com Mon Jul 14 08:45:49 +0200 2003
sarah pts/2 s52b.acme.com Mon Jul 14 08:01:27 +0200 2003
harry pts/6 s54d.acme.com Mon Jul 14 07:56:20 +0200 2003
carol pts/4 s53e.acme.com Mon Jul 14 07:57:05 +0200 2003
carlos pts/5 s54a.acme.com Mon Jul 14 08:07:41 +0200 2003
So you've made a nice little report here of people's last login times
for your boss. Well, nice until you realize that you've shown that
you (fred) logged in at 08:45, three quarters of an hour after you
were supposed to be at work. Not to worry. Sed can fix your late
sleeping habit too. All we have to do is substitute 07 for 08 and it
looks like you're a real go-getter.
cat last_logins | sed 's/08/07/g' > last_logins2
What you've
done is use the 's' option (for substitute) to change
every instance of 08 to 07. So now it looks like you came in at 7:45.
What a go-getter you are now! The only problem is that you have
made those changes throughout the
whole file. That would change the login times for
sarah and carlos as well to 7:01 and 7:07 respectively. They look like real
early birds! That might raise suspicion, so you need a way to change
only the 08 on the line that shows your (fred) login. Again, sed
will come to our rescue. If we restrict the substitution to lines that match fred,
sed will make sure that only the line
containing fred gets changed to 07.
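One way to write such a restricted substitution, offered as a minimal sketch rather than the only form, is to put an address pattern in front of the s command. The login lines are taken from the report above. The /^fred / address and the narrower " 08:" pattern are choices made for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > last_logins <<'EOF'
fred pts/3 s57a.acme.com Mon Jul 14 08:45:49 +0200 2003
sarah pts/2 s52b.acme.com Mon Jul 14 08:01:27 +0200 2003
carlos pts/5 s54a.acme.com Mon Jul 14 08:07:41 +0200 2003
EOF

# The /^fred / address in front of s limits the substitution to fred's line.
sed '/^fred /s/ 08:/ 07:/' last_logins > last_logins2
cat last_logins2
```

Matching " 08:" with its surrounding space and colon also keeps the substitution from touching an 08 that happens to appear in the minutes or seconds.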
Sed, of course, is more than just a text substitution tool (and a way
to hide your tendency to over-sleep on Monday). Let's go back to a
previous example of looking for MS Windows-based worms hitting our web
server. Sed can actually do a better job of 'grepping' than grep can
do. Let's say you wanted to look at all the CodeRed hits for
just the month of July. You would do this:
cat access | sed '/default.ida/!d; /\/Jul/!d' > MS_Exploits_July
The 'd' option, which we used before to
delete text, can be used, with the exclamation point in front of it, to
find text: '!d' deletes every line that does not match the pattern, so only the matching lines are left.
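To see that deleting the non-matching lines really is a form of grepping, the sketch below runs both approaches on a small invented log and compares the results. The two patterns are given as separate -e expressions, which is a portable way of writing the semicolon-separated form.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical access file: one CodeRed hit in June, one in July, one normal request.
cat > access <<'EOF'
192.0.2.1 - - [30/Jun/2003:10:00:00 +0200] "GET /default.ida?XXXX" 404 -
192.0.2.2 - - [03/Jul/2003:11:00:00 +0200] "GET /default.ida?XXXX" 404 -
192.0.2.3 - - [03/Jul/2003:12:00:00 +0200] "GET /index.html" 200 512
EOF

# Delete every line that does NOT match; what survives matched both patterns.
sed -e '/default\.ida/!d' -e '/\/Jul\//!d' access > with_sed

# The same filter written as two greps chained with a pipe.
grep 'default\.ida' access | grep '/Jul/' > with_grep

cmp -s with_sed with_grep && echo "same result"
```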
Conversely, you may want to get a printout of just normal web traffic, sans all those Windows exploits. That's where just plain 'd' comes in handy again.
cat access | sed '/^.\{200\}/d' > normal_traffic
This will eliminate any line of 200 characters or more, which means
it will show most requests for normal files, and not CodeRed or Nimda
hits, which are longer. Now, using reverse logic, we have a
different method of finding CodeRed/Nimda hits: picking out the lines that are
more than about 200 characters in length. For this,
we'll make a few modifications:
cat access | sed -n '/^.\{220\}/p' > MS_exploits
The difference with this example is the -n option. This tells sed not to print
every line automatically, so the only output comes from the other option, 'p', which means
print. In other words, limit what you show to
those lines that are 220 characters in length or more.
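The two length filters are mirror images of each other, which is easy to verify on a toy file. This sketch uses a threshold of 20 characters instead of 200 so the sample lines stay readable. The file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three sample lines of 5, 20 and 30 characters.
printf '%s\n' 'short' '12345678901234567890' '123456789012345678901234567890' > lines

# Without -n, d drops lines of 20 or more characters; everything else is printed.
sed '/^.\{20\}/d' lines > short_lines

# With -n nothing is printed by default; p prints only lines of 20 or more characters.
sed -n '/^.\{20\}/p' lines > long_lines

wc -l short_lines long_lines
```

Between them, the two output files account for every line of the input exactly once.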
There are a lot more ways you can use sed to make changes to files. One
of the best ways to find out more about sed is to see how others use it
and to learn from their examples. If you go to Google and type:
+sed +one liners, you can get all kinds of examples
of how sed can be used. It's an excellent way to polish your skills.
Using Uniq
Uniq is a good tool for weeding out useless information in files. It's
the Unix/Linux way of separating the wheat from the chaff, so to
speak. Let's say that you were involved in some sort of experiment
where you had to observe some type of behavior and report on it. Let's
say you were observing somebody's sleep habits. You had to watch
somebody sleeping and report on it every 10 minutes, or whenever there
was a change. You might sit in front of your terminal and issue this
command, for example:
echo `date +%y-%m-%d_AT_%T` No change >> sleep_experiment_43B
And when you see some changes, you could issue this command:
echo `date +%y-%m-%d_AT_%T` subject moved right arm >> sleep_experiment_43B
You'd end up with a file that looks like this:
03-08-09_AT_23:10:16 No change
03-08-09_AT_23:20:24 No change
03-08-09_AT_23:30:29 No change
03-08-09_AT_23:40:31 No change
03-08-09_AT_23:50:33 No change
03-08-09_AT_00:00:34 No change
03-08-09_AT_00:10:35 No change
03-08-09_AT_00:20:37 No change
03-08-09_AT_00:30:05 subject rolled over
03-08-09_AT_00:40:12 No change
03-08-09_AT_00:50:13 No change
03-08-09_AT_01:00:50 subject moved left leg
03-08-09_AT_01:10:17 No change
03-08-09_AT_01:20:18 No change
03-08-09_AT_01:30:19 No change
03-08-09_AT_01:40:20 No change
03-08-09_AT_01:50:47 subject moved right arm
03-08-09_AT_02:00:11 No change
03-08-09_AT_02:10:20 subject scratched nose
If this file went on until 07:00, when the subject finally wakes up, you
might have a lot of entries in there with 'no change', which is fairly
uninteresting. What if you wanted to see just the changes in sleep
behavior? Uniq is your tool. Uniq will show you just the 'unique' lines
or lines with different information. You may be thinking:
But all the lines are unique, really. That's correct.
All the times are different, but we can adjust for that on the command line.
uniq -f 1 sleep_experiment_43B
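This tells uniq to skip the first field, which is the date field, and
compare only the rest of each line. Uniq only compares lines that are next to
each other, so each run of repeated 'No change' entries is reduced to its
first line. You'll end up with something like this:
03-08-09_AT_23:10:16 No change
03-08-09_AT_00:30:05 subject rolled over
03-08-09_AT_00:40:12 No change
03-08-09_AT_01:00:50 subject moved left leg
03-08-09_AT_01:10:17 No change
03-08-09_AT_01:50:47 subject moved right arm
03-08-09_AT_02:00:11 No change
03-08-09_AT_02:10:20 subject scratched nose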
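Because uniq compares only neighbouring lines, a repeated entry that comes back later in the file is kept again. The following sketch, on a shortened and made-up version of the log, shows that behavior. It also shows one possible way to list every distinct event exactly once, by dropping the timestamp and using sort -u.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > sleep_log <<'EOF'
03-08-09_AT_23:10:16 No change
03-08-09_AT_23:20:24 No change
03-08-09_AT_23:30:05 subject rolled over
03-08-09_AT_23:40:12 No change
03-08-09_AT_23:50:13 No change
EOF

# -f 1 ignores the timestamp field; each run of repeated events keeps its first line.
uniq -f 1 sleep_log > changes
cat changes

# To list every distinct event once, drop the timestamp and let sort -u deduplicate.
cut -d' ' -f2- sleep_log | sort -u > distinct_events
cat distinct_events
```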
Now you're probably saying: I want to run Linux machines,
not conduct sleep experiments. How do I use 'uniq' along these
lines? Well, let's go back to our example of looking for
users' last login times. If you remember, lastlog
got that information for us, but it also listed users that weren't
real people. We narrowed it down to real logged in users by invoking
sed. The only drawback to that is that we got
only users that had logged in. If you want to see
all real users whether they have logged in or
not you could do this:
The fourth field of the output of lastlog is the one that
indicates **Never logged in**, so this will find all real users,
whether or not they have logged in. This is good for finding
out which users have actually used our system. Some people are given
shell accounts and then they never use them. You can then weed these
people out.
Sort this out
Another text manipulation tool that will come in handy is sort. This tool takes a text file or the output of another command and "sorts"
it out (puts it in some order) according to the options you choose. Using sort
without any options will just put the lines of a file in order. Let's imagine
that you've got a grocery list that looks something like this:
chocolate
ketchup
detergent
cola
chicken
mustard
bleach
ham
rice
bread
croissants
ice-cream
hamburgers
cookies
spaghetti
In order to put this in alphabetical order, you'd just run sort on the file.
That would give you a nice list starting with bleach
and ending with spaghetti. But let's say you're smarter
than your average shopper and you have also written down where the items
can be found, saving you time. Let's say your list looks like this:
chocolate aisle 3
ketchup aisle 9
detergent aisle 6
cola aisle 5
chicken meat dept
mustard aisle 9
bleach aisle 6
ham deli counter
rice aisle 4
bread aisle 1
croissants aisle 1
ice-cream aisle 2
hamburgers meat dept
cookies aisle 3
spaghetti aisle 4
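To make your trip to the supermarket faster, you could sort the list on its
location column. Older versions of sort selected the column with a + option
(+ [column]); that syntax is obsolete, and current versions use -k [column]
instead, so -k 2 means to sort according to the second column. Now
you'll get everything nicely sorted by section: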
bread aisle 1
croissants aisle 1
ice-cream aisle 2
chocolate aisle 3
cookies aisle 3
rice aisle 4
spaghetti aisle 4
cola aisle 5
bleach aisle 6
detergent aisle 6
ketchup aisle 9
mustard aisle 9
ham deli counter
chicken meat dept
hamburgers meat dept
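Both kinds of sorting can be tried on a shortened copy of the list. This is a sketch only: the file name grocery_list is an assumption, and -k 2 is the current option for starting the sort key at the second field.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > grocery_list <<'EOF'
ketchup aisle 9
chicken meat dept
bread aisle 1
ham deli counter
cola aisle 5
EOF

# Plain sort orders whole lines, so the item name decides.
sort grocery_list > by_item

# -k 2 starts the sort key at the second field, so the location decides.
sort -k 2 grocery_list > by_location

cat by_location
```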
Again, you're probably saying: Will being a more efficient
shopper help me with my system administration tasks? The
answer is: Yes, of course! But let's look at another example that has
to do with our system.
tail is another one of the more useful commands on
a system. tail will show you the last 10 lines of a
file. But if you use 'cat' with a pipe to sort and 'more', you get a
tail that's sort of interactive (I couldn't resist).
cat /var/log/mail.log | sort -r | more
The -r option stands for reverse,
so here you will see the entries in your mail server logs beginning with the most recent. As you push enter, you'll begin to go to the older entries.
Cut to the chase
Sometimes the output of a program gives you too
much information. You'll have to cut some of it out. That's
where the program cut comes in handy. Earlier we saw how
to get only the IP addresses of the visitors to the website. Here's a
refresher, with a twist.
cat access | cut -c1-16 > IP_visitors
This would get us a file with all of the IP addresses. We'll get back to this in a minute.
There are, of course,
other practical uses. The other day a client asked me to let a user on the system
use Hylafax, a fax server for Linux. One of the requirements of letting a user
access Hylafax is knowing his or her user ID or UID. This information is located
in /etc/passwd and there is a quick way to get at it:
cat /etc/passwd | grep bob | cut -f1,3 -d":"
Basically what we've done is to grep the user bob out of the passwd file and
pass it to cut and take the first and third (1,3) fields which are separated
(delimited -d) by a colon (:). The result is bob:1010, which tells us that
user 'bob' has a UID of 1010.
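A plain grep for bob would also pick up any other line containing those three letters. As a hedged sketch on an invented passwd-style file, anchoring the pattern to the start of the line and including the colon selects that one login name only.

```shell
workdir=$(mktemp -d)
passwd_sample="$workdir/passwd"

# Hypothetical entries; on a real system the file is /etc/passwd.
cat > "$passwd_sample" <<'EOF'
bob:x:1010:1010:Bob:/home/bob:/bin/bash
bobby:x:1011:1011:Bobby:/home/bobby:/bin/bash
EOF

# ^bob: matches the login name exactly, so "bobby" is not picked up as well.
entry=$(grep '^bob:' "$passwd_sample" | cut -d: -f1,3)
echo "$entry"
```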
Let's look at our Apache weblog example again. We can combine a few of the
text manipulation tools we've learned to see how many unique IP addresses
have visited our website.
cat access | cut -f1-2 -d" " | sort | uniq | wc -l
This is a great example of the Unix way of doing
things. We've cut out the first two fields (-f1-2 shows us only fields 1
to 2, delimited by a space, -d" "). We pipe it to sort, which puts
them in order. The uniq tool then shows only the "unique" IP
addresses. Finally, 'wc -l' counts the lines. You really have to ask
yourself: What would it cost me to get this info with a
normal word processor or text editor?
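The same building blocks can answer a related question: which address visits most often? The sketch below assumes the client address is the first space-separated field, and uses uniq -c to put a count in front of each address. The sample log is invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > access <<'EOF'
192.0.2.10 - - [15/Jul/2003:13:39:00 +0200] "GET / HTTP/1.0" 200 100
198.51.100.7 - - [15/Jul/2003:13:40:12 +0200] "GET /a HTTP/1.0" 200 100
192.0.2.10 - - [15/Jul/2003:15:33:54 +0200] "GET /b HTTP/1.0" 200 100
EOF

# Field 1 is the client address; sort -u sorts and drops duplicates in one step.
unique_visitors=$(cut -d' ' -f1 access | sort -u | wc -l | tr -d ' ')

# uniq -c prefixes each address with its count; sort -rn puts the largest count first.
busiest=$(cut -d' ' -f1 access | sort | uniq -c | sort -rn | awk 'NR == 1 {print $2}')

echo "unique visitors: $unique_visitors"
echo "busiest visitor: $busiest"
```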