Intermediate Level User Linux Course

Text processing and manipulation

Other text manipulation tools at your disposal

In your day-to-day work with a Linux server, you may find yourself needing to view, manipulate and change text data without "editing" files, per se. In this section we will look at various tools that come with standard Linux distributions for dealing with text data.

Reading and changing a data source: An example

A Linux machine is going to generate a lot of data in its day-to-day activity. Most of this is going to be very important for your management of the machine. Let's take an example. If you have the Apache web server installed, it constantly pumps data to two important files, access and errors. The access file logs every visit to your web site. Every time somebody (or something) visits your website, Apache creates an entry in this file with the date, hour and information about the file that was requested. The errors file is there to record errors such as server misconfiguration or faulty CGI scripts.

You can find good software out there for processing your access file. It will tell you how many hits you got on your webpage. It will even automatically look for requests for missing or non-existent web pages, known as 404s. Some of this software is very good at generating informative reports of traffic, warnings of missing pages and other website statistics, but it is aimed at website administrators in general and as such presents its data in a general way. At times though, you may want or need to do a bit of hands-on evaluation of the data in the file, to go beyond what these web stats packages offer.

Let's say you wanted to know how many machines infected with the Windows viruses NIMDA and CodeRed, which try to exploit flaws in Microsoft's Internet Information Server (IIS), have tried to infect your Apache server (in vain, of course). A general web statistics package will normally show you only the data for missing pages (404s) which these attacks will generate. Using standard tools that all Linux distributions provide, you'll be able to get more specific information from the access file. For example, you could use grep on the file like so:

grep "\(root.exe\|cmd.exe\)" access

This will display all the attempts in your terminal window. That, in and of itself might not be particularly interesting so let's redirect it to a file where we might look at the data a little more closely.

grep "\(root.exe\|cmd.exe\|default.ida\)" access > nimda_attempts

Now I can take a look at the file for more specific information. If I were curious to see if anybody on a certain netblock were infected with this worm, I could 'grep' the file I just generated.

grep '195.199' nimda_attempts

and we'd get something like this:

195.199.X.X - - [15/Jul/2003:13:39:00 +0200] "GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir" 404 -
195.199.X.X - - [15/Jul/2003:15:33:54 +0200] "GET /scripts/..%255c%255c../winnt/system32/cmd.exe?/c+dir" 404 -

Note

These are real hits on one of my webservers. I've X'd out the complete IPs so as not to embarrass these administrators (though thoroughly embarrassed they should be! A patch has been available for these since 2001).

Well, let's just hypothetically say that you wanted to embarrass them. You could extract the IPs from the file and post them on your "Shame on you for not patching your IIS" website. First, let's see how many virus hits we have. Let's pass the output of grep to 'wc' and count the lines.

grep "\(root.exe\|cmd.exe\|default.ida\)" access | wc -l

253

That's a lot of infected sites! Well, we're not going to cut and paste every one of the IP addresses from our text editor. That would be too tedious. This calls for a little text manipulation. That means we can cut, but there just won't be any paste. Enter the tool cut. cut is a command line tool for pulling selected pieces out of each line of a file without actually having to open it. In this example, we'll just get the IP addresses from the file.

cat nimda_attempts | cut -c 1-16 > infected_hosts

We'd get just the IP addresses of infected hosts in the file. We could even go one step further and use the tool sort to put the IP numbers in order.

cat nimda_attempts | cut -c 1-16 | sort > infected_hosts

The option we use with cut is -c 1-16, which keeps only the first 16 characters of each line and discards the rest. An IPv4 address is at most 15 characters long, so that is enough to capture every possible IP address. I opened the file and saw that we still have some dashes there after some of the shorter IP addresses. We don't have to worry too much about removing them either. We have another great tool named sed. We can now take the file infected_hosts and get rid of all those dashes. We can even do it in one fell swoop.

cat nimda_attempts | cut -c 1-16 | sort > infected_hosts ; cat infected_hosts | sed 's/ -//g' > l00ser_admins

You'll see that we had to give sed a short script. The 's' command means substitute. So we substitute any dash with a space in front of it (/ -/) with nothing (//). The 'g' flag means globally: replace every match on each line rather than just the first one. We pass that on to our final file, l00ser_admins.
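Cutting by character position leaves debris after shorter addresses. As an alternative sketch, cut can select the first space-delimited field instead, which is the whole address whatever its length. The log lines and addresses below are made-up sample data standing in for nimda_attempts.

```shell
# Hypothetical sample data standing in for the nimda_attempts file.
log=$(mktemp)
cat > "$log" <<'EOF'
195.199.1.20 - - [15/Jul/2003:13:39:00 +0200] "GET /scripts/root.exe?/c+dir" 404 -
10.0.0.7 - - [15/Jul/2003:13:40:12 +0200] "GET /scripts/cmd.exe?/c+dir" 404 -
195.199.1.20 - - [15/Jul/2003:15:33:54 +0200] "GET /default.ida?XXXX" 404 -
EOF

# Field 1, delimited by a single space, is the client address.
# sort -u sorts the addresses and drops the repeats in one step.
hosts=$(cut -d ' ' -f 1 "$log" | sort -u)
printf '%s\n' "$hosts"

rm -f "$log"
```

With this form there are no leftover dashes, so the sed clean-up step is not needed.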

The beauty of this is that we've never opened up the file in a text editor. Even if we had chosen to open our access file in a text editor like Emacs, we couldn't have done all of this handling of the data at once. You'll find that the ability to do things like this will save you lots of time, so let's look at the various tools one by one so you can get a better feel for how they work.

I can grep that!

Any information that you might want to get from a file is at your fingertips, thanks to grep. Despite rumors that grep is Vulcan for find this word, the name actually comes from the ed editor command g/re/p: globally search for a regular expression and print the lines that match. And now that you know that, I'm sure you feel a lot better. Actually the more proficient you are at dealing with regular expressions, the better you will be at systems administration. Let's look at another example from our Apache access file.

Let's say you're interested in looking at the number of visits to your website for the month of June. You would do a simple grep on the access file, like so:

grep -c 'Jun/2003' access

This will give you the number of files requested during the month of June. If you were interested in this month's requests, we could add this little twist:

grep -c `date +%b` access

What we've done basically is put a command in the way, so to speak. Grep will take the output for that command, which is the name of the month (%b option in date will give you the abbreviation of the name of the month) and look for it in the file.

Warning

A command whose output is to be used as an argument to another command must be placed inside backward (grave) accent marks, as above, or inside $( ). Otherwise the shell passes the text along literally instead of running it.
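Here is a small sketch of the same idea with the $( ) form, which behaves like the backward accent marks but is easier to read and to nest. The log line and the variable names are invented for the illustration.

```shell
# Capture the output of date once; `date +%b/%Y` would do the same job.
stamp=$(date +%b/%Y)

# Hypothetical one-line log standing in for the access file.
sample=$(mktemp)
printf '%s\n' "10.0.0.7 - - [01/$stamp:10:00:00 +0200] \"GET / HTTP/1.0\" 200 512" > "$sample"

# Quoting the substitution keeps it a single argument for grep.
hits=$(grep -c "$stamp" "$sample")
echo "hits this month: $hits"

rm -f "$sample"
```

Adding the year to the pattern keeps last year's entries for the same month out of the count.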

You can actually do some pretty cool stuff with your Apache access file using grep plus the date command. You might want to look for just today's hits:

grep -c `date +%d/%b` access

Maybe you'd like to see what the hits were like for the same day but other months:

grep -c `date +%d/` access

Perhaps you were expecting more information? Some Linux distributions will install special scripts that will run a cron job to compress the Apache access files. Even if your old files are compressed, say, each month, you can still grep them for information without having to decompress them. Grep has a version that can look in files compressed with gzip. It's called, appropriately, zgrep.

zgrep -c `date +%d/` access_062003.gz

This will look for the hits on the webserver for this same day, but in June in your gzip'd access file. (unless of course, it's the 31st).
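To try zgrep without touching real logs, the following sketch builds a tiny compressed file first. The file contents and the temporary directory are assumptions made up for the example.

```shell
# Hypothetical two-line archive standing in for a rotated access log.
dir=$(mktemp -d)
printf '%s\n' \
  '10.0.0.7 - - [14/Jun/2003:09:00:00 +0200] "GET / HTTP/1.0" 200 512' \
  '10.0.0.8 - - [15/Jun/2003:09:05:00 +0200] "GET / HTTP/1.0" 200 512' \
  > "$dir/access_062003"
gzip "$dir/access_062003"          # leaves access_062003.gz behind

# zgrep reads the compressed file directly; the .gz file stays as it is.
june14=$(zgrep -c '14/Jun/2003' "$dir/access_062003.gz")
echo "hits on 14 June: $june14"

rm -rf "$dir"
```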

Speaking of gzip, you can actually do the reverse. You can grep files and, based on the results, create a gzip'd file. For example, if you wanted to create a file of the hits this month to date, you could do this:

grep `date +%b/` access | gzip -c > access_01-20jul.gz

There is really no end to what grep will furnish for you, as long as you have some idea of what you're looking for. The truth is that we've been concentrating on the Apache files. There are dozens of files that might interest you at any given moment and all can be "grep'd" to give you important information. For example, you might be interested in how many messages you have in your in-box. Again, grep can provide you with that information.

grep -c '^From:' /var/spool/mail/penguin

If you're interested in who the mail is from, then just leave off the -c option. You should see something like this:

From: fred@tidlywinks.con
From: webmaster@perly.ork
From: "F. Scott Free" <fsfree@freebird.ork>
From: "Sarah Doktorindahaus" <sarah@hotmael.con>

The caret (^) indicates that grep should look for every line that starts with the expression you want. In this case, the 'From:' header in standard email messages.

Speaking of mail, let's say that somebody sent you an email yesterday with the name of a person that you needed to contact and his/her phone number. You wouldn't even have to open your email client to get this information. You could just look for numbers formatted like phone numbers in your inbox.

grep '[0-9]\{3\}-[0-9]\{4\}' inbox

Phone numbers are usually formatted as the exchange (555 - 3 numbers) and the number itself (6677 - 4 numbers), so grep will look for any numbers grouped in 3 and 4 respectively with a hyphen (-) between them.
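As a quick sketch of that pattern at work, the example below runs it against a made-up inbox. It adds the -o option, supported by GNU and BSD grep, so that only the matching text is printed instead of the whole line.

```shell
# Hypothetical inbox contents, invented for the illustration.
inbox=$(mktemp)
cat > "$inbox" <<'EOF'
From: fred@tidlywinks.con
Call Sarah on 555-6677 before noon.
Invoice 2003-07 is attached.
Her office line is 555-0142.
EOF

# Three digits, a hyphen, four digits. "2003-07" does not qualify
# because only two digits follow its hyphen.
numbers=$(grep -o '[0-9]\{3\}-[0-9]\{4\}' "$inbox")
printf '%s\n' "$numbers"

rm -f "$inbox"
```

The pattern is deliberately loose: it would also match part of a longer run of digits, so expect to skim the results.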

We used the caret (^) to look for lines that began with certain characters. We can also look for lines that end with specific characters by using the dollar sign ($). For example, if we wanted to know which users on our system use the bash shell, we would do this:

grep 'bash$' /etc/passwd

In this case, the shell used is the last thing listed on each line of the /etc/passwd file, so you'd see something like this as output:

root:x:0:0:root:/root:/bin/bash
mike:x:500:500:mike:/home/mike:/bin/bash
dave:x:501:501:dave:/home/dave:/bin/bash
laura:x:502:502:laura:/home/laura:/bin/bash
jeff:x:503:503:jeff:/home/jeff:/bin/bash
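If only the user names are wanted, the matching lines can be handed on to cut. This is a sketch against a made-up passwd-style file rather than the real /etc/passwd.

```shell
# Hypothetical passwd-style sample; the real file is /etc/passwd.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
mike:x:500:500:mike:/home/mike:/bin/bash
EOF

# The quotes stop the shell from touching the $ before grep sees it.
# cut then keeps field 1 (the user name) of each colon-delimited line.
bash_users=$(grep 'bash$' "$passwd_sample" | cut -d: -f1)
printf '%s\n' "$bash_users"

rm -f "$passwd_sample"
```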

Using grep with other commands

One of the more common uses of grep in administration tasks is to pipe other commands to it. For example, you're curious as to how much of your system's resources you're hogging .. eh hem.. using:

ps uax | grep $USER

Or how about looking for the files that were last modified during the month of October:

ls -l | grep Oct

Again, we could go on for a long, long time, but we have other commands to look at. Needless to say, you'll find grep to be one of your most valuable tools.

Feeling a little awk(ward)

awk is another one of those tools that will make your hunt for meaningful data a lot easier. awk is actually a programming language designed particularly for text manipulation, but it is widely used as an on-the-spot tool for administration.

For example, let's return to the example I used earlier with grep. You want to see the processes that you're using. You can do this with awk as well.

ps uax | awk '/mike/'

does exactly the same as grep. However, the tool ps presents its data in a tabled format so awk is better suited to getting just the parts we want out of it more than grep is. The uax options we used above will present the information like so:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root      3355  0.0  0.1  1344  120 ?        S    Jul27   0:00 crond

So using awk, we can get specific information out of each column. For example, we could see what processes root is using and how much memory they're using. That would mean telling awk to get us the information about root's processes from columns 1,2 and 4:

 ps uax | awk '/root/ {print $1,$2,$4}'

This would get us some information like this:

root 1 0.3
root 2 0.0
root 3 0.0
root 4 0.0
root 9 0.0
root 5 0.0
root 6 0.0
root 7 0.0
root 8 0.0
root 10 0.0
root 11 0.0
root 3466 0.0
root 3467 0.0
root 3468 0.0
root 3469 0.0
root 3512 0.0
root 3513 7.9
root 14066 0.0

You'll notice that PID 3513 shows more memory usage than the others, so you could further use awk to see what's going on there. We'll add column 11, which will show us the program that's using that memory.

ps uax | awk '/3513/ {print $1,$2,$4,$11}'

And you'll see that the X Window System server is the program in question:

root 3513 7.6 /usr/X11R6/bin/X

We could even take this one step further. We could use awk to get a total of the memory that our processes are using:

ps uax | awk '/^mike/ { x += $4 } END { print "total memory: " x  }'

awk is nice enough to accommodate us with the total:

total memory: 46.8 

I am using almost half of the machine's memory. Nice to see that awk can do math too!

As it can do math, you can also use it to check out the total bytes of certain files. For example, the total bytes taken up by jpgs in your photo collection:

ls -l | awk '/jpg/ { x += $5 } END { print "total bytes: " x }'

Column 5 of ls -l shows the bytes of the file, and we just have awk add it all up.
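The pattern /jpg/ matches "jpg" anywhere on the line, so a file such as jpg_index.html would be counted too. A slightly stricter sketch tests only the last field, the file name. The listing below is invented sample data in ls -l layout.

```shell
# Hypothetical ls -l style listing; column 5 is the size in bytes.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r-- 1 mike mike 204800 Jul 14 08:01 beach.jpg
-rw-r--r-- 1 mike mike   1024 Jul 14 08:02 notes.txt
-rw-r--r-- 1 mike mike 102400 Jul 14 08:03 jpg_index.html
-rw-r--r-- 1 mike mike 307200 Jul 14 08:04 sunset.jpg
EOF

# $NF is the last field on the line. Only names ending in .jpg count.
# x + 0 makes awk print 0 rather than an empty line when nothing matches.
total=$(awk '$NF ~ /\.jpg$/ { x += $5 } END { print x + 0 }' "$listing")
echo "total bytes: $total"

rm -f "$listing"
```

File names containing spaces would split into several fields, so this sketch assumes names without spaces.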

Tengo 'sed'!

Note

sed, though an excellent tool, has been largely replaced by features in Perl which duplicate it somewhat. I say somewhat, because the use of sed in conjunction with pipes is somewhat more comfortable (at least to this author) for substitution of text in files.

Sed is the Spanish word for thirst. Though you may thirst (or hunger, or whatever) for an easier way to do things, its name did not derive from there. It stands for stream editor. Sed, as you saw in our example, is basically used for altering a text file without actually having to open it up in a traditional editor. This can save a system administrator an incredible amount of time.

One of the most common uses of sed is to alter or eliminate text in a file. Let's say you want to see who last logged into your network. You would run a utility called lastlog. This will show you the last time that every user logged in. The fact is though that we're only interested in real human users, not users that are really daemons or processes or literally "nobody". If you just ran lastlog you'd get a bunch of entries like this:

mail                                       **Never logged in**
news                                       **Never logged in**
uucp                                       **Never logged in**

You can, of course, make this output more meaningful and succinct by bringing sed into the picture. Just pipe the output of lastlog and have sed eliminate all of the entries that contain the word Never.

lastlog | sed '/Never/d' > last_logins

In this example, sed will delete (hence the 'd' option) every line that contains this word and send the output to a file called last_logins. This will only contain the real-life users, as you can see:

Username         Port     From             Latest
fred             pts/3    s57a.acme.com    Mon Jul 14 08:45:49 +0200 2003
sarah            pts/2    s52b.acme.com    Mon Jul 14 08:01:27 +0200 2003
harry            pts/6    s54d.acme.com    Mon Jul 14 07:56:20 +0200 2003
carol            pts/4    s53e.acme.com    Mon Jul 14 07:57:05 +0200 2003
carlos           pts/5    s54a.acme.com    Mon Jul 14 08:07:41 +0200 2003

So you've made a nice little report here of people's last login times for your boss. Well, nice until you realize that you've shown that you (fred) logged in at 08:45, three quarters of an hour after you were supposed to be at work. Not to worry. Sed can fix your late sleeping habit too. All we have to do is substitute 07 for 08 and it looks like you're a real go-getter.

cat last_logins | sed 's/08/07/g' > last_logins2

What you've done is use the 's' command (for substitute) to change every instance of 08 to 07. So now it looks like you came in at 7:45. What a go-getter you are now! The only problem is that you have made those changes throughout the whole file. That would change the login times for sarah and carlos as well to 7:01 and 7:07 respectively. They look like real early birds! That might raise suspicion, so you need a way to change only the 08 on the line that shows your (fred) login. Again, sed will come to our rescue.

sed '/fred/s/08/07/g' last_logins > last_logins2

And sed will make sure that only the line containing fred gets changed to 07.
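The general technique here is an address in front of a command: sed applies the command only to lines that match the address. A minimal sketch on two sample lines follows. The pattern 08: with its trailing colon is an assumption added here so that only the hour can match.

```shell
# Two sample lines in lastlog layout.
logins=$(mktemp)
cat > "$logins" <<'EOF'
fred             pts/3    s57a.acme.com    Mon Jul 14 08:45:49 +0200 2003
sarah            pts/2    s52b.acme.com    Mon Jul 14 08:01:27 +0200 2003
EOF

# /fred/ is the address: the s command runs only on lines matching it.
edited=$(sed '/fred/s/08:/07:/' "$logins")
printf '%s\n' "$edited"

rm -f "$logins"
```

The line for sarah passes through unchanged because it never matches the address.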

Sed, of course, is more than just a text substitution tool (and a way to hide your tendency to over-sleep on Monday). Let's go back to a previous example of looking for MS Windows-based worms hitting our web server. Sed can actually do a better job of 'grepping' than grep can do. If you wanted to look at all the CodeRed hits for just the month of July, you would do this:

cat access | sed '/default.ida/!d; /\/Jul/!d' > MS_Exploits_July

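The 'd' command, which we used before to delete text, can be used with an exclamation point in front of it to find text. The exclamation point negates the pattern, so /default.ida/!d deletes every line that does not contain default.ida, and only the matching lines are left.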

On the contrary, you may want to get a print out of just normal web traffic, sans all those Windows exploits. That's where just plain 'd' comes in handy again.

cat access | sed '/^.\{200\}/d' > normal_traffic

This will eliminate any line of 200 characters or more, which means it will show most requests for normal files, and not CodeRed or Nimda hits, which are longer. Using reverse logic, we now have a different method of finding CodeRed/Nimda hits: looking for lines that are 220 characters or longer. For this, we'll make a few modifications:

cat access | sed -n '/^.\{220\}/p' > MS_exploits

The difference with this example is the -n option. This tells sed not to print every line automatically, so the only output comes from the 'p' command, which means print. In other words, limit what you show to those lines matching a pattern of 220 characters in length or more.
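The two forms are mirror images of each other, which is easier to see on a small scale. This sketch uses a threshold of 10 characters instead of 200 or 220, and three made-up lines, purely to keep the example short.

```shell
# Sample lines of 5, 10 and 39 characters.
sample=$(mktemp)
cat > "$sample" <<'EOF'
short
exactly10!
this line is longer than ten characters
EOF

# /^.\{10\}/ matches any line that has at least 10 characters.
long_lines=$(sed -n '/^.\{10\}/p' "$sample")    # print only the matches
short_lines=$(sed '/^.\{10\}/d' "$sample")      # delete the matches, keep the rest

printf 'long:\n%s\n' "$long_lines"
printf 'short:\n%s\n' "$short_lines"

rm -f "$sample"
```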

There are a lot more ways you can use sed to make changes to files. One of the best ways to find out more about sed is to see how others use it and to learn from their examples. If you go to Google and type: +sed +one liners, you can get all kinds of examples of how sed can be used. It's an excellent way to polish your skills.

Using Uniq

Uniq is a good tool for weeding out useless information in files. It's the Unix/Linux way of separating the wheat from the chaff, so to speak. Let's say that you were involved in some sort of experiment where you had to observe some type of behavior and report on it. Let's say you were observing somebody's sleep habits. You had to watch somebody sleeping and report on it every 10 minutes, or whenever there was a change. You might sit in front of your terminal and issue this command, for example:

echo `date +%y-%m-%d_AT_%T`   No change >> sleep_experiment_43B

And when you see some changes, you could issue this command:

echo `date +%y-%m-%d_AT_%T`   subject moved right arm >> sleep_experiment_43B

You'd end up with a file that looks like this:

03-08-09_AT_23:10:16 No change
03-08-09_AT_23:20:24 No change
03-08-09_AT_23:30:29 No change
03-08-09_AT_23:40:31 No change
03-08-09_AT_23:50:33 No change
03-08-09_AT_00:00:34 No change
03-08-09_AT_00:10:35 No change
03-08-09_AT_00:20:37 No change
03-08-09_AT_00:30:05 subject rolled over
03-08-09_AT_00:40:12 No change
03-08-09_AT_00:50:13 No change
03-08-09_AT_01:00:50 subject moved left leg
03-08-09_AT_01:10:17 No change
03-08-09_AT_01:20:18 No change
03-08-09_AT_01:30:19 No change
03-08-09_AT_01:40:20 No change
03-08-09_AT_01:50:47 subject moved right arm
03-08-09_AT_02:00:11 No change
03-08-09_AT_02:10:20 subject scratched nose

If this file went on until 07:00, when the subject finally wakes up, you might have a lot of entries in there with 'no change', which is fairly uninteresting. What if you wanted to see just the changes in sleep behavior? Uniq is your tool. Uniq will show you just the 'unique' lines or lines with different information. You may be thinking: But all the lines are unique, really. That's correct. All the times are different, but we can adjust for that on the command line.

uniq -f 1 sleep_experiment_43B

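This tells uniq to skip the first field, which is the date field, when comparing lines, and to drop any line that repeats the one just before it. Because uniq only compares neighboring lines, each run of 'No change' entries collapses to a single line. You'll end up with something like this:

03-08-09_AT_23:10:16 No change
03-08-09_AT_00:30:05 subject rolled over
03-08-09_AT_00:40:12 No change
03-08-09_AT_01:00:50 subject moved left leg
03-08-09_AT_01:10:17 No change
03-08-09_AT_01:50:47 subject moved right arm
03-08-09_AT_02:00:11 No change
03-08-09_AT_02:10:20 subject scratched nose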
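One thing to keep in mind is that uniq compares each line only with the line directly before it. The sketch below shows the effect on a shortened sample of the log, along with grep -v as one possible way to list nothing but the events.

```shell
# Shortened sample of the experiment log.
log=$(mktemp)
cat > "$log" <<'EOF'
03-08-09_AT_23:10:16 No change
03-08-09_AT_23:20:24 No change
03-08-09_AT_23:30:29 subject rolled over
03-08-09_AT_23:40:31 No change
03-08-09_AT_23:50:33 No change
EOF

# Each run of neighboring "No change" lines shrinks to its first line,
# but a new run starts after every interruption.
collapsed=$(uniq -f 1 "$log")

# grep -v prints the lines that do NOT match, leaving only the events.
events=$(grep -v 'No change$' "$log")

printf 'collapsed:\n%s\n' "$collapsed"
printf 'events:\n%s\n' "$events"

rm -f "$log"
```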

Now you're probably saying: I want to run Linux machines, not conduct sleep experiments. How do I use 'uniq' along these lines? Well, let's go back to our example of looking for users' last login times. If you remember, lastlog got that information for us, but it also listed users that weren't real people. We narrowed it down to real logged-in users by invoking sed. The only drawback to that is that we got only users that had logged in. If you want to keep the never-logged-in accounts in view without wading through a screenful of them, you could do this:

lastlog | uniq -f 4

On a **Never logged in** line, skipping four fields leaves nothing to compare, so uniq treats consecutive lines of that kind as duplicates and shows only the first of each run. Lines for users who have logged in differ in their timestamps, so they are all kept. Bear in mind that this only shortens the listing: an account hidden inside a collapsed run could be a daemon or a person who was given a shell account and never used it, so check the full lastlog output before you weed anybody out.

Sort this out

Another text manipulation tool that will come in handy is sort. This tool takes a text file or the output of another command and "sorts" it out (puts it in some order) according to the options you choose. Using sort without any options will just put the lines of a file in order. Let's imagine that you've got a grocery list that looks something like this:

chocolate      
ketchup
detergent
cola       
chicken     
mustard
bleach
ham
rice
bread
croissants
ice-cream
hamburgers
cookies
spaghetti

In order to put this in alphabetical order, you'd just type:

sort grocery_list

That would give you a nice list starting with bleach and ending with spaghetti. But let's say you're smarter than your average shopper and you have also written down where the items can be found, saving you time. Let's say your list looks like this:

chocolate       aisle 3
ketchup         aisle 9
detergent       aisle 6
cola            aisle 5
chicken         meat dept
mustard         aisle 9
bleach          aisle 6
ham             deli counter
rice            aisle 4
bread           aisle 1
croissants      aisle 1
ice-cream       aisle 2
hamburgers      meat dept
cookies         aisle 3
spaghetti       aisle 4

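To make your trip to the supermarket faster, you could sort the list by its second column like so:

sort -b -k 2 grocery_list

The -k 2 (-k [column]) means to sort starting from the second column, and -b tells sort to ignore the blanks that pad the columns. (The older +N form of naming the sort column is obsolete, and current versions of sort generally treat it as a file name.) Now you'll get everything nicely sorted by section: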

bread           aisle 1
croissants      aisle 1
ice-cream       aisle 2
chocolate       aisle 3
cookies         aisle 3
rice            aisle 4
spaghetti       aisle 4
cola            aisle 5
bleach          aisle 6
detergent       aisle 6
ketchup         aisle 9
mustard         aisle 9
ham             deli counter
chicken         meat dept
hamburgers      meat dept
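Here is a small self-contained sketch of sorting on the second column, using a shortened copy of the list. It assumes a sort that accepts the -k key option, and it sets LC_ALL=C only so that the order is the same on every system.

```shell
# Shortened sample of the grocery list.
list=$(mktemp)
cat > "$list" <<'EOF'
ketchup         aisle 9
chicken         meat dept
bread           aisle 1
ham             deli counter
cola            aisle 5
EOF

# -k 2 starts the sort key at the second field and runs to the end of the line.
# -b ignores the padding blanks in front of that field.
sorted=$(LC_ALL=C sort -b -k 2 "$list")
printf '%s\n' "$sorted"

rm -f "$list"
```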

Again, you're probably saying: Will being a more efficient shopper help me with my system administration tasks? The answer is: Yes, of course! But let's look at another example that has to do with our system.

tail is another one of the more useful commands on a system. tail will show you the last 10 lines of a file. But if you use 'cat' with a pipe to sort and 'more', you get a tail that's sort of interactive (I couldn't resist).

cat /var/log/mail.log | sort -r | more

The -r option stands for reverse, so the lines come out in reverse sorted order. Since each entry in the log starts with its date and time, you will see the entries in your mail server logs beginning with the most recent, at least within a single month. As you press the space bar or Enter, you'll move on to the older entries.

Cut to the chase

Sometimes the output of a program gives you too much information. You'll have to cut some of it out. That's where the program cut comes in handy. Earlier we saw how to get only the IP addresses of the visitors to the website. Here's a refresher, with a twist.

cat access | cut -c1-16 > IP_visitors

This would get us a file with all of the IP addresses. We'll get back to this in a minute.

There are, of course, other practical uses. The other day a client asked me to let a user on the system use Hylafax, a fax server for Linux. One of the requirements of letting a user access Hylafax is knowing his or her user ID or UID. This information is located in /etc/passwd and there is a quick way to get at it:

cat /etc/passwd | grep bob | cut -f1,3 -d":"

Basically what we've done is to grep the user bob out of the passwd file and pass it to cut and take the first and third (1,3) fields which are separated (delimited -d) by a colon (:). The result is this:

bob:1010

User 'bob' has a UID of 1010.

Let's look at our Apache weblog example again. We can combine a few of the text manipulation tools we've learned to see how many unique IP addresses have visited our website.

cat access | cut -f1 -d" " | sort | uniq | wc -l

This is a great example of the Unix way of doing things. We've cut out the first field (-f1 shows us only field 1, with fields delimited by a space, -d" "). We pipe it to sort, which puts the addresses in order so that identical ones end up next to each other. The uniq tool then shows only the "unique" IP addresses. Finally, 'wc -l' counts the lines. You really have to ask yourself: What would it cost me to get this info with a normal word processor or text editor?
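The same pipeline can be stretched a little further. As a sketch, uniq -c prefixes each address with the number of times it occurred, and a second, numeric sort ranks the visitors by hits. The log lines are invented sample data.

```shell
# Hypothetical three-line access log.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.7 - - [15/Jul/2003:13:39:00 +0200] "GET / HTTP/1.0" 200 512
10.0.0.8 - - [15/Jul/2003:13:40:00 +0200] "GET /a.html HTTP/1.0" 200 900
10.0.0.7 - - [15/Jul/2003:13:41:00 +0200] "GET /b.html HTTP/1.0" 200 300
EOF

# Count the distinct addresses (tr strips the padding some wc versions add).
unique=$(cut -d ' ' -f 1 "$log" | sort | uniq | wc -l | tr -d ' ')

# uniq -c adds a count in front of each address; sort -rn puts the
# biggest count first; awk prints the address from that top line.
top=$(cut -d ' ' -f 1 "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "unique visitors: $unique"
echo "busiest visitor: $top"

rm -f "$log"
```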



Copyright Linux Online Inc.
Compilation ©1994-2008 Linux Online, Inc.
All rights reserved.