Text Processing and Manipulation


Oct 27, 2011
In your day-to-day work with a Linux server, you may find yourself needing to view, manipulate and change text data without "editing" files, per se. In this section we will look at various tools that come with standard Linux distributions for dealing with text data.

Reading and changing a data source: An example

A Linux machine is going to generate a lot of data in its day-to-day activity. Much of this is going to be very important for your management of the machine. Let's take an example. If you have the Apache web server installed, it constantly pumps data to two important files, access and errors. The access file logs every visit to your web site. Every time somebody (or something) visits your website, Apache creates an entry in this file with the date, the time and information about the file that was requested. The errors file is there to record errors such as server misconfiguration or faulty CGI scripts.

You can find good software out there for processing your access file. It will tell you how many hits you got on your webpage. It will even automatically look for requests for missing or non-existent web pages, known as 404s. Some of this software is very good at generating informative reports of traffic, warnings of missing pages and other website statistics, but it is aimed at website administrators in general and as such presents its data in a general way. At times, though, you may want or need to do a bit of hands-on evaluation of the data in the file, to go beyond what these web stats packages offer.

I can grep that!

Any information that you might want to get from a file is at your fingertips, thanks to grep. Despite rumors to the contrary that grep is Vulcan for "find this word", the name comes from the ed editor command g/re/p: globally search for a regular expression and print the matching lines. And now that you know that, I'm sure you feel a lot better. Actually the more proficient you are at dealing with regular expressions, the better you will be at systems administration. Let's look at another example from our Apache access file.

Let's say you're interested in looking at the number of visits to your website for the month of June. You would do a simple grep on the access file, like so:

grep -c 'Jun/2003' access

This will give you the number of requests logged during June 2003 (the -c option makes grep count the matching lines instead of printing them). If you were interested in this month's requests, you could add this little twist:

grep -c `date +%b` access

What we've done basically is put a command in the way, so to speak. The shell runs that command first and hands its output to grep, which then looks for it in the file. Here the output is the name of the month (the %b option in date gives you the abbreviated month name).

A command whose output is to be used as an argument of another command must be placed inside backquotes (grave accent marks), or inside the equivalent $( ) form; otherwise the shell passes the text along literally instead of running it.
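Here is a minimal sketch of that idea, assuming a throwaway log file with made-up entries (the addresses and requests are invented for illustration). It also shows the $( ) spelling next to the backquotes; both should give the same result.

```shell
# Build a small sample log in a temporary file (hypothetical entries).
log=$(mktemp)
today=$(date +%d/%b/%Y)          # for example 14/Jul/2003
{
  echo "10.0.0.1 - - [$today:08:00:01 +0200] \"GET / HTTP/1.0\" 200 512"
  echo "10.0.0.2 - - [$today:08:05:42 +0200] \"GET /a.html HTTP/1.0\" 200 901"
  echo "10.0.0.3 - - [01/Jan/1999:09:00:00 +0200] \"GET / HTTP/1.0\" 200 512"
} > "$log"

# Backquotes and $( ) are two spellings of command substitution.
old_style=`date +%d/%b/%Y`
new_style=$(date +%d/%b/%Y)

# Quoting the substitution keeps the pattern in one piece.
hits_today=$(grep -c "$(date +%d/%b/%Y)" "$log")
echo "hits today: $hits_today"

rm -f "$log"
```

Including the year in the pattern, as done here, avoids counting the same day from an older log.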

You can actually do some pretty cool stuff with your Apache access file using grep plus the date command. You might want to look for just today's hits:

grep -c `date +%d/%b` access

Maybe you'd like to see what the hits were like for the same day but other months:

grep -c `date +%d/` access

Perhaps you were expecting more information? Some Linux distributions will install special scripts that will run a cron job to compress the Apache access files. Even if your old files are compressed, say, each month, you can still grep them for information without having to decompress them. Grep has a version that can look in files compressed with gzip. It's called, appropriately, zgrep:

zgrep -c `date +%d/` access_062003.gz

This will look for the hits on the webserver for this same day of the month, but in June, in your gzip'd access file (unless, of course, today is the 31st).
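To try this without touching real logs, here is a small sketch that assumes a temporary directory and three invented log entries. It compares zgrep with the longer route of decompressing to a pipe.

```shell
# Create a small gzip'd log with made-up entries.
dir=$(mktemp -d)
printf '%s\n' \
  '10.0.0.1 - - [14/Jun/2003:08:00:01 +0200] "GET / HTTP/1.0" 200 512' \
  '10.0.0.2 - - [14/Jun/2003:09:30:00 +0200] "GET /faq.html HTTP/1.0" 200 901' \
  '10.0.0.3 - - [15/Jun/2003:10:00:00 +0200] "GET / HTTP/1.0" 200 512' \
  > "$dir/access_062003"
gzip "$dir/access_062003"

# zgrep searches the compressed file directly.
hits_14th=$(zgrep -c '14/Jun' "$dir/access_062003.gz")

# The same count, obtained by decompressing to a pipe first.
hits_piped=$(gzip -dc "$dir/access_062003.gz" | grep -c '14/Jun')

echo "hits on the 14th: $hits_14th"
rm -rf "$dir"
```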

Speaking of gzip, you can actually do the reverse. You can grep files and, based on the results, create a gzip'd file. For example, if you wanted to create a file of the hits this month to date, you could do this:

grep `date +%b/` access | gzip -c > access_01-20jul.gz

There is really no end to what grep will furnish for you, as long as you have some idea of what you're looking for. The truth is that we've been concentrating on the Apache files. There are dozens of files that might interest you at any given moment and all can be "grep'd" to give you important information. For example, you might be interested in how many messages you have in your in-box. Again, grep can provide you with that information.

grep -c '^From:' /var/spool/mail/penguin

If you're interested in who the mail is from, then just leave off the -c option. You should see something like this:

From: [email protected]
From: [email protected]
From: "F. Scott Free" <[email protected]>
From: "Sarah Doktorindahaus" <[email protected]>

The caret (^) indicates that grep should look for every line that starts with the expression you want. In this case, the 'From:' header in standard email messages.

Speaking of mail, let's say that somebody sent you an email yesterday with the name of a person that you needed to contact and his/her phone number. You wouldn't even have to open your email client to get this information. You could just look for numbers formatted like phone numbers in your inbox.

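grep -E '[0-9]{3}-[0-9]{4}' inbox

Phone numbers are usually formatted as the exchange (555 - 3 numbers) and the number itself (6677 - 4 numbers), so grep will look for any numbers grouped in 3 and 4 respectively with a hyphen (-) between them. The -E option switches on extended regular expressions, which is what lets you write the repeat counts as {3} and {4}; without it you would have to escape the braces as \{3\} and \{4\}.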
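The following sketch runs that search against an invented inbox file (the message text and the example.com address are made up). It shows the extended and basic spellings of the pattern side by side, plus the -o option, which GNU and BSD grep offer for printing only the part of the line that matched.

```shell
inbox=$(mktemp)
printf '%s\n' \
  'From: "F. Scott Free" <scott@example.com>' \
  'Subject: contact details' \
  'Call Sarah on 555-6677 before noon.' \
  'Order ref 12-345 is unrelated.' \
  > "$inbox"

# Extended syntax: {3} means "exactly three of the preceding item".
ere_match=$(grep -E '[0-9]{3}-[0-9]{4}' "$inbox")

# Basic syntax: the same pattern with escaped braces.
bre_match=$(grep '[0-9]\{3\}-[0-9]\{4\}' "$inbox")

# -o prints only the matching text, not the whole line.
number=$(grep -Eo '[0-9]{3}-[0-9]{4}' "$inbox")
echo "$number"

rm -f "$inbox"
```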

We used the caret (^) to look for lines that began with certain characters. We can also look for lines that end with specific characters by using the dollar sign ($). For example, if we wanted to know which users on our system use the bash shell, we would do this:

grep 'bash$' /etc/passwd

In this case, the shell used is the last thing listed on each line of the /etc/passwd file, so grep prints the complete passwd entry of every user whose login shell is bash.
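To see what the dollar sign buys you, here is a sketch that assumes a small passwd-style file with invented users. One of them is deliberately named "bashful" so that an unanchored search gets fooled.

```shell
passwd_sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'bob:x:1010:1010:Bob:/home/bob:/bin/bash' \
  'bashful:x:1011:1011:Not a bash user:/home/bashful:/bin/sh' \
  > "$passwd_sample"

# Unanchored: also counts the user whose name merely contains "bash".
anywhere=$(grep -c 'bash' "$passwd_sample")

# Anchored to the end of the line: only real bash users.
at_end=$(grep -c 'bash$' "$passwd_sample")

# Just the login names of those users.
bash_users=$(grep 'bash$' "$passwd_sample" | cut -d: -f1)
echo "$bash_users"

rm -f "$passwd_sample"
```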


Using grep with other commands

One of the more common uses of grep in administration tasks is to pipe the output of other commands to it. For example, you're curious as to how much of your system's resources you're hogging... ahem... using:

ps uax | grep $USER

Or how about looking for the files you last changed during the month of October:

ls -l | grep Oct

Again, we could go on for a long, long time, but we have other commands to look at. Needless to say, you'll find grep to be one of your most valuable tools.

Feeling a little awk(ward)

awk is another one of those tools that will make your hunt for meaningful data a lot easier. awk is actually a programming language designed particularly for text manipulation, but it is widely used as an on-the-spot tool for administration.

For example, let's return to the example I used earlier with grep. You want to see the processes that you're using. You can do this with awk as well.

ps uax | awk '/mike/'

does exactly the same as grep. However, the tool ps presents its data in a tabular format, so awk is better suited than grep to getting just the parts we want out of it. The uax options we used above will present the information like so:

root      3355  0.0  0.1  1344  120 ?        S    Jul27  0:00 crond

So using awk, we can get specific information out of each column. For example, we could see what processes root is using and how much memory they're using. That would mean telling awk to get us the information about root's processes from columns 1,2 and 4:

ps uax | awk '/root/ {print $1,$2,$4}'

This would get us some information like this:

root 1 0.3
root 2 0.0
root 3 0.0
root 4 0.0
root 9 0.0
root 5 0.0
root 6 0.0
root 7 0.0
root 8 0.0
root 10 0.0
root 11 0.0
root 3466 0.0
root 3467 0.0
root 3468 0.0
root 3469 0.0
root 3512 0.0
root 3513 7.9
root 14066 0.0

You'll notice that PID 3513 shows more memory usage than the others, so you could further use awk to see what's going on there. We'll add column 11, which will show us the program that's using that memory.

ps uax | awk '/3513/ {print $1,$2,$4,$11}'

And you'll see that X-window is the program in question:

root 3513 7.6 /usr/X11R6/bin/X
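A pattern such as /root/ or /3513/ matches anywhere on the line, which can pull in rows you did not mean. The sketch below assumes a saved copy of some ps output (the processes are invented) and compares the loose patterns with tests against a single column.

```shell
ps_sample=$(mktemp)
printf '%s\n' \
  'USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND' \
  'root         1  0.0  0.3  1344  120 ?        S    Jul27   0:04 init' \
  'root      3513  1.2  7.9 35130 9000 ?        S    Jul27   2:10 /usr/X11R6/bin/X' \
  'mike      3600  0.1  2.4  5120 2400 pts/1    S    Jul27   0:01 vi /root/notes' \
  'mike      3601  0.0  0.5  3513  800 pts/2    S    Jul27   0:00 bash' \
  > "$ps_sample"

# Loose: "root" anywhere on the line, so mike's editor session sneaks in.
loose_root=$(awk '/root/' "$ps_sample" | wc -l | tr -d ' ')

# Strict: only rows whose first column is exactly "root".
strict_root=$(awk '$1 == "root"' "$ps_sample" | wc -l | tr -d ' ')

# Loose: "3513" anywhere, which also matches a memory size column.
loose_pid=$(awk '/3513/' "$ps_sample" | wc -l | tr -d ' ')

# Strict: only the row whose PID column equals 3513.
by_pid=$(awk '$2 == 3513 { print $1, $2, $4, $11 }' "$ps_sample")
echo "$by_pid"

rm -f "$ps_sample"
```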

We could even take this one step further. We could use awk to get a total of the memory that our processes are using:

ps uax | awk '/^mike/ { x += $4 } END { print "total memory: " x  }'

awk is nice enough to accommodate us with the total:

total memory: 46.8

I am using almost half of the machine's memory. Nice to see that awk can do math too!
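The same sum can be checked on saved data. This sketch assumes three invented process rows, and it uses printf inside awk so the total comes out with one decimal place however the floating-point addition rounds.

```shell
ps_sample=$(mktemp)
printf '%s\n' \
  'mike      3600  0.1 40.1  5120 2400 pts/1    S    Jul27   0:01 mozilla' \
  'mike      3601  0.0  6.7  3000  800 pts/2    S    Jul27   0:00 bash' \
  'root         1  0.0  0.3  1344  120 ?        S    Jul27   0:04 init' \
  > "$ps_sample"

# Add up column 4 (%MEM) for rows owned by mike, then print once at the end.
total=$(awk '$1 == "mike" { x += $4 } END { printf "%.1f", x }' "$ps_sample")
echo "total memory: $total"

rm -f "$ps_sample"
```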

As it can do math, you can also use it to check out the total bytes of certain files. For example, the total bytes taken up by jpgs in your photo collection:

ls -l | awk '/jpg/ { x += $5 } END { print "total bytes: " x }'

Column 5 of ls -l shows the bytes of the file, and we just have awk add it all up.
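One caveat: /jpg/ matches any line that contains "jpg", not only file names ending in .jpg. The sketch below assumes a temporary directory with three small invented files and compares the loose pattern with one anchored to the end of the line.

```shell
photos=$(mktemp -d)
printf 'abcde'       > "$photos/beach.jpg"       # 5 bytes
printf 'abcdefghij'  > "$photos/party.jpg"       # 10 bytes
printf 'not a photo' > "$photos/jpg_notes.txt"   # 11 bytes, not a JPEG

# Loose: any line containing "jpg", so the text file is counted too.
loose_total=$(ls -l "$photos" | awk '/jpg/ { x += $5 } END { print x + 0 }')

# Strict: only names that end in .jpg.
strict_total=$(ls -l "$photos" | awk '/\.jpg$/ { x += $5 } END { print x + 0 }')

echo "total bytes: $strict_total"
rm -rf "$photos"
```

The x + 0 in the END block makes awk print 0 rather than an empty line when nothing matches.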

Tengo 'sed'!

sed, though an excellent tool, has been largely replaced by features in Perl which duplicate it somewhat. I say somewhat, because the use of sed in conjunction with pipes is somewhat more comfortable (at least to this author) for substitution of text in files.

Sed is the Spanish word for thirst. Though you may thirst (or hunger, or whatever) for an easier way to do things, its name did not derive from there. It stands for Stream editor. Sed, as you saw in our example, is basically used for altering a text file without actually having to open it up in a traditional editor. This can save a system administrator a great deal of time.

One of the most common uses of sed is to alter or eliminate text in a file. Let's say you want to see who last logged into your network. You would run a utility called lastlog. This will show you the last time that every user logged in. The fact is though that we're only interested in real human users, not users that are really daemons or processes or literally "nobody". If you just ran lastlog you'd get a bunch of entries like this:

mail                                      **Never logged in**
news                                      **Never logged in**
uucp                                      **Never logged in**

You can, of course, make this output more meaningful and succinct by bringing sed into the picture. Just pipe the output of lastlog and have sed eliminate all of the entries that contain the word Never.

lastlog | sed '/Never/d' > last_logins

In this example, sed will delete (hence the 'd' option) every line that contains this word and send the output to a file called last_logins. This will only contain the real-life users, as you can see:

Username        Port    From            Latest
fred            pts/3    s57a.acme.com    Mon Jul 14 08:45:49 +0200 2011
sarah            pts/2    s52b.acme.com    Mon Jul 14 08:01:27 +0200 2011
harry            pts/6    s54d.acme.com    Mon Jul 14 07:56:20 +0200 2011
carol            pts/4    s53e.acme.com    Mon Jul 14 07:57:05 +0200 2011
carlos          pts/5    s54a.acme.com    Mon Jul 14 08:07:41 +0200 2011
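The same deletion can be tried safely on a saved copy of the output. This sketch assumes a five-line sample in the lastlog layout shown above and checks that sed's d command leaves the same lines as grep -v would.

```shell
lastlog_sample=$(mktemp)
printf '%s\n' \
  'Username         Port     From             Latest' \
  'mail                                      **Never logged in**' \
  'fred             pts/3    s57a.acme.com    Mon Jul 14 08:45:49 +0200 2011' \
  'news                                      **Never logged in**' \
  'sarah            pts/2    s52b.acme.com    Mon Jul 14 08:01:27 +0200 2011' \
  > "$lastlog_sample"

# Delete every line that contains "Never".
kept=$(sed '/Never/d' "$lastlog_sample")

# grep -v (print the lines that do NOT match) gives the same result.
same=$(grep -v 'Never' "$lastlog_sample")

kept_lines=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')
printf '%s\n' "$kept"

rm -f "$lastlog_sample"
```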

So you've made a nice little report here of people's last login times for your boss. Well, nice until you realize that you've shown that you (fred) logged in at 08:45, three quarters of an hour after you were supposed to be at work. Not to worry. Sed can fix your late sleeping habit too. All we have to do is substitute 08 for 07 and it looks like you're a real go-getter.

cat last_logins | sed 's/08/07/g' > last_logins2

What you've done is use the 's' command (for substitute) to change every instance of 08 to 07. So now it looks like you came in at 7:45. What a go-getter you are now! The only problem is that you have made those changes throughout the whole file. That would change the login times for sarah and carlos as well, to 7:01 and 7:07 respectively. They look like real early birds! That might raise suspicion, so you need a way to change only the 08 on the line that shows your (fred's) login. Again, sed will come to our rescue.

sed '/fred/s/08/07/g' last_logins > last_logins2

And sed will make sure that only the line containing fred gets changed to 07.
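As a rough sketch of the difference, assuming a two-line sample of the report, the commands below run the substitution with and without an address in front of it. The tighter pattern (' 08:' instead of a bare 08) is my own precaution, so that an 08 elsewhere on the line, in a hostname say, is left alone.

```shell
logins=$(mktemp)
printf '%s\n' \
  'fred   pts/3  s57a.acme.com  Mon Jul 14 08:45:49 +0200 2011' \
  'sarah  pts/2  s52b.acme.com  Mon Jul 14 08:01:27 +0200 2011' \
  > "$logins"

# No address: the substitution is applied to every line.
everywhere=$(sed 's/08/07/g' "$logins")

# With an address: only lines that start with "fred " are edited.
fred_only=$(sed '/^fred /s/ 08:/ 07:/' "$logins")
printf '%s\n' "$fred_only"

rm -f "$logins"
```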

Sed, of course, is more than just a text substitution tool (and a way to hide your tendency to over-sleep on Monday). Let's go back to a previous example of looking for MS Windows-based worms hitting our web server. Sed can actually do a better job of 'grepping' than grep can do. Let's say you wanted to look at all the CodeRed hits for just the month of July, you would do this:

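cat access | sed '/default\.ida/!d; /Jul/!d' > MS_Exploits_July

The 'd' command, which we used before to delete text, can be used with an exclamation point in front of it to find text. The exclamation point negates the match, so '!d' deletes every line that does not contain the pattern and only the matching lines are left.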
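To convince yourself that deleting the non-matching lines really is a search, here is a sketch that assumes three invented log entries. It chains two '!d' tests in sed and compares the result with two greps in a pipe.

```shell
log=$(mktemp)
printf '%s\n' \
  '10.0.0.9 - - [02/Jul/2003:01:00:00 +0200] "GET /default.ida?NNNN HTTP/1.0" 404 0' \
  '10.0.0.9 - - [02/Aug/2003:01:00:00 +0200] "GET /default.ida?NNNN HTTP/1.0" 404 0' \
  '10.0.0.1 - - [02/Jul/2003:08:00:00 +0200] "GET /index.html HTTP/1.0" 200 512' \
  > "$log"

# Keep a line only if it survives both "delete unless it matches" tests.
with_sed=$(sed '/default\.ida/!d; /\/Jul\//!d' "$log")

# The same filter written as two greps.
with_grep=$(grep 'default\.ida' "$log" | grep '/Jul/')

matches=$(printf '%s\n' "$with_sed" | wc -l | tr -d ' ')
echo "CodeRed hits in July: $matches"

rm -f "$log"
```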

On the contrary, you may want to get a print out of just normal web traffic, sans all those Windows exploits. That's where just plain 'd' comes in handy again.

cat access | sed '/^.\{200\}/d' > normal_traffic

This will eliminate any line of 200 characters or more, which means it will show most requests for normal files, and not CodeRed or Nimda hits, which are longer. (The braces are escaped because sed uses basic regular expressions by default.) Now, using reverse logic, we have a different method of finding CodeRed/Nimda hits: looking for hits that are more than 200 characters in length. For this, we'll make a few modifications:

cat access | sed -n '/^.\{220\}/p' > MS_exploits

The difference with this example is the -n option. This tells sed not to print every line automatically, so the only output comes from the 'p' command, which means print. In other words, limit what you show to those lines that are 220 characters in length or more.
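Here is a small sketch of the length test, assuming one short and one artificially long request line (the long one is padded with zeros purely for illustration). An awk version is included as a cross-check, since awk can compare the line length directly.

```shell
log=$(mktemp)
short_line='GET /index.html'
long_line=$(printf 'GET /default.ida?%0250d' 0)   # 250 zeros of padding
printf '%s\n' "$short_line" "$long_line" > "$log"

# Delete lines of 200 characters or more: normal traffic remains.
normal=$(sed '/^.\{200\}/d' "$log")

# Print only lines of 200 characters or more: the suspects.
suspect=$(sed -n '/^.\{200\}/p' "$log")

# The same test in awk.
suspect_awk=$(awk 'length($0) >= 200' "$log")

echo "normal: $normal"
rm -f "$log"
```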

There are a lot more ways you can use sed to make changes to files. One of the best ways to find out more about sed is to see how others use it and to learn from their examples. If you go to Google and type: +sed +one liners, you can get all kinds of examples of how sed can be used. It's an excellent way to polish your skills.

Using Uniq

Uniq is a good tool for weeding out useless information in files. It's the Unix/Linux way of separating the wheat from the chaff, so to speak. Let's say that you were involved in some sort of experiment where you had to observe some type of behavior and report on it. Let's say you were observing somebody's sleep habits. You had to watch somebody sleeping and report on it every 10 minutes, or whenever there was a change. You might sit in front of your terminal and issue this command, for example:

echo `date +%y-%m-%d_AT_%T`  No change >> sleep_experiment_43B

And when you see some changes, you could issue this command:

echo `date +%y-%m-%d_AT_%T`  subject moved right arm >> sleep_experiment_43B

You'd end up with a file that looks like this:

03-08-09_AT_23:10:16 No change
03-08-09_AT_23:20:24 No change
03-08-09_AT_23:30:29 No change
03-08-09_AT_23:40:31 No change
03-08-09_AT_23:50:33 No change
03-08-09_AT_00:00:34 No change
03-08-09_AT_00:10:35 No change
03-08-09_AT_00:20:37 No change
03-08-09_AT_00:30:05 subject rolled over
03-08-09_AT_00:40:12 No change
03-08-09_AT_00:50:13 No change
03-08-09_AT_01:00:50 subject moved left leg
03-08-09_AT_01:10:17 No change
03-08-09_AT_01:20:18 No change
03-08-09_AT_01:30:19 No change
03-08-09_AT_01:40:20 No change
03-08-09_AT_01:50:47 subject moved right arm
03-08-09_AT_02:00:11 No change
03-08-09_AT_02:10:20 subject scratched nose

If this file went on until 07:00, when the subject finally wakes up, you might have a lot of entries in there with 'no change', which is fairly uninteresting. What if you wanted to see just the changes in sleep behavior? Uniq is your tool. Uniq will show you just the 'unique' lines or lines with different information. You may be thinking: But all the lines are unique, really. That's correct. All the times are different, but we can adjust for that on the command line.

uniq -f 1 sleep_experiment_43B

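This tells uniq to skip the first field, which is the date field, when it compares lines. Keep in mind that uniq only compares each line with the one directly before it, so it collapses every run of consecutive 'No change' entries into its first line rather than removing all of them. You'll end up with something like this:

03-08-09_AT_23:10:16 No change
03-08-09_AT_00:30:05 subject rolled over
03-08-09_AT_00:40:12 No change
03-08-09_AT_01:00:50 subject moved left leg
03-08-09_AT_01:10:17 No change
03-08-09_AT_01:50:47 subject moved right arm
03-08-09_AT_02:00:11 No change
03-08-09_AT_02:10:20 subject scratched nose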
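Because uniq only compares neighbouring lines, a repeated observation that turns up again later in the file is reported again. If you want each observation listed only the first time it occurs, one possible approach (an awk idiom, not something uniq does for you) is sketched below on a shortened, assumed copy of the log.

```shell
notes=$(mktemp)
printf '%s\n' \
  '03-08-09_AT_23:10:16 No change' \
  '03-08-09_AT_23:20:24 No change' \
  '03-08-09_AT_00:30:05 subject rolled over' \
  '03-08-09_AT_00:40:12 No change' \
  '03-08-09_AT_01:00:50 subject moved left leg' \
  > "$notes"

# uniq -f 1 collapses runs of neighbouring duplicates only.
runs=$(uniq -f 1 "$notes" | wc -l | tr -d ' ')

# awk: strip the timestamp into "key", print a line the first time its key is seen.
first_seen=$(awk '{ key = $0; sub(/^[^ ]+ /, "", key) } !seen[key]++' "$notes" | wc -l | tr -d ' ')

echo "uniq keeps $runs lines, awk keeps $first_seen"
rm -f "$notes"
```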

Now you're probably saying: I want to run Linux machines, not conduct sleep experiments. How do I use 'uniq' along these lines? Well, let's go back to our example of looking for users' last login times. If you remember, lastlog got that information for us, but it also listed users that weren't real people. We narrowed it down to real logged-in users by invoking sed. The only drawback to that is that we got only users that had logged in. If you want a listing that still shows the accounts that have never logged in, but without page after page of them, you could do this:

lastlog | uniq -f 4

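Skipping the first four fields leaves nothing to compare on the **Never logged in** lines (the username plus those three words make up exactly four fields), so each consecutive run of them collapses to a single line, while the lines for users who have logged in still differ in their date fields and are all kept. This shortens the listing considerably, but note that uniq cannot tell a real person from a system account. It is still a good starting point for finding out which users have actually used your system. Some people are given shell accounts and then they never use them. You can then weed these people out.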

Sort this out

Another text manipulation tool that will come in handy is sort. This tool takes a text file or the output of another command and "sorts" it out (puts it in some order) according to the options you choose. Using sort without any options will just put the lines of a file in order. Let's imagine that you've got a grocery list, one item per line, in a file called grocery_list.


In order to put this in alphabetical order, you'd just type:

sort grocery_list

That would give you a nice list starting with bleach and ending with spaghetti. But let's say you're smarter than your average shopper and you have also written down where the items can be found, saving you time. Let's say your list looks like this:

chocolate      aisle 3
ketchup        aisle 9
detergent      aisle 6
cola            aisle 5
chicken        meat dept
mustard        aisle 9
bleach          aisle 6
ham            deli counter
rice            aisle 4
bread          aisle 1
croissants      aisle 1
ice-cream      aisle 2
hamburgers      meat dept
cookies        aisle 3
spaghetti      aisle 4

To make your trip to the supermarket faster, you could sort the second column of the list like so:

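sort -b -k 2 grocery_list

The -k 2 (-k [column]) means to sort according to the second column onwards, and -b tells sort to ignore the run of blanks in front of it. (Older versions of sort wrote this as +1, counting columns from zero, but that syntax is obsolete and current versions may reject it.) Now you'll get everything nicely sorted by section: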

bread          aisle 1
croissants      aisle 1
ice-cream      aisle 2
chocolate      aisle 3
cookies        aisle 3
rice            aisle 4
spaghetti      aisle 4
cola            aisle 5
bleach          aisle 6
detergent      aisle 6
ketchup        aisle 9
mustard        aisle 9
ham            deli counter
chicken        meat dept
hamburgers      meat dept
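Here is a sketch of a key-based sort on a shortened copy of the list. It assumes the -k option for choosing the column and -b for ignoring the padding blanks, and it pins the locale to C so that the ordering is the same on every machine.

```shell
list=$(mktemp)
printf '%s\n' \
  'chocolate      aisle 3' \
  'ham            deli counter' \
  'bread          aisle 1' \
  'hamburgers      meat dept' \
  'croissants      aisle 1' \
  > "$list"

# Sort on everything from the second field onwards, ignoring leading blanks.
sorted=$(LC_ALL=C sort -b -k 2 "$list")
printf '%s\n' "$sorted"

# First and last item after sorting.
first=$(printf '%s\n' "$sorted" | head -n 1 | cut -d' ' -f1)
last=$(printf '%s\n' "$sorted" | tail -n 1 | cut -d' ' -f1)

rm -f "$list"
```

Lines with the same key (the two "aisle 1" items here) are ordered by comparing the whole line, which is why bread comes before croissants.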

Again, you're probably saying: Will being a more efficient shopper help me with my system administration tasks? The answer is: Yes, of course! But let's look at another example that has to do with our system.

tail is another one of the more useful commands on a system. tail will show you the last 10 lines of a file. But if you use 'cat' with a pipe to sort and 'more', you get a tail that's sort of interactive (I couldn't resist).

cat /var/log/mail.log | sort -r |more

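The -r option stands for reverse, so here you will see the entries in your mail server logs beginning with the most recent. This works because every line starts with its timestamp; the ordering is only approximate if the log spans more than one month, since month names do not sort chronologically. As you push Enter, you'll begin to go to the older entries.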

Cut to the chase

Sometimes the output of a program gives you too much information. You'll have to cut some of it out. That's where the program cut comes in handy. Earlier we saw how to get only the IP addresses of the visitors to the website. Here's a refresher, with a twist.

cat access | cut -c1-16 > IP_visitors

This would get us a file with all of the IP addresses. We'll get back to this in a minute.

There are, of course, other practical uses. The other day a client asked me to let a user on the system use Hylafax, a fax server for Linux. One of the requirements of letting a user access Hylafax is knowing his or her user ID or UID. This information is located in /etc/passwd and there is a quick way to get at it:

cat /etc/passwd | grep bob | cut -f1,3 -d":"

Basically what we've done is to grep the user bob out of the passwd file and pass it to cut, which takes the first and third (1,3) fields, which are separated (delimited, -d) by a colon (:). The result is this:

bob:1010

User 'bob' has a UID of 1010.
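One thing to watch: a plain grep bob also matches bobby, or anyone with "bob" in their home directory. The sketch below assumes a three-line passwd-style sample with invented users and shows two ways of pinning the match to the login name.

```shell
passwd_sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bob:x:1010:1010:Bob Example:/home/bob:/bin/bash' \
  'bobby:x:1011:1011:Bobby Example:/home/bobby:/bin/bash' \
  > "$passwd_sample"

# Loose: "bob" anywhere on the line, so bobby is reported too.
loose=$(grep bob "$passwd_sample" | cut -d: -f1,3)

# Anchored: the line must start with "bob:".
exact=$(grep '^bob:' "$passwd_sample" | cut -d: -f1,3)

# awk can split on the colon and compare the first field itself.
uid=$(awk -F: '$1 == "bob" { print $3 }' "$passwd_sample")

echo "$exact"
rm -f "$passwd_sample"
```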

Let's look at our Apache weblog example again. We can combine a few of the text manipulation tools we've learned to see how many unique IP addresses have visited our website.

cat access | cut -f1 -d" " | sort | uniq | wc -l

This is a great example of the Unix way of doing things. We've cut out the first field (-f1 shows us only field 1, with fields delimited by a space, -d" "), which is the visitor's IP address. We pipe it to sort, which puts identical addresses next to each other. The uniq tool then shows each "unique" IP address only once. Finally, 'wc -l' counts the lines. You really have to ask yourself: What would it cost me to get this info with a normal word processor or text editor?
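Here is the whole pipeline run against a sample, assuming five invented log entries from three addresses. As a small extension of my own, the second command adds uniq -c to count the hits per address and a numeric reverse sort to put the busiest visitor first.

```shell
log=$(mktemp)
printf '%s\n' \
  '10.0.0.1 - - [14/Jul/2003:08:00:01 +0200] "GET / HTTP/1.0" 200 512' \
  '10.0.0.2 - - [14/Jul/2003:08:01:10 +0200] "GET /faq.html HTTP/1.0" 200 901' \
  '10.0.0.1 - - [14/Jul/2003:08:02:30 +0200] "GET /faq.html HTTP/1.0" 200 901' \
  '10.0.0.3 - - [14/Jul/2003:08:03:00 +0200] "GET / HTTP/1.0" 200 512' \
  '10.0.0.1 - - [14/Jul/2003:08:04:00 +0200] "GET /logo.png HTTP/1.0" 200 2048' \
  > "$log"

# How many different addresses visited?
unique_ips=$(cut -d' ' -f1 "$log" | sort | uniq | wc -l | tr -d ' ')

# Which address made the most requests?
busiest=$(cut -d' ' -f1 "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

echo "$unique_ips unique visitors, busiest: $busiest"
rm -f "$log"
```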
