Intermediate Level User Linux Course

System Services

Linux was born on the network of networks, the Internet, so it's no surprise that Linux's main strength lies in providing services over a network. These services include web hosting, email, file sharing and printing, and databases, along with a security system to make sure that everything stays air-tight within your organization.

Webservers

Although the Internet existed decades before it became popular with the public, this popularity is mainly due to the invention of the World Wide Web. The pages that make up the WWW are all served from machines running a type of software that has become known as a webserver.

Apache webserver

The most popular web server by far is the Apache web server. It originated as a set of patches to the original NCSA httpd web server (the name Apache comes from "a patchy webserver"). It is released under its own open source license (called, unsurprisingly, the Apache License), is available as a free download and comes with most major Linux distributions. At the time of this writing, the combination of Linux and the Apache webserver accounts for over 60 percent of the servers on the Internet.

Most major Linux distributions come with Apache and offer to install it for you. What's even better is that most distributions will now configure Apache during the install process to work together with other complementary web development packages that you may have chosen to install as well. These might include PHP, mod_perl and mod_python. These advances in ease of installation are surely welcome. I remember installing Apache from a tarball in the early days of my Linux experience and it was somewhat time consuming to get Apache to play well with all of these add-ons. This should not be an issue anymore. You can, of course, install from a tarball and get some really personalized configurations - but that goes way beyond the scope of this course. Although I normally don't like to use the expression "way beyond the scope of ...", it is a fact that entire books are dedicated to Apache alone. What we will do is deal with ways to take advantage of some of Apache's features that you can get "out of the box".

Important

The Apache version we will deal with in this section is 1.3.x, which is the most widely used version out there. Apache has released 2.0 and some of the configuration options are the same, but some have changed. Do not use this as a guide to Apache 2.0 configuration.

httpd.conf

The main configuration file for the Apache webserver can be found, normally, in /etc/httpd or /etc/apache - depending on where your distribution chooses to place it. As I mentioned before, most distributions do a pretty good job of configuring a working web server, but you may want to change some things so Apache works more to your liking. Before making any changes though, I recommend making a copy of httpd.conf. It's a fairly large file and it's easy to make some change and then lose track of what you did. Then, if you find Apache's not working right, you can always go back to the original file. I usually do something like:

cp httpd.conf httpd.conf.YYYYMMDD

Where YYYYMMDD is the year, month and day. You are, of course, free to call it httpd.conf.charlie if you choose. This is really a good policy to follow when you change any config file, especially if you're dealing with services that are crucial to a company or organization. You can quickly get back to a working server and then figure out what went wrong later. Let's look at some things you can do to get Apache working to suit your needs.
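
The dated copy described above can be made without typing the date by hand. The following is a minimal sketch that assumes a POSIX shell and the standard date command. It works in a scratch directory so it is safe to try; the directory and the file contents are stand-ins, not your real configuration.

```shell
# Scratch directory standing in for /etc/apache or /etc/httpd (an assumption).
confdir=$(mktemp -d)
echo "# sample configuration" > "$confdir/httpd.conf"

# Build the YYYYMMDD suffix from today's date.
stamp=$(date +%Y%m%d)

# -p keeps the original timestamps and permissions on the copy.
cp -p "$confdir/httpd.conf" "$confdir/httpd.conf.$stamp"

ls "$confdir"
```

On a real server you would run only the date and cp lines, as root, inside the directory that holds httpd.conf.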

Some basic security

Apache is designed so that every directory where you have created web content should have an index file. This is normally index.html, but you may also add other extensions, such as index.php, index.htm or others. The part of httpd.conf that determines this is:

#
# DirectoryIndex: Name of the file or files to use as a pre-written HTML
# directory index.  Separate multiple entries with spaces.
#
<IfModule mod_dir.c>
  DirectoryIndex index.html index.php3 index.php index.htm index.shtml index.cgi
</IfModule>

Apache, by default, is going to show us the directory listing if we don't have one of these files in a directory. That's probably not a good idea from a security standpoint. We all get lazy and we may place temporary files in a webserver that we don't mean for the world to see. The best thing is to nip this problem in the bud and keep Apache from showing directory listings. You need to find this line in httpd.conf:

Options Indexes Includes FollowSymLinks MultiViews

It's a good idea to remove the Indexes option here. This will prevent a website visitor from seeing what's in the individual directories.
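
As a rough sketch of that edit, the commands below remove Indexes from a scratch file that contains the same Options line. Using sed is only a convenience assumed here; changing the line by hand in a text editor achieves exactly the same thing.

```shell
# Scratch file holding the line quoted above; this is not the live httpd.conf.
conf=$(mktemp)
printf 'Options Indexes Includes FollowSymLinks MultiViews\n' > "$conf"

# On Options lines only, delete the word "Indexes" and the space before it.
# -i.bak edits in place and leaves the untouched original in "$conf.bak".
sed -i.bak '/^[[:space:]]*Options /s/ Indexes//' "$conf"

cat "$conf"
```

Apache only picks up the change after it has been restarted.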

Document root and cgi-bin

The document root is the directory from which Apache serves web pages by default. You will see a line like this in your httpd.conf:

#
# This should be changed to whatever you set DocumentRoot to.
#
<Directory /var/www/>

You'll find that the Apache developers are good at explaining what things mean. That is, if you prefer your web pages to be in another place, you should change it here. Even if you want them in another place, you may not want to change this right away. Further along, I'll explain the concept of "virtual" websites, which means "hosting" more than one website. However, if you're only going to be serving one set of pages, you may change this to wherever you want. You may also want to have a look at this line:

ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

This is the directory where you can place your cgi-bin scripts. Those of us who have some web development experience will know what a cgi-bin script is. In case you don't, it's a program that's meant to be run from a form on a web page. If you've submitted data to a website in the past, chances are you've used a cgi-bin script. If you've created a webpage with a form to submit information, you'll normally find something like this on the page:

<form action="/cgi-bin/myscript.cgi" method="get">

Your script is placed in the cgi-bin directory and Apache knows where to find it when the form calls it. If you change the above line in Apache to have the scripts located someplace else, you also need to change a line a little farther below:

<Directory /usr/lib/cgi-bin/>
    AllowOverride None
    Options ExecCGI
    Order allow,deny
    Allow from all
</Directory>

Again, as I mentioned above, you may not even need to make these changes if you're going to be maintaining several websites on the same server. More on that further ahead.

Personal user sites

If you give somebody an account on the machine running Linux and Apache, this person has the ability to run his/her own personal website. I'm sure many of you have seen sites like: http://www.domain.com/~larry/ . This works because the UserDir module is activated in httpd.conf:

LoadModule userdir_module /usr/lib/apache/1.3/mod_userdir.so

And farther down you will find this section:

#
# UserDir: The name of the directory which is appended onto a user's home
# directory if a ~user request is received.
#
<IfModule mod_userdir.c>
    UserDir public_html
</IfModule>

By default, Apache designates the directory where the public webfiles (and remember, these are public!) are found to be public_html. There's no reason why you can't change this name to website or any other meaningful name. You could even comment these lines out if you don't want the users on your system to have a personal website. If you do allow this, you may want to skip down to the next line:

#
# Control access to UserDir directories.  The following is an example
# for a site where these directories are restricted to read-only.
#
<Directory /home/*/public_html>

There are some options here as well as to how the site will work. You should remove the option Indexes from here as well, as we did earlier.

Server-side includes

As most of us webmasters know, websites, even fairly small ones, can become unruly and hard to maintain. One of the ways to keep your website management tasks to a minimum is to use server-side includes. A server-side include is a way to make other files part of the page. That way, for example, you can have the same navigational buttons on every page of your website. This keeps you from having to include HTML code for this in every page you create. You would just insert something like this in your pages:

<!--#include virtual="/navbar.shtml"  -->

Apache designates by default the extension .shtml to be used for server-side includes. If you want plain .html files to be parsed for server-side includes as well, you need to add this to httpd.conf. Open the file and look for these lines:

AddType text/html .shtml
AddHandler server-parsed .shtml

Now add these lines:

AddType text/html .html
AddHandler server-parsed .html

Alias directories

Some applications that run under Linux use the Apache webserver to display some of their content. There are systems to display man pages in the browser. Some Linux distributions use Apache to give you a web-based help system and documentation. They will place their documents outside of the root webserver directory. To access this "outside" content, we need to create "Alias" lines in httpd.conf or else it will be inaccessible from a web browser. In the following example, I'll show you what I need to add to httpd.conf so that visitors can see my mailman mailing list public archives.

I found the following line in httpd.conf:

#
# Aliases: Add here as many aliases as you need (with no limit). The format is
# Alias fakename realname

Then I added these lines:

# Aliases for mailman
Alias /pipermail/ /var/lib/mailman/archives/public/
Alias /images/ /usr/share/doc/mailman/images/

This means that a person only has to type http://www.mydomain.ork/pipermail/ into a browser to see the mailing lists located in /var/lib/mailman/archives/public/. If there are any images on the page, they will also be displayed.

Mailman also works by letting visitors sign up to the mailing list. The whole system is based on Python scripts. These scripts are not in the cgi-bin that Apache knows about. They are in a different place. So we also need to add these lines below so that Apache can find these scripts.

# ScriptAlias for mailman
ScriptAlias /mailman/ /usr/lib/mailman/cgi-bin/

This is known as a ScriptAlias and you will find this section below the Alias section in httpd.conf.

As you can see, Apache is very versatile - allowing us to configure it to use web content from third-party applications with relative ease.

The .htaccess file

To help with website administration, Apache supports an additional configuration file, called .htaccess (yes, with a dot (.) in front of it), where you can add more options that affect how your website works.

No more 404s

As a web surfer, nothing annoys me more than a "404 not found" page. This is what Apache will show you by default when you request a page that has disappeared.

Note

404 is the HTTP status code that a webserver such as Apache returns for a request for a page that does not exist. Web-savvy people now refer to a missing page as a "404".

Not Found
The requested URL /bla.html was not found on this server.

Apache/1.3.26 Server at www.dominio.ork Port 80

As it's frustrating as a user to find this page, it's my job as a webmaster to make sure it doesn't appear. There is really no excuse for this occurring. The .htaccess file provides a means to redirect users to content if you've moved it. Let's say you have a site that talks about a club you have set up. You have a page dedicated to your August 2002 barbecue. You've created a directory called /bbq. The club is successful and another year goes by and you have another barbecue - this time in August 2003. You decide to make the website more manageable, so you create two directories - bbq02 and bbq03 - with pages about the festivities. Now, a problem arises. People might have bookmarked the page dedicated to the hilarious food fight at the 2002 shindig: http://www.ourclub.ork/bbq/foodfight.html. Now, of course, you've moved it. I would say that it's your duty as a good webmaster to provide a re-direct. Since /bbq no longer exists, we can create an .htaccess file in our webserver root directory and add the following entry.

# redirects

RedirectPermanent /bbq/foodfight.html http://www.ourclub.ork/bbq02/foodfight.html

You should add any and all web pages that you've moved to /bbq02 to your .htaccess file as well.
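
If you moved many pages, typing one line per page gets tedious. The loop below is a small sketch that prints a RedirectPermanent line for each page in a list. Apart from foodfight.html, the page names are invented for illustration.

```shell
# Pages that used to live under /bbq
# (names other than foodfight.html are hypothetical).
pages="foodfight.html menu.html photos.html"

# Build one redirect line per page, ready to paste into .htaccess.
redirects=$(
  for page in $pages; do
    printf 'RedirectPermanent /bbq/%s http://www.ourclub.ork/bbq02/%s\n' "$page" "$page"
  done
)

printf '%s\n' "$redirects"
```

Review the printed lines before adding them to the .htaccess file in your webserver root directory.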

Friendly greetings

If you've done your work diligently in providing re-directs for moved pages, then you can be fairly confident that any 404s that are generated in your web logs are probably the result of things beyond your control. Users will often type bad URLs into their browsers and other webmasters may make mistakes providing a link to one of your pages. In these cases, it's probably a good idea to provide an alternative web page to replace Apache's standard 404 warning. Again, .htaccess provides you with this possibility. How elaborate a substitute page you provide depends on you and your imagination (and perhaps your good taste!).

Important

It's a good idea to use grep to look for 404s in your Apache access logs at least once a week or so. You may have re-directed users to other pages but you may have overlooked the fact that people may have bookmarked specific images as well. Apart from the ease-of-use issues, it is also a basic security measure. You may find one IP address generating a lot of 404s. This could be an individual checking out your site as a prelude to a defacing or other attack on your website. You may then want to take steps such as firewalling this IP from your network or, if the situation warrants, contacting the owner of the netblock.
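
The weekly check described above might look something like the sketch below. It assumes the access log is in the common log format, where the status code is the ninth whitespace-separated field, and it uses awk on that field rather than a plain grep for "404" so that byte counts or URLs containing 404 are not counted by mistake. The log lines are invented sample data in a scratch file.

```shell
# Scratch log with made-up entries in common log format.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.5 - - [12/Aug/2003:10:01:00 +0000] "GET /bbq/foodfight.html HTTP/1.0" 404 278
10.0.0.5 - - [12/Aug/2003:10:01:05 +0000] "GET /admin.php HTTP/1.0" 404 270
10.0.0.9 - - [12/Aug/2003:10:02:00 +0000] "GET /index.html HTTP/1.0" 200 1043
10.0.0.7 - - [12/Aug/2003:10:03:00 +0000] "GET /old.gif HTTP/1.0" 404 268
EOF

# Keep only requests answered with 404, print the client address,
# then count how many times each address appears, busiest first.
report=$(awk '$9 == 404 { print $1 }' "$log" | sort | uniq -c | sort -rn)

printf '%s\n' "$report"
```

Printing $7 instead of $1 in the awk command lists the missing URLs rather than the addresses, which helps you spot pages or images that still need a re-direct.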

First, as a website administrator, it's probably a good idea to create a directory for administrative needs. Call it what you like - something meaningful to you. Now you can create an alternative page for your 404s and place it in this directory. The page normally has a simple greeting, maybe something like "Oops! We can't find that.", and a link back to your home page. If you have search capabilities on the site, you may want to link to those. Again, it is up to you as a web administrator to create something that works for you and your site.
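
The directive that ties such a page to the 404 error is ErrorDocument. The sketch below writes it into a scratch .htaccess file. The directory /admin and the file name notfound.html are hypothetical, and the directive only takes effect from .htaccess if the server configuration allows overrides for that directory (for example AllowOverride FileInfo or AllowOverride All).

```shell
# Scratch directory standing in for the web root (an assumption).
webroot=$(mktemp -d)
mkdir "$webroot/admin"

# A very small replacement page; /admin/notfound.html is a made-up name.
cat > "$webroot/admin/notfound.html" <<'EOF'
<html><body>
<h1>Oops! We can't find that.</h1>
<p><a href="/">Back to the home page</a></p>
</body></html>
EOF

# Tell Apache to serve that page whenever it would return a 404.
echo 'ErrorDocument 404 /admin/notfound.html' >> "$webroot/.htaccess"

cat "$webroot/.htaccess"
```

Using a local path, as here, rather than a full http:// URL lets the visitor still receive the 404 status along with your friendlier page.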

Password protection

Apache also provides a means of keeping people out of certain directories. Again, this depends on some lines placed in .htaccess. Let's go back to your club's website. You may want to create a members-only section to the website that's restricted to those to whom you've given a password. To do this, you would first create the directory and then create an .htaccess file in the directory. Then add the following lines:

AuthUserFile /home/club/.htpasswd
AuthGroupFile /dev/null
AuthName "Our Club - Members Only"
AuthType Basic
# Not wrapped in <Limit GET>, so the password is required for every request method
Require valid-user

Now you must create the file with the users and passwords in it, called .htpasswd. You will notice that we have placed it outside of the web directories as a security precaution. Apache can read it just fine there and there is no risk of it being read by a nasty spider. Here's how you create the .htpasswd file:

htpasswd -c /home/club/.htpasswd joe

Where joe is the first user in the file. That's important because the -c option creates the file, and it will overwrite an existing one. From now on, for every user you want to add, leave the -c option off. The htpasswd program will ask you for the password twice, as is standard in Unix-type applications. Now, when you go to http://www.ourclub.ork/members/secret.html your web browser will prompt you for a user name and password.

Scripts in alternative locations

Another feature we can get via .htaccess is the ability to use scripts outside our cgi-bin directory. This is another good way to increase the manageability of your website. Let's say you have a section of your website for news about your club. You have it in a directory appropriately called /news. You may have a small Perl script that takes news items out of a MySQL database. You could create a directory in /news called /script and then create an .htaccess file in it with the following lines:

Options +ExecCGI
AddHandler cgi-script .cgi

Now, any file with the .cgi (dot-cgi) extension can be executed as a script. Normally Apache wouldn't allow that but these two lines will override that behavior. Of course, there is a good reason for this not being provided by default. It is a potential security risk. Most websites place their cgi-bin directory outside of the web directory - and for good reason. Any script can be executed from it. It's much more difficult for someone to get at the cgi-bin directory if it's in some other place. But if we place it inside a website's content directories, the possibility of someone manipulating it increases. If you do choose to use this feature, make sure that the scripts are well-written and free from exploitable bugs, such as cross-site scripting vulnerabilities, and that few people - the fewer the better - have upload privileges.

robots.txt

Search engines like Google exist because they are able to make inventories of websites. Yahoo started out with a few individuals creating a directory of the limited number of pages that existed in the early 1990s. At the time of this writing, there are literally billions of pages on the WWW, so it would be too costly to have humans do this manually. What Google and other search engines employ are automated robots. But you as a website maintainer may not want parts of your site to be inventoried by search engines - or you may not even want your site inventoried at all. To make sure that your wishes are respected, popular search engines will have their robots read a file called 'robots.txt' that is placed in the root directory of a website. robots.txt contains instructions for web crawlers, spiders and robots as to which directories are off limits. A robots.txt file that does not allow any prying robot eyes will look like this:

User-agent: *
Disallow: /

The asterisk means any user agent. And the slash / means the root directory and anything in it, which includes subdirectories. In other words, the whole site is off limits to any robot. This is a bit strict. This would definitely not do for a website maintainer who was looking to increase search engine ranking. You probably want to be a bit more lenient:

User-agent: *
Disallow: /admin
Disallow: /reports

This would allow robots to make an inventory of your site except for the two directories /admin and /reports, which you have chosen to restrict their access to.

You can also specify the type of robots you want kept off the site by naming them specifically after User-agent: . You can even have several sections to your robots.txt file for different circumstances.

User-agent: webcrawler
Disallow: /managers
Disallow: /docs

User-agent: lycos
Disallow: /managers
Disallow: /docs
Disallow: /how-to

User-agent: evilrobot
Disallow: /

User-agent: *
Disallow: /managers

What you exclude is up to you (or your organization's policy making body).

Warning

A couple of caveats about robots.txt

  • robots.txt is not intended as access restriction for humans. The directories above are intended for viewing without restriction by those who know about them. They just won't end up in any search engine that knows how to play nice. The simple fact that you put them in robots.txt will indicate that there is something there that you don't really want the whole World Wide Web to know about. Since robots.txt is available for public viewing, that means that individuals can look at it to see what you consider somewhat private. Curious people, of course, will immediately start exploring. If you have anything moderately sensitive there, protect it with a password in the way I explained above. If it's really sensitive, it shouldn't even be on a public webserver, regardless of password protection.

  • Some robots don't play nice. There are robots that go looking for email addresses on pages to sell to spammers. There are some that even go looking for physical addresses that you might have listed somewhere to sell them to telemarketers. Your robots.txt means nothing to them. They laugh in your face. The best policy is to obfuscate email addresses on pages (bill**AT**domain.ork) and not to put individuals' personal info on pages.

Virtual hosting of websites

The hosting service boom of the mid to late 1990s was due primarily to Apache and its ability to create "virtual" hosts. This allowed multiple websites to be hosted on one server - as long as bandwidth and server load allowed for it. For sites on the public web, the most common way to do this is to take advantage of Apache's NameVirtualHost directive.

If you go to the end of the httpd.conf file, you'll find the section for virtual hosting. It begins with these lines:

# If you want to use name-based virtual hosts you need to define at
# least one IP address (and port number) for them.
#
#NameVirtualHost 12.34.56.78:80
NameVirtualHost 192.168.0.25

What I've done here is to choose the IP of the machine that will host the websites and define it. Next, we need to set up space to house the various websites. I normally choose /home/[website], where [website] is the name of the "user" who will be administering the site. This may or may not be a real user. If you set up a website to sell trinkets, say, www.trinkets.bis, you may create a user called 'trinkets'. This is basically up to you. Either way, you would have a directory called /home/trinkets. In this directory, you should place a directory for web content. Call it something meaningful like /www or /html or /web. Then create a directory for your Apache access and error logs; /logs will be fine. Then create a directory for your cgi-bin scripts; the name /cgi-bin is pretty much mandatory in this case. Next, you need to create the virtual host section for Apache. Under the NameVirtualHost directive you will find these lines:

# VirtualHost example:
# Almost any Apache directive may go into a VirtualHost container.
#
#<VirtualHost ip.address.of.host.some_domain.com>
#    ServerAdmin webmaster@host.some_domain.com

You can start putting your virtual hosts here inside httpd.conf if you want. I personally don't do this. Apache provides the option to use "includes" or, in other words, it lets you tack on other files to your httpd.conf. I prefer to do this as it helps me maintain different sites better. So I would create a file called trinkets.conf in the same directory as my httpd.conf. At the end of httpd.conf I would add the line:

Include /etc/apache/trinkets.conf

Apache will now read that file as well when it starts up. In trinkets.conf you need to place the following:

##############################################
# VIRTUAL HOST WWW.TRINKETS.BIS #
##############################################

<VirtualHost 192.168.0.25>
   ServerAdmin            trinkets@trinkets.bis
   DocumentRoot           /home/trinkets/www/
   ServerName             www.trinkets.bis
   ErrorLog               /home/trinkets/logs/error
   TransferLog            /home/trinkets/logs/access
   ScriptLog              /home/trinkets/logs/script
   ScriptAlias            /cgi-bin/  /home/trinkets/cgi-bin/
   
   <Directory /home/trinkets/www/>
     Options Includes ExecCGI FollowSymLinks
     AllowOverride All
   </Directory>
   <Directory /home/trinkets/cgi-bin/>
     AllowOverride None
     Options ExecCGI FollowSymLinks
   </Directory>
</VirtualHost>

##############################################
# END HOST WWW.TRINKETS.BIS #
##############################################

This example supposes that I have a machine on my local network that web requests are forwarded to from a router. This machine is at IP 192.168.0.25 locally. I have created either a user 'trinkets' or just a directory /home/trinkets. In this directory I have created a directory /www for web content, /cgi-bin for scripts and /logs for log files. You're pretty much ready to go. You can add more of these if you want. That's the whole point of virtual hosting. Of course, you need to be able to handle the load.
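
The directory layout that trinkets.conf expects can be created in one step. This is only a sketch: it builds the tree under a scratch directory standing in for /home so that nothing real is touched, and it leaves out ownership and permissions, which depend on how you administer the site.

```shell
# Scratch directory standing in for /home (an assumption).
home=$(mktemp -d)
site="$home/trinkets"

# One directory each for web content, logs and CGI scripts,
# matching the paths used in trinkets.conf above.
mkdir -p "$site/www" "$site/logs" "$site/cgi-bin"

ls "$site"
```

Once the directories and trinkets.conf are in place, running apachectl configtest is a reasonable way to check the configuration syntax before you restart Apache.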

A head start

As I stated at the beginning, this wasn't meant to be an exhaustive guide to using the Apache webserver. There are plenty of good books out there that give you much more information about it. But this should give you a head start in understanding some of the basic concepts about Apache.



Comments: feedback (at) linux.org
Copyright Linux Online Inc.
Compilation ©1994-2008 Linux Online, Inc.
All rights reserved.