Although the Internet existed for decades before it became popular with the public, that popularity is mainly due to the invention of the World Wide Web. The pages that make up the WWW are all served from machines running a type of software that has become known as a webserver.

Apache webserver

The most popular web server by far is the Apache web server. It originated as a set of patches to add functionality to the original NCSA httpd web server (the name Apache comes from "a patchy webserver"). It is released under its own open source license (called, unsurprisingly, the Apache license), it is available as a free download, and it comes with most major Linux distributions. At the time of this writing, the combination of Linux and the Apache webserver accounts for over 60 percent of the servers on the Internet.

Most distributions offer to install Apache for you. What's even better is that most of them will now also configure Apache during the install process to work together with any complementary web development packages you have chosen to install. These might include PHP, mod_perl and mod_python.

These advances in ease of installation are surely welcome. I remember installing Apache from a tarball in the early days of my Linux experience, and it was somewhat time consuming to get Apache to play well with all of these add-ons. This should not be an issue anymore. You can, of course, still install from a tarball and get a really personalized configuration - but that goes way beyond the scope of this course. Although I normally don't like to use the expression "way beyond the scope of ...", it is a fact that entire books are dedicated to Apache alone. What we will do is look at ways to take advantage of some of the features that Apache gives you "out of the box".
httpd.conf

The main configuration file for the Apache webserver is httpd.conf. It can normally be found in /etc/httpd or /etc/apache, depending on where your distribution chooses to place it. As I mentioned before, most distributions do a pretty good job of configuring a working web server, but you may want to change some things so Apache works more to your liking.

Before making any changes, though, I recommend making a copy of httpd.conf. It's a fairly large file and it's easy to make a change and then lose track of what you did. If you then find that Apache is not working right, you can always go back to the original file. I usually do something like:

Code:
cp httpd.conf httpd.conf.YYYYMMDD

Where YYYYMMDD is the year, month and day. You are, of course, free to call it httpd.conf.charlie if you choose. This is a good policy to follow whenever you change any config file, especially if you're dealing with services that are crucial to a company or organization. You can quickly get back to a working server and figure out what went wrong later.

Let's look at some things you can do to get Apache working to suit your needs.

Some basic security

Apache is designed so that every directory where you have created web content should have an index file. This is normally index.html, but you may also add other names, such as index.php, index.htm or others. The part of httpd.conf that determines this is:

Code:
#
# DirectoryIndex: Name of the file or files to use as a pre-written HTML
# directory index. Separate multiple entries with spaces.
#
<IfModule mod_dir.c>
    DirectoryIndex index.html index.php3 index.php index.htm index.shtml index.cgi
</IfModule>

With the default options, Apache is going to show the directory listing if none of these files is present in a directory. That's probably not a good idea from a security standpoint. We all get lazy, and we may place temporary files on a webserver that we don't mean for the world to see.
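To get a feel for how many directories are exposed this way, a small shell function can walk the document root and report every directory that has no index file. This is only a sketch: the function name dirs_without_index is made up for illustration, and both the root path and the list of index names are assumptions that you should match to the DocumentRoot and DirectoryIndex settings in your own httpd.conf.

```shell
#!/bin/sh
# Sketch: list directories under a document root that contain no index file.
# dirs_without_index is a hypothetical helper name. The index names checked
# below are an assumed subset of DirectoryIndex; adjust them to your config.
dirs_without_index() {
    root=$1
    find "$root" -type d | while IFS= read -r dir; do
        found=no
        for name in index.html index.htm index.php; do
            if [ -f "$dir/$name" ]; then
                found=yes
                break
            fi
        done
        # Print only the directories where no index file was found.
        if [ "$found" = no ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Example call (the path is an assumption):
# dirs_without_index /var/www
```

Every directory it prints is one whose contents Apache would list for a visitor as long as the Indexes option is in effect.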
The best thing is to nip this problem in the bud and keep Apache from showing directory listings. You need to find this line in httpd.conf:

Code:
Options Indexes Includes FollowSymLinks MultiViews

It's a good idea to remove the Indexes option here. This will prevent a website visitor from seeing what's in the individual directories.

Document root and cgi-bin

The document root is the directory that Apache serves web pages from by default. You will see a line like this in your httpd.conf:

Code:
#
# This should be changed to whatever you set DocumentRoot to.
#
<Directory /var/www/>

You'll find that the Apache developers are good at explaining what things mean. That is, if you prefer your web pages to be in another place, you should change it here. Even if you want them in another place, you may not want to change this right away. Further along, I'll explain the concept of "virtual" websites, which means "hosting" more than one website. However, if you're only going to be serving one set of pages, you may change this to wherever you want.

You may also want to have a look at this line:

Code:
ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

This is the directory where you can place your cgi-bin scripts. Those of us who have some web development experience will know what a cgi-bin script is. In case you don't, it's a program that's meant to be run from a form on a web page. Your script is placed in the cgi-bin directory and Apache knows where to find it when the form calls it. If you change the above line in httpd.conf to have the scripts located someplace else, you also need to change the path in a section a little farther below:

Code:
<Directory /usr/lib/cgi-bin/>
    AllowOverride None
    Options ExecCGI
    Order allow,deny
    Allow from all
</Directory>

Again, as I mentioned above, you may not even need to make these changes if you're going to be maintaining several websites on the same server. More on that further ahead.
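For readers who have never seen a cgi-bin script, here is a minimal sketch of one written as a shell script. The function name cgi_hello and the message text are invented for illustration. The only fixed parts are the CGI conventions themselves: the program prints at least one header, then a blank line, then the body, and the server passes request details in environment variables such as REQUEST_METHOD.

```shell
#!/bin/sh
# Minimal CGI sketch. cgi_hello is a hypothetical name used for illustration.
cgi_hello() {
    # Header section: at least a Content-Type, followed by a blank line.
    printf 'Content-Type: text/plain\n\n'
    # Body: anything printed after the blank line is sent to the browser.
    printf 'Hello from a CGI script\n'
    # REQUEST_METHOD is set by the web server (GET, POST, ...).
    printf 'Request method: %s\n' "${REQUEST_METHOD:-unknown}"
}

cgi_hello
```

Saved as an executable file in the ScriptAlias directory, for example as /usr/lib/cgi-bin/hello.cgi (a made-up file name), it would be requested as /cgi-bin/hello.cgi.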
Personal user sites

If you give somebody an account on the machine running Linux and Apache, that person can run his or her own personal website. I'm sure many of you have seen sites like http://www.domain.com/~larry/. This works because the UserDir module is activated in httpd.conf:

Code:
LoadModule userdir_module /usr/lib/apache/1.3/mod_userdir.so

And farther down you will find this section:

Code:
#
# UserDir: The name of the directory which is appended onto a user's home
# directory if a ~user request is received.
#
<IfModule mod_userdir.c>
    UserDir public_html
</IfModule>

By default, Apache designates public_html as the directory where the public web files (and remember, these are public!) are found. There's no reason why you can't change this name to website or any other meaningful name. You could even comment these lines out if you don't want the users on your system to have a personal website. If you do allow this, you may want to skip down to the next section:

Code:
#
# Control access to UserDir directories. The following is an example
# for a site where these directories are restricted to read-only.
#
<Directory /home/*/public_html>

There are some options here as well that control how these sites will work. You should remove the Indexes option from here too, as we did earlier.

Alias directives

Some applications that run under Linux use the Apache webserver to display some of their content. There are systems to display man pages in the browser. Some Linux distributions use Apache to give you a web-based help system and documentation. They will place their documents outside of the webserver's document root. To make this "outside" content accessible from a web browser, we need to create "Alias" lines in httpd.conf. In the following example, I'll show you what I needed to add to httpd.conf so that visitors could see my mailman mailing list public archives.
I found the following lines in httpd.conf:

Code:
#
# Aliases: Add here as many aliases as you need (with no limit). The format is
# Alias fakename realname

Then I added these lines:

Code:
# Aliases for mailman
Alias /pipermail/ /var/lib/mailman/archives/public/
Alias /images/ /usr/share/doc/mailman/images/

This means that a person only has to type http://www.mydomain.ork/pipermail/ into a browser to see the mailing lists located in /var/lib/mailman/archives/public/. If there are any images on the page, they will also be displayed. As you can see, Apache is very versatile - allowing us to configure it to use web content from third-party applications with relative ease.

The .htaccess file

To help with website administration, Apache supports an additional configuration file, called .htaccess (yes, with a dot (.) in front of it), where you can add more options that affect how your website works.

No more 404s

As a web surfer, nothing annoys me more than a "404 not found" page. This is what Apache will show you by default when you request a page that has disappeared. 404 is the HTTP status code for a request for a page that does not exist. Web-savvy people now refer to a missing page as a "404".

It's a good idea to use grep to look for 404s in your Apache access logs at least once a week or so. You may have re-directed users to other pages, but you may have overlooked the fact that people have bookmarked specific images as well. Apart from the ease-of-use issues, it is also a basic security measure. You may find one IP address generating a lot of 404s. This could be an individual probing your site as a prelude to a defacement or other attack. You may then want to take steps such as firewalling this IP from your network or, if the situation warrants, contacting the owner of the netblock.

First, as a website administrator, it's probably a good idea to create a directory for administrative needs. Call it what you like - something meaningful to you.
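As an aside on the weekly log check suggested above, the following sketch goes one step beyond a plain grep and counts 404 responses per client address. It assumes that the access log uses the common (or combined) log format, where the first field is the client address and the ninth whitespace-separated field is the status code. The function name count_404s and the log path in the example call are made up for illustration.

```shell
#!/bin/sh
# Sketch: count 404 responses per client address, busiest clients first.
# Assumes common/combined log format: field 1 = client, field 9 = status.
# count_404s is a hypothetical helper name.
count_404s() {
    logfile=$1
    awk '$9 == 404 { hits[$1]++ }
         END { for (ip in hits) print hits[ip], ip }' "$logfile" |
        sort -rn
}

# Example call (the log path is an assumption and varies by distribution):
# count_404s /var/log/apache/access.log
```

Each output line is a count followed by an address, so an address that is scanning for pages that do not exist tends to show up at the top.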
Now you can create an alternative page for your 404s and place it in this directory. The page normally has a simple greeting, maybe something like "Oops! We can't find that.", and maybe a link back to your home page. If you have search capabilities on the site, you may want to link to those. You then tell Apache to serve this page with an ErrorDocument 404 line in .htaccess, giving the URL path of the page you created. Again, it is up to you as a web administrator to create something that works for you and your site.

Password protection

Apache also provides a means of keeping people out of certain directories. Again, this depends on some lines placed in .htaccess. Let's go back to your club's website. You may want to create a members-only section of the website that's restricted to those to whom you've given a password. To do this, you would first create the directory and then create an .htaccess file in it. Then add the following lines:

Code:
AuthUserFile /home/club/.htpasswd
AuthGroupFile /dev/null
AuthName "Our Club - Members Only"
AuthType Basic
<Limit GET>
Require valid-user
</Limit>

Note that <Limit GET> applies the restriction only to GET (and HEAD) requests. Leave out the <Limit> wrapper if you want every request method to be protected.

Now you must create the file with the users and passwords in it, called .htpasswd. You will notice that we have placed it outside of the web directories as a security precaution. Apache can read it just fine there and there is no risk of it being read by a nasty spider. Here's how you create the .htpasswd file:

Code:
htpasswd -c /home/club/.htpasswd joe

Where joe is the first user in the file. That's important because the -c option creates the file. From now on, for every user you want to add, you don't use the -c option. The htpasswd program will ask you for the password twice, as is standard in Unix-type applications. Now, when you go to http://www.ourclub.ork/members/secret.html, your web browser will prompt you for a user name and password.

Scripts in alternative locations

Another feature we can get via .htaccess is the ability to use scripts outside our cgi-bin directory. This is another good way to increase the manageability of your website. Let's say you have a section of your website for news about your club.
You have it in a directory appropriately called /news. You may have a small Perl script that takes news items out of a MySQL database. You could create a directory in /news called /script and then create an .htaccess file in it with the following lines:

Code:
Options +ExecCGI
AddHandler cgi-script .cgi

Now, any file in that directory with the .cgi (dot-cgi) extension can be executed as a script. Normally Apache wouldn't allow that, but these two lines override that behavior.

Of course, there is a good reason for this not being provided by default. It is a potential security risk. Most websites place their cgi-bin directory outside of the web directory - and for good reason. Any script can be executed from it. It's much more difficult for someone to get at the cgi-bin directory if it's in some other place. But if we place scripts inside a website's content directories, the possibility of someone manipulating them increases. If you do choose to use this feature, make sure that the scripts are well-written and free from exploitable bugs, such as cross-site scripting vulnerabilities, and that few people - the fewer the better - have upload privileges.

robots.txt

Search engines like Google exist because they are able to make inventories of websites. Yahoo started out with a few individuals creating a directory of the limited number of pages that existed in the early 1990s. At the time of this writing, there are literally billions of pages on the WWW, so it would be too costly to have humans do this manually. What Google and other search engines employ instead are automated robots. But you as a website maintainer may not want parts of your site to be inventoried by search engines - or you may not want your site inventoried at all. To make sure that your wishes are respected, popular search engines have their robots read a file called 'robots.txt' that is placed in the root directory of a website.
robots.txt contains instructions for web crawlers, spiders and robots as to which directories are off limits. A robots.txt file that does not allow any prying robot eyes will look like this:

Code:
User-agent: *
Disallow: /

The asterisk means any user agent. The slash (/) means the root directory and anything in it, which includes subdirectories. In other words, the whole site is off limits to any robot. This is a bit strict. It would definitely not do for a website maintainer who was looking to increase search engine ranking. You probably want to be a bit more lenient:

Code:
User-agent: *
Disallow: /admin
Disallow: /reports

This would allow robots to make an inventory of your site except for the two directories /admin and /reports, which you have chosen to keep them out of. You can also single out the robots you want kept off the site by naming them specifically after User-agent:. You can even have several sections in your robots.txt file for different circumstances.

Code:
User-agent: webcrawler
Disallow: /managers
Disallow: /docs

User-agent: lycos
Disallow: /managers
Disallow: /docs
Disallow: /how-to

User-agent: evilrobot
Disallow: /

User-agent: *
Disallow: /managers

What you exclude is up to you (or your organization's policy making body).
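When a robots.txt file has several sections, it helps to check which rules a particular robot would actually see. The sketch below prints the Disallow paths from the section whose User-agent line names a given robot. It is a deliberate simplification, not a full robots.txt parser: it assumes one directive per line, treats every User-agent line as the start of a new section, and does not fall back to the * section when a robot has no section of its own. The function name disallowed_for is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the Disallow paths listed for one user agent in a robots.txt.
# disallowed_for is a hypothetical helper name. Simplifications: one directive
# per line, each User-agent line starts a new section, no fallback to "*".
disallowed_for() {
    agent=$1
    file=$2
    awk -v agent="$agent" '
        # A User-agent line switches matching on or off for the lines below it.
        tolower($1) == "user-agent:" {
            active = (tolower($2) == tolower(agent))
            next
        }
        # Inside the matching section, print each disallowed path.
        active && tolower($1) == "disallow:" { print $2 }
    ' "$file"
}

# Example call (the file path is an assumption):
# disallowed_for lycos /var/www/robots.txt
```

Run against the multi-section file above, it lets you confirm that each named robot is being kept out of the directories you intended.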