Use rsync (see http://rsync.samba.org/) to mirror the LinuxFocus server. rsync minimizes network traffic and mirror time, and it is the fastest and easiest way to keep your site up to date. Do not use a web crawler or FTP: those methods are slow and put a lot of load on the main server.
Take a look at the following script. You should use such a script to mirror LinuxFocus. Note that the domain to mirror from is rsync.linuxfocus.org and not www.linuxfocus.org.
#!/bin/sh
# Please contact guido@linuxfocus.org if you have
# any questions.
#-----------------------------------------
## put something like this into the crontab of a user who has write
## permissions on your web-server:
## run the synclf script every day at 2:39 in the night:
#39 2 * * * /home/xxx/synclf
#
#-----------------------------------------
# ensure that your webserver can read new files:
umask 022
#-----------------------------------------
# the directory of the LinuxFocus mirror page (please edit this line):
target=/http/linuxfocus
#
# You can uncomment the following line for debug purposes:
#DEBUG="yes"
#
if [ "$DEBUG" = "yes" ]; then
    echo "debug output will be written to the file /tmp/synclf.$$ ..."
    echo "start rsync with rsync.linuxfocus.org" > /tmp/synclf.$$
    date >> /tmp/synclf.$$
    rsync -rLtz -vv --delete rsync.linuxfocus.org::lf/ "$target" >> /tmp/synclf.$$ 2>&1
    exit 0
fi
# Normally (debug off) the following will be executed:
#
rsync -rLtz --delete rsync.linuxfocus.org::lf/ "$target"
#
#-------------- End of rsync script ---------------
# You can get rsync at ftp://rsync.samba.org/pub/rsync/
# or http://rsync.samba.org/
#

The above script is an example for downloading the dynamic HTML pages. As an alternative you can get static pages from rsync.linuxfocus.org::statichtml/ instead of rsync.linuxfocus.org::lf/.
You should mirror LinuxFocus once a day during the low-traffic hours between 23:00 and 05:00 (UTC / GMT).
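Because the window is given in UTC while cron normally runs in the server's local time, it can help to check the current UTC hour before a manual run. The following is only a sketch: the helper name `in_mirror_window` is hypothetical, and it assumes that 05:00 itself already counts as outside the window.

```shell
#!/bin/sh
# Hypothetical helper: succeed if HOUR (00-23, UTC) lies in the 23:00-05:00 window.
in_mirror_window() {
    hour=${1#0}        # drop one leading zero ("05" -> "5") for a plain decimal compare
    hour=${hour:-0}    # a bare "0" would otherwise become empty
    [ "$hour" -ge 23 ] || [ "$hour" -lt 5 ]
}

# date -u prints the time in UTC regardless of the local time zone.
if in_mirror_window "$(date -u +%H)"; then
    echo "inside the low-traffic window"
else
    echo "outside the low-traffic window"
fi
```

A check like this could be placed at the top of the mirror script to warn about manual runs during busy hours.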
Create a text file, called crontab.txt, with the following data (please vary the time a bit):
# run the synclf script every day at 2:45 in the night:
45 2 * * * /home/where/ever/you/put/it/synclf

and then activate it with the command
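A minimal sketch of this step, assuming a standard cron installation where `crontab FILE` loads FILE as the calling user's crontab and replaces whatever was installed before. The helper name `cron_line` is hypothetical, and the script path is the placeholder from the example entry.

```shell
#!/bin/sh
# Hypothetical helper: print one crontab entry that runs SCRIPT daily at HOUR:MINUTE.
cron_line() {
    minute=$1
    hour=$2
    script=$3
    printf '%s %s * * * %s\n' "$minute" "$hour" "$script"
}

# Pick your own minute so that the mirrors do not all connect at the same time.
cron_line 45 2 /home/where/ever/you/put/it/synclf > crontab.txt

# Assumption: standard cron. This replaces the user's existing crontab,
# so the two commands are left commented out here.
# crontab crontab.txt
# crontab -l    # list the installed entries to confirm
```

If the user already has other cron jobs, adding the line with `crontab -e` avoids overwriting them.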
# To use server-parsed HTML files
AddType text/html .shtml
AddHandler server-parsed .shtml

Both the #exec command and the #include command must be enabled (see http://www.apache.org/docs and search for "Server Side Includes"). You need the directory option +Includes and you must load mod_include (Options Indexes +Includes).
The #exec command is needed because the LinuxFocus web pages execute a Perl script called lfdynahead.pl. This script sets the links between the different languages. You can take a look at it if you want. It is in the document root directory of linuxfocus.org.
You can tell that SSI is working if the articles show a line at the top that says "This article is available in: .....".

[Picture: article header showing the "This article is available in: ....." line]
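The same check can be done from the command line on a page saved from your mirror. This is only a sketch: the helper name `ssi_looks_ok` and the file name in the usage line are hypothetical, and it assumes that an unparsed page still contains raw `<!--#` directives.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the saved page FILE looks server-parsed.
ssi_looks_ok() {
    page=$1
    # A directive such as <!--#exec ...--> that is still present in the served
    # page means the server sent the file without parsing it.
    if grep -q '<!--#' "$page"; then
        return 1
    fi
    # The language line quoted in the text should be present.
    grep -q 'This article is available in' "$page"
}

# Usage (article.html is a hypothetical file saved from your mirror):
# ssi_looks_ok article.html && echo "SSI seems to work"
```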
/usr/bin/perl is the standard path to Perl under Linux. Any common Linux distribution will have Perl in that location.
DirectoryIndex index.html index.shtml
Netscape, Internet Explorer and many other browsers do not handle the HTML tag "META HTTP-EQUIV" correctly.
<META http-equiv="Content-Type" content="text/html; charset=....">

It is almost impossible to work around this problem. It does not affect the normal Latin 1 encoding (iso-8859-1), but Chinese, Russian and other languages need to have the correct encoding.
AccessFileName .htaccess
# to the Directory directive for the directory where
# LinuxFocus files are you must add:
AllowOverride FileInfo
# or: AllowOverride All
# but FileInfo is the minimum

Our .htaccess files (you don't need to worry about them, this is just for your information) contain something like:
AddType "text/html; charset=gb2312" .html
AddType "text/html; charset=gb2312" .shtml
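To see whether the .htaccess files take effect, you can look at the Content-Type header your server sends for an article. The sketch below assumes you fetch the headers yourself, for example with `curl -sI`. The helper name `charset_from_headers` and the URL in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# charset announced in the Content-Type header (prints nothing if absent).
charset_from_headers() {
    # HTTP header lines end in CR LF, so remove the carriage returns first.
    tr -d '\r' |
        sed -n 's/^[Cc]ontent-[Tt]ype:.*charset=\([^;[:space:]]*\).*$/\1/p'
}

# Usage (the URL is a placeholder for a Chinese article on your mirror):
# curl -sI http://your.mirror.example/some/article.shtml | charset_from_headers
```

If AllowOverride FileInfo is active, a Chinese article should report the charset from its .htaccess file; no output means the server announced no charset.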
NameVirtualHost *
#
#
<VirtualHost *>
    ServerAdmin webmaster@your.host
    DocumentRoot /home/httpd/html/linuxfocus
    ServerName something.linuxfocus.org
    ServerAlias www.something.linuxfocus.org somethingelse.org
    AddType text/html .shtml
    AddHandler server-parsed .shtml
    AccessFileName .htaccess
    <Directory "/home/httpd/html/linuxfocus">
        # remember to load the mod_include (LoadModule and AddModule)
        # at the top of your configuration file
        Options Indexes +Includes
        DirectoryIndex index.html index.shtml
        AllowOverride FileInfo
        # Controls who can get stuff from this server.
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>