QSF(1)				 User Manuals				QSF(1)

NAME
       qsf - quick spam filter

SYNOPSIS
       Filtering:	qsf [-snrAtav] [-d DB] [-g DB]
			    [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
			    [-X NUM]
       Training:	qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
       Retraining:	qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
       Database:	qsf -[p|D|R|O] [-d DB]
       Database merge:	qsf -E OTHERDB [-d DB]
       Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
       Denylist query:	qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
       Help:		qsf -[h|l|V]

DESCRIPTION
       qsf  reads  a single email on standard input, and by default outputs it
       on standard output.  If the email is determined to be  spam,  an	 addi-
       tional header ("X-Spam: YES") will be added, and optionally the subject
       line can have "[SPAM]" prepended to it.

       qsf is intended to be used in a procmail(1) recipe, in a	 ruleset  such
       as this:

	       :0 wf
	       | qsf -ra

	       :0 H:
	       * X-Spam: YES
	       $HOME/mail/spam

       For  more examples, including sample procmail(1) recipes, see the EXAM-
       PLES section below.

TRAINING
       Before qsf can be used properly, it needs to be trained.	 A good way to
       train qsf is to collect a copy of all your email into two folders - one
       for spam, and one for non-spam.	Once you have done this, you  can  use
       the training function, like this:

	       qsf -aT spam-folder non-spam-folder

       This  will generate a database that can be used by qsf to guess whether
       email received in the future is spam or not.  Note  that	 this  initial
       training	 run  may  take a long time, but you should only need to do it
       once.

       To mark a single message as spam, pipe it to qsf with  the  --mark-spam
       or  -m  ("mark as spam") option.	 This will update the database accord-
       ingly and discard the email.

       To mark a single message as non-spam, pipe it to qsf with  the  --mark-
       nonspam	or  -M	("mark as non-spam") option.  Again, this will discard
       the email.

       If a message has been mis-tagged, simply send it to qsf as the opposite
       type,  i.e.  if it has been mistakenly tagged as spam, pipe it into qsf
       --mark-nonspam --weight=2 to  add  it  to  the  non-spam	 side  of  the
       database with double the usual weighting.

OPTIONS
       The qsf options are listed below.

       -d, --database [TYPE:]FILE
	      Use  FILE	 as the spam/non-spam database.	 The default is to use
	      /var/lib/qsfdb and, if that is not available  or	is  read-only,
	      $HOME/.qsfdb.  This option can also be useful if there is a sys-
	      tem-wide database but you do not want to	use  it	 -  specifying
	      your own here will override the default.

	      If   you	 prefix	  the  filename	 with  a  TYPE,	 of  the  form
	      btree:$HOME/.qsfdb, then this will specify what kind of database
	      FILE is, such as list, btree, gdbm, sqlite and so on.  Check the
	      output of qsf -V to see which database backends  are  available.
	      The default is to auto-detect the type, or, if the file does not
	      already exist, use list.	Note that TYPE is not  case-sensitive.

       -g, --global [TYPE:]FILE
	      Use   FILE   as	the   default	global	database,  instead  of
	      /var/lib/qsfdb.  If you also specify a database  with  -d,  then
	      this  "global"  database	will be used in read-only mode in con-
	      junction with the read-write database specified with -d.	The -g
	      option  can  be  used a second time to specify a third database,
	      which will also be used in read-only mode.  Again, the  filename
	      can  optionally  be  prefixed  with  a  TYPE which specifies the
	      database type.

       -P, --plain-map FILE
	      Maintain a mapping of all database tokens	 to  their  non-hashed
	      counterparts in FILE, one token per line.	 This can be useful if
	      you want to be able to list the contents of your database	 at  a
	      later  date,  for	 instance  to get a list of email addresses in
	      your allow-list.	Note that using this option may slow qsf down,
	      and  only	 entries  written to the database while this option is
	      active will be stored in FILE.

       -s, --subject
	      Rewrite the Subject line of any email that turns out to be spam,
	      adding "[SPAM]" to the start of the line.

       -S, --subject-marker SUBJECT
	      Instead  of  adding "[SPAM]", add SUBJECT to the Subject line of
	      any email that turns out to be spam.  Implies -s.

       -H, --header-marker MARK
	      Instead of setting the X-Spam header to "YES", set it to MARK if
	      email  turns  out	 to be spam.  This can be useful if your email
	      client can only search all headers for a string, rather than one
	      particular  header (so searching for "YES" might match more than
	      just the output of qsf).

       -n, --no-header
	      Do not add an X-Spam header to messages.

       -r, --add-rating
	      Insert an additional header X-Spam-Rating which is a  rating  of
	      the  "spamminess"	 of  a message from 0 to 100; 90 and above are
	      counted as spam, anything under 90 is not considered  spam.   If
	      combined with -t, then the rating (0-100) will be output, on its
	      own, on standard output.

       -A, --asterisk
	      Insert an additional  header  X-Spam-Level  which	 will  contain
	      between 0 and 20 asterisks (*), depending on the spam rating.

       -t, --test
	      Instead  of  passing  the message out on standard output, output
	      nothing, and exit 0 if the message is not spam, or exit 1 if the
	      message is spam.	If combined with -r, then the spam rating will
	      be output on standard output.

       -a, --allowlist
	      Enable the allow-list.  This causes the email addresses given in
	      the  message's  "From:" and "Return-Path:" headers to be checked
	      against a list; if either	 one  matches,	then  the  message  is
	      always  treated  as  non-spam,  regardless  of  what  the	 token
	      database says. When specified with  a  retraining	 flag,	-a  -m
	      (mark  as	 spam) will remove that address from the allow-list as
	      well as marking the message as spam, and -a  -M  (mark  as  non-
	      spam) will add that address to the allow-list as well as marking
	      the message as non-spam.	The idea is that you add all  of  your
	      friends  to the allow-list, and then none of their messages ever
	      get marked as spam.

       -y, --denylist
	      Enable the deny-list.  This causes the email addresses given  in
	      the  message's  "From:" and "Return-Path:" headers to be checked
	      against a second list; if either one matches, then theh  message
	      is  always  treated  as spam.  Training works in the same way as
	      with -a, except that you must specify -m or -M twice  to	modify
	      the  deny-list  instead  of the allow-list, and with the reverse
	      syntax: -y -m -m (mark as spam) will add	that  address  to  the
	      deny-list,  whereas -y -M -M (mark as non-spam) will remove that
	      address from the deny-list.  This	 double	 specification	is  so
	      that  the	 usual retraining process never touches the deny-list;
	      the deny-list should be carefully maintained rather  than	 auto-
	      matically generated.

	      Normally you would not need to use the deny-list.

       -L, --level, --threshold LEVEL
	      Change  the  spam	 scoring threshold level which must be reached
	      before an email is classified as spam.  The default is 90.

       -Q, --min-tokens NUM
	      Only give a score if more than NUM tokens are found in the  mes-
	      sage  -  otherwise the message is assumed to be non-spam, and it
	      is not modified in any way.  The	default	 is  0.	  This	option
	      might  be	 useful if you find that very short messages are being
	      frequently miscategorised.

       -e, --email, --email-only EMAIL
	      Query or update the  allow-list  entry  for  the	email  address
	      EMAIL.   With no other options, this will simply output "YES" if
	      EMAIL is in the allow-list, or "NO" if it is not.	 With  -t,  it
	      will  not output anything, but will exit 0 (success) if EMAIL is
	      in the allow-list, or 1 (failure) if it  is  not.	 With  the  -m
	      (mark-spam) option, any previous allow-list entry for EMAIL will
	      be removed. Finally, with the -M	(mark-nonspam)	option,	 EMAIL
	      will be added to the allow-list if it is not already on it.

	      If  EMAIL is just the word MSG on its own, then an email will be
	      read from standard input, and the email addresses given  in  the
	      "From:" and "Return-Path:" headers will be used.

	      Using -e automatically switches on -a.

	      If  you also specify -y, then the deny-list will be operated on.
	      Remember that -m and -M are reversed with the deny-list.

	      If you specify an email address of  the  form  @domain  (nothing
	      before  the  @),	then  the  whole  domain will be allow or deny
	      listed.

       -v, --verbose
	      Add extra X-QSF-Info headers to any filtered  email,  containing
	      error  messages  and  so on if applicable.  Specify -v more than
	      once to increase verbosity.

       -T, --train SPAM NONSPAM [MAXROUNDS]
	      Train the database using the two mbox folders SPAM and  NONSPAM,
	      by testing each message in each folder and updating the database
	      each time a message is miscategorised.   This  is	 done  several
	      times, and may take a while to run.  Specify the -a (allow-list)
	      flag to add every sender in the NONSPAM folder  to  your	allow-
	      list  as a side-effect of the training process.  If MAXROUNDS is
	      specified, training will end after this number of rounds if  the
	      results  are  still not good enough. The default is a maximum of
	      200 rounds.

       -m, --mark-spam
	      Instead of passing the message out on standard output, mark  its
	      contents	as  spam  and update the database accordingly.	If the
	      allow-list (-a) is enabled, the message's "From:"	 and  "Return-
	      Path:"  addresses are removed from the allow-list.  If the deny-
	      list (-y) is enabled and you specify  -m	twice,	the  message's
	      addresses are added to the deny-list instead.

       -M, --mark-nonspam
	      Instead  of passing the message out on standard output, mark its
	      contents as non-spam and update the  database  accordingly.   If
	      the  allow-list  (-a)  is	 enabled,  the	message's  "From:" and
	      "Return-Path:" addresses are added to the allow-list (see the -a
	      option above).  If the deny-list (-y) is enabled and you specify
	      -M twice, the message's addresses are removed from the deny-list
	      instead.

       -w, --weight WEIGHT
	      When  marking  as	 spam  or non-spam, update the database with a
	      weighting of WEIGHT per token instead of the default of 1.  Use-
	      ful when correcting mistakes, eg a message that has been mistak-
	      enly detected as spam should  be	marked	as  non-spam  using  a
	      weighting	 of  2, i.e. double the usual weighting, to counteract
	      the error.

       -D, --dump [FILE]
	      Dump the contents of the database as a platform-independent text
	      file, suitable for archival, transfer to another machine, and so
	      on.  The data is output on stdout or into the given FILE.

       -R, --restore [FILE]
	      Rebuild the database from scratch from the text file  on	stdin.
	      If  a  FILE  is  given,  data is read from there instead of from
	      stdin.

       -O, --tokens
	      Instead of filtering, output a list of the tokens found  in  the
	      message read from standard input, along with the number of times
	      each token was found.  This is only useful if you	 want  to  use
	      qsf  as a general tokeniser for use with another filtering pack-
	      age.

       -E, --merge OTHERDB
	      Merge the OTHERDB database into the current database.  This  can
	      be  useful  if  you want to take one user's mailbox and merge it
	      into the system-wide one, for instance (this would be  done  by,
	      as  root,	 doing	qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and
	      then removing /home/user/.qsfdb).

       -B, --benchmark SPAM NONSPAM [MAXROUNDS]
	      Benchmark the training process using the two mbox	 folders  SPAM
	      and  NONSPAM.  A temporary database is created and trained using
	      the first 75% of the messages  in	 each  folder,	and  then  the
	      entire  contents	of each folder is tested to see how many false
	      positives and false negatives occur. Some timing information  is
	      also displayed.

	      This can be used to decide which backend is best on your system.
	      Use -d to select a backend, eg qsf -B spam  nonspam  -d  GDBM  -
	      this  will  create  a temporary database which is removed after-
	      wards.

	      The exception to	this  is  the  MySQL  backend,	where  a  full
	      database	    specification      must	 be	 given	   (-d
	      MySQL:database=db;host=localhost;...)  and  the  database	 table
	      given will not be wiped beforehand or dropped afterwards.

	      As  with	-T,  if MAXROUNDS is specified, training will never be
	      done for more than this number of rounds; the default is 200.

       -h, --help
	      Print a usage message on standard output and exit	 successfully.

       -l, --license
	      Print  details  of  the program's license on standard output and
	      exit successfully.

       -V, --version
	      Print  version  information,  including  a  list	of   available
	      database backends, on standard output and exit successfully.

DEPRECATED OPTIONS
       The  following  options	are  only  for	use  with  the old binary tree
       database backend or old databases that haven't been upgraded to the new
       format that came in with version 1.1.0.

       -N, --no-autoprune
	      When  marking  as spam or nonspam, never automatically prune the
	      database.	 Usually the database is pruned after every 500 marks;
	      if  you  would  rather --prune manually, use -N to disable auto-
	      matic pruning.

       -p, --prune
	      Remove redundant entries from the database and  clean  it	 up  a
	      little.	This  is  automatically	 done  after  several calls to
	      --mark-spam or --mark-nonspam, and during training with  --train
	      if  the  training	 takes	a large number of rounds, so it should
	      rarely be necessary to use --prune manually unless you are using
	      -N / --no-autoprune.

       -X, --prune-max NUM
	      When the database is being pruned, no more than NUM entries will
	      be considered for removal.  This is to prevent  CPU  and	memory
	      resources	 being taken over.  The default is 100,000 but in some
	      circumstances (if you find that pruning  takes  too  long)  this
	      option may be used to reduce it to a more manageable number.

FILES
       /var/lib/qsfdb
	      The default (system-wide) spam database.	If you wish to install
	      qsf system-wide, this should be  read-only  to  everyone;	 there
	      should  be  one  user  with write access who can update the spam
	      database with qsf --mark-spam and qsf --mark-non-spam when  nec-
	      essary.

       /var/lib/qsfdb2
	      A	 second,  read-only,  system-wide database. This can be useful
	      when installing  qsf  system-wide	 and  using  third-party  spam
	      databases; the first global database can be updated with system-
	      specific changes, and this second database can  be  periodically
	      updated when the third-party spam database is updated.

       $HOME/.qsfdb
	      The  default  spam  database  for	 per-user data.	 Users without
	      write access to the system-wide database will  have  their  data
	      written  here, and the two databases will be read together.  The
	      per-user database will be given a	 weighting  equivalent	to  10
	      times the weighting of the global database.

NOTES
       Currently,  you	cannot use qsf to check for spam while the database is
       being updated.  This means that while an update	is  in	progress,  all
       email is passed through as non-spam.

       There  is  an  upper  size  limit  of 512Kb on incoming email; anything
       larger than this is just passed through as non-spam, to avoid tying  up
       machine resources.

       The  plaintext  token  mapping  maintained  by  --plain-map  will never
       shrink, only grow.  It is intended for use  by  housekeeping  and  user
       interface  scripts  that,  for  instance,  the user can use to list all
       email addresses on their allow-list.  These scripts should take care of
       weeding	out entries for tokens that are no longer in the database.  If
       you have no such scripts, there is probably no point in using  --plain-
       map anyway.

       Avoid  using  the deny-list (-y) in any automated retraining, as it can
       be cause the filter to reject mail unnecessarily.  In general the deny-
       list  is	 probably  best left unused unless explicitly required by your
       particular setup.

       If both the allow-list  and  the	 deny-list  are	 enabled,  then	 email
       addresses  will first be checked against the deny-list, then the allow-
       list, then the domain of the email address will be checked for matching
       "@domain" entries in the deny-list and then in the allow-list.

EXAMPLES
       To filter all of your mail through qsf, with the allow-list enabled and
       the "spam rating" header being added,  add  this	 to  your  .procmailrc
       file:

	       :0 wf
	       | qsf -ra

       If  you want qsf to add "[SPAM]" to the subject line of any messages it
       thinks are spam, do this instead:

	       :0 wf
	       | qsf -sra

       To automatically mark any email sent to spambox@yourdomain.com as  spam
       (this is the "naive" version):

	       :0 H
	       * ^To:.*spambox@yourdomain.com
	       | qsf -am

       To  do  the  same,  but cleverly, so that only email to spambox@yourdo-
       main.com which qsf does NOT already classify as	spam  gets  marked  as
       spam  in	 the  database	(this  stops  the database getting too heavily
       weighted):

	       # If sent to spambox@yourdomain.com:
	       :0
	       * ^To:.*spambox@yourdomain.com
	       {
		  :0 wf
		  | qsf -a

		  # The above two lines can be skipped if you've
		  # already piped the message through qsf.

		  # If the qsf database says it's not spam,
		  # mark it as spam!
		  :0 H
		  * ^X-Spam: NO
		  | qsf -am
	       }

       Remove the -a option in the above examples if you don't want to use the
       allow-list.

       A  more	complicated filtering example - this will only run qsf on mes-
       sages which don't have a subject line saying "your  <something>	is  on
       fire"  and  which  don't have a sender address ending in "@foobar.com",
       meaning that messages with that subject line  OR	 that  sender  address
       will NEVER be marked as spam, no matter what:

	       :0 wf
	       * ! ^Subject: Your .* is on fire
	       * ! ^From: .*@foobar.com
	       | qsf -ra

       For  more  on  procmail(1)  recipes,  see  the  procmailrc(5) and proc-
       mailex(5) manual pages.

       A couple of macros to add to your .muttrc file, if you use mutt(1) as a
       mail user agent:

	       # Press F5 to mark a message as spam and delete it
	       macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
	       macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

	       # Press F9 to mark a message as non-spam
	       macro index <f9> "<pipe-message>qsf -aM\n"
	       macro pager <f9> "<pipe-message>qsf -aM\n"

       Again,  remove the -a option in the above examples if you don't want to
       use the allow-list.

       Note, however, that the above macros won't work when operating on  mul-
       tiple tagged messages. For that, you'd need something like this:

	       macro   index   <f5>  ":set  pipe_split\n<tag-prefix><pipe-mes-
	      sage>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"

       If you use qmail(7), then to get procmail working with it you will need
       to put a line containing just DEFAULT=./Maildir/ at  the	 top  of  your
       ~/.procmailrc  file,  so	 that procmail delivers to your Maildir folder
       instead of trying to deliver to	/var/spool/mail/$USER,	and  you  will
       need to put this in your ~/.qmail file:

	       | preline procmail

       This  will  cause all your mail to be delivered via procmail instead of
       being delivered directly into your mail directory.

       See the qmail(7) documentation for more about mail delivery with qmail.

       If you use postfix(1), you can set up a system-wide mail filter by cre-
       ating a user account for the purpose of filtering mail, populating that
       account's  .qsfdb,  and	then  creating	a shell script, to run as that
       user, which runs qsf on stdin and passes stdout to sendmail(8).

       Doing this requires some knowledge of postfix  configuration  and  care
       needs  to  be  taken to avoid mail loops.  One qsf user's full HOWTO is
       included in the doc/ directory with this package.

THE ALLOW-LIST
       A feature called the "allow-list" can be switched on by specifying  the
       --allowlist  or	-a option.  This causes messages' "From:" and "Return-
       Path:" addresses to be checked against a list of people you  have  said
       to  allow  all  messages	 from,	and if a message's "From:" or "Return-
       Path:" address is in the list, it is never marked as spam.  This	 means
       you can add all your friends to an "allow-list" and qsf will then never
       mis-file their messages - a quick way to do this is to use -a  with  -T
       (train);	 everyone  in  your  non-spam folder who has sent you an email
       will be added to the allow-list automatically during training.

       You can manually add and remove addresses to and	 from  the  allow-list
       using  the  -e  (email) option. For instance, to add foo@bar.com to the
       allow-list, do this:

	       qsf -e foo@bar.com -M

       To remove bad@nasty.com from the allow-list, do this:

	       qsf -e bad@nasty.com -m

       And to see whether someone@somewhere.com is in the allow-list  or  not,
       just do this:

	       qsf -e someone@somewhere.com

       In  general,  you  probably  always  want  to enable the allow-list, so
       always specify the -a option when using qsf.  This  will	 automatically
       maintain the allow-list based on what you classify as spam or non-spam.

       The only times you might want to turn it off are when  people  on  your
       allow-list  are prone to getting viruses or if a virus is causing email
       to be sent to you that is pretending to be from someone on your	allow-
       list.

BACKUP AND RESTORE
       Because	the database format is platform-specific, it is a good idea to
       periodically dump the database to a text file using qsf -D so that,  if
       necessary,  it  can be transferred to another machine and restored with
       qsf -R later on.

       Also note that since the actual contents of email  messages  are	 never
       stored  in  the	database (see TECHNICAL DETAILS), you can safely share
       your qsf database with friends - simply dump your database to  a	 file,
       like this:

	       qsf -D > your-database-dump.txt

       Once  you  have sent your-database-dump.txt to another person, they can
       do this:

	       qsf -R < your-database-dump.txt

       They will then have an identical database to yours.

TECHNICAL DETAILS
       When a message is passed to qsf, any attachments are decoded, all  HTML
       elements	 are  removed,	and  the  message  text is then broken up into
       "tokens", where a "token" is a single  word  or	URL.   Each  token  is
       hashed  using  the  MD5 algorithm (see below for why), and that hash is
       then used to look up each token in the qsf database.

       For full details of which parts of an  email  (headers,	body,  attach-
       ments, etc) are used to calculate the spam rating, see the TOKENISATION
       section below.

       Within the database, each token has two numbers associated with it: the
       number  of  times  that	token has been seen in spam, and the number of
       times it has been seen in non-spam.  These two numbers, along with  the
       total  number of spam and non-spam messages seen, are then used to give
       a "spamminess" value for	 that  particular  token.   This  "spamminess"
       value  ranges  from  "definitely	 not  spammy" at one end of the scale,
       through "neutral" in the middle, up to "definitely spammy" at the other
       end.

       Once  a "spamminess" value has been calculated for all of the tokens in
       the message, a summary calculation is made to give an overall "is  this
       spam?"  probability rating for the message.  If the overall probability
       is 0.9 or above, the message is flagged as spam.

       In addition to the probability test is the  "allow-list".   If  enabled
       (with  the  -a  option),	 the whole probability check is skipped if the
       sender of the message is listed in the allow-list, and the  message  is
       not marked as spam.

       When  training  the  database,  a  message  is  split up into tokens as
       described above, and then the numbers in the database  for  each	 token
       are  simply  added  to: if you tell qsf that a message is spam, it adds
       one to the "number of times seen in spam" counter for each  token,  and
       if  you	tell  it  a message is not spam, it adds one to the "number of
       times seen in non-spam" counter for  each  token.   If  you  specify  a
       weight, with -w, then the number you specify is added instead of one.

       To  stop	 the database growing uncontrollably, the database keeps track
       of when a token was last	 used.	 Underused  tokens  are	 automatically
       removed	from  the  database.  (The old method was to "prune" every 500
       updates).

       Finally, the reason MD5 hashes were used is  privacy.   If  the	actual
       tokens  from the messages, and the actual email addresses in the allow-
       list, were stored, you could not share a single	qsf  database  between
       multiple	 users	because	 bits  of  everyone's messages would be in the
       database - things like emailed passwords, keywords relating to personal
       gossip, and so on.  So a hash is stored instead.	 A hash is a "one-way"
       function; it is easy to turn a token into a hash but  very  hard	 (some
       might  say  impossible) to turn a hash back into the token that created
       it.  This means that you end up with a database with no personal infor-
       mation in it.

TOKENISATION
       When  a	message is broken up into tokens, various parts of the message
       are treated in different ways.

       First, all header fields are discarded, except for the important	 ones:
       From, Return-Path, Sender, To, Reply-To, and Subject.

       Next,  any MIME-encoded attachments are decoded.	 Any attachments whose
       MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
       having  any  HTML  tags	stripped.   Any	 non-textual  attachments  are
       replaced with their MD5 hash (such that two identical attachments  will
       have the same hash), and that hash is then used as a token.

       In  addition to single-word tokens from textual message parts, qsf adds
       doubled-up tokens so that word pairs get added to the  database.	  This
       makes  the  database a bit bigger (although the automatic pruning tends
       to take care of that) but makes matching more exact.

SPECIAL FILTERS
       As well as using the textual content of email to detect spam, qsf  also
       uses  special  filters  which  create  "pseudo-tokens" based on various
       rules.  This means that specific patterns, not just  individual	words,
       can be used to determine whether a message is spam or not.

       For  example,  if a message contains lots of words with multiple conso-
       nants, like "ashjkbnxcsdjh", then each time a word like	that  is  seen
       the  special  token  ".GIBBERISH-CONSONANTS."  is  added to the list of
       tokens found in the message.  If it turns out that most	messages  with
       words  that trigger this filter rule are spam, then other messages with
       gibberish consonant strings will be more likely to be flagged as	 spam.

       Currently the special filters are:

       GTUBE  Flags	 any	  message      containing      the	string
	      XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-
	      EMAIL*C.34X as spam - useful for testing that your qsf installa-
	      tion is working.

       ATTACH-SCR

       ATTACH-PIF

       ATTACH-EXE

       ATTACH-VBS

       ATTACH-VBA

       ATTACH-LNK

       ATTACH-COM

       ATTACH-BAT
	      Adds a token for every attachment whose filename ends in ".scr",
	      ".pif",  ".exe",	".vbs",	 ".vba",  ".lnk",  ".com",  and ".bat"
	      respectively (these are often viruses).

       ATTACH-GIF

       ATTACH-JPG

       ATTACH-PNG
	      Adds a token for every attachment whose filename ends in ".gif",
	      ".jpg" or ".jpeg", and ".png" respectively.

       ATTACH-DOC

       ATTACH-XLS

       ATTACH-PDF
	      Adds a token for every attachment whose filename ends in ".doc",
	      ".xls", or ".pdf" respectively (these tend to  indicate  a  non-
	      spam email).

       SINGLE-IMAGE
	      Adds a token if the message contains exactly one attached image.

       MULTIPLE-IMAGES
	      Adds a token if the message  contains  more  than	 one  attached
	      image.

       GIBBERISH-CONSONANTS
	      Adds  a  token for every word found that has multiple consonants
	      in a row, as described above.  Spam often	 contains  strings  of
	      gibberish.

       GIBBERISH-VOWELS
	      Adds  a token for every word found that has multiple vowels in a
	      row, eg "aeaiaiaeeio".

       GIBBERISH-FROMCONS
	      Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
	      Path:" addresses on their own.

       GIBBERISH-FROMVOWL
	      Like  GIBBERISH-VOWELS,  but  only  for the "From:" and "Return-
	      Path:" addresses on their own.

       GIBBERISH-BADSTART
	      Adds a token for every word that starts  with  a	bad  character
	      such as %.

       GIBBERISH-HYPHENS
	      Adds  a  token  for  every  word with more than three hyphens or
	      underscores in it.

       GIBBERISH-LONGWORDS
	      Adds a token for every word with over 30 characters in  it  (but
	      less than 60).

       HTML-COMMENTS-IN-WORDS
	      Adds  a  token  for  every HTML comment found in the middle of a
	      word.   Spam  often  contains  HTML  inside  words,  like	 this:
	      w<!--dsgfhsdgjgh-->ord

       HTML-EXTERNAL-IMG
	      Adds  a  token  for every HTML <img> (image) tag found that con-
	      tains :// (i.e.  it refers to an external image).

       HTML-FONT
	      Adds a token for every HTML <font> tag found.

       HTML-IP-IN-URLS
	      Adds a token for every URL found containing an IP address.

       HTML-INT-IN-URL
	      Adds a token for every URL found containing an  integer  in  its
	      hostname.

       HTML-URLENCODED-URL
	      Adds  a  token  for  every  URL found containing a % sign in its
	      hostname.

       Normally, filters will just cause a token to be added, and these tokens
       are  processed  by  the	normal weighting algorithm.  However the GTUBE
       filter will immediately flag any matching message  as  spam,  bypassing
       the token matching.

DATABASE BACKENDS
       The  inbuilt  "list"  database backend will not necessarily provide the
       best performance, but is provided because using it requires no external
       libraries.

       If,  when  qsf was compiled, the correct libraries were available, then
       it will be possible to use qsf with alternative database backends.   To
       find  out which backends you have available, run qsf -V (capital V) and
       read the second line of output.	To see how well	 a  backend  performs,
       collect	some  spam and non-spam and use qsf -d BACKEND -B SPAM NONSPAM
       (see the entry for -B above).

       Some people find that they get the best performance  out	 of  the  gdbm
       backend; this is a library that is widely available on many systems.

       To  efficiently	share a qsf database across multiple machines, you may
       find the MySQL backend useful.  However, using it is a little more com-
       plicated.

       To  use	the  MySQL  backend  you  will need to create a table with the
       fields key1, key2,  token,  value1,  value2  and	 value3.   The	token,
       value1,	value2,	 and value3 fields must be VARCHAR(64), BIGINT or INT,
       and BIGINT or INT respectively, and indexing on the token  field	 is  a
       good  idea.  The key1 and key2 fields can be anything, but they must be
       present.

       For example:

		USE mydatabase;
		CREATE TABLE qsfdb (
		  key1	    BIGINT UNSIGNED NOT NULL,
		  key2	    BIGINT UNSIGNED NOT NULL,
		  token	    VARCHAR(64) DEFAULT '' NOT NULL,
		  value1    INT UNSIGNED NOT NULL,
		  value2    INT UNSIGNED NOT NULL,
		  value3    INT UNSIGNED NOT NULL,
		  PRIMARY KEY (key1,key2,token),
		  KEY (key1),
		  KEY (key2),
		  KEY (token)
		);

       The key1 and key2 fields allow you to have multiple  qsf	 databases  in
       one  table, by specifying different key1 and key2 values on invocation.

       Instead of specifying a database file with the --database / -d  option,
       you  must  specify either a specification string as described below, or
       the name of a file containing such a string on its first line.

       The specification string is as follows:

		database=DATABASE;host=HOST;port=PORT;
		user=USER;pass=PASS;table=TABLE;
		key1=KEY1;key2=KEY2

       This string must be all on one line, with no spaces.

       DATABASE
	      is the name of the MySQL database.

       HOST   is the hostname of the database server (eg "localhost").

       PORT   is the TCP port to connect on (eg 3306).

       USER   is the username to connect with.

       PASS   is the password to connect with.

       TABLE  is the database table to use.  If a table with  this  name  does
	      not exist when qsf is called in update or training mode, then it
	      will be created if permissions allow this to be done.

       KEY1   is the value to use for the key1 field.

       KEY2   is the value to use for the key2 field.

       Since command lines can be seen in the process  list,  it  is  probably
       best  to	 specify  a  filename (eg qsf -d mysql:qsfdb.spec) and put the
       specification string inside that file.

TROUBLESHOOTING
       If you have problems with qsf, please check the	list  below;  if  this
       does  not  help,	 go  to	 the qsf home page and investigate the mailing
       lists, or email the author.

       Nothing is being marked as spam.
	      First, use the -r option to switch on the X-Spam-Rating  header,
	      and  check that this header appears in email passed through qsf.
	      If it does not, then it is likely that qsf is not being  run  at
	      all - check your configuration of procmail(1) or its equivalent.

	      If you are seeing X-Spam-Rating headers,	and  different	emails
	      have  different scores, then you may simply need to retrain your
	      database a little more.  Take more spam email and pass it to qsf
	      -m.

	      If  you  are  seeing X-Spam-Rating headers but they all give the
	      same spam rating, then the most likely reason is that qsf is not
	      reading any database.  Make sure that whatever is processing the
	      email has read permissions on /var/lib/qsfdb and/or  ~/.qsfdb  -
	      and  make	 sure  that,  if  you  are  using  ~/.qsfdb, what your
	      database creator thought was ~ ($HOME) is the same as it is  for
	      whatever is processing the email.

       Retraining sometimes takes a very long time.
	      With  the	 obtree	 backend  or  2-column MySQL or SQLite tables,
	      every 500th retrain (-m or -M), the database is pruned.  On some
	      systems  this  may  take	some  time,  and  during this time the
	      database is locked (except when using the MySQL or SQLite	 back-
	      ends).   If  you	constantly  do a lot of retraining and want to
	      avoid this, then use the -N option to suppress auto-pruning, and
	      then have a cron(8) job or something run a manual prune (qsf -p)
	      every now and again.

       Running qsf from procmail fails with an error.
	      If you can run qsf from the command line, but in	your  procmail
	      log file you get errors about "qsf: cannot execute binary file",
	      then contact your system administrator for help. It may be  that
	      incoming	email  is handled by a different server to the one you
	      normally shell into, and either they are of a  different	archi-
	      tecture or operating system, or the mail server is not permitted
	      to execute user-owned binaries.

ACKNOWLEDGEMENTS
       The following people have contributed suggestions,  comments,  patches,
       and testing:

	      Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
	      Dr Kelly A. Parker
	      Vesselin Mladenov <http://www.antipodes.bg/>
	      Glyn Faulkner
	      Mark Reynolds
	      Sam Roberts
	      Scott Allen
	      Karsten Kankowski
	      M. Kolbl
	      Micha Holzmann
	      Jef Poskanzer <http://www.acme.com/jef/>
	      Clemens Fischer <http://ino-waiting.gmxhome.de/>
	      Nelson A. de Oliveira
	      Michal Vitecek
	      Tommy Pettersson <http://www.lysator.liu.se/~ptp/>

AUTHOR
       The author:

	      Andrew Wood <andrew.wood@ivarch.com>
	      http://www.ivarch.com/

       Project home page:

	      http://www.ivarch.com/programs/qsf/

BUGS
       If  you find any bugs, please contact the author, either by email or by
       using the contact form on the web site.

SEE ALSO
       procmail(1), procmailrc(5), procmailex(5)

       Someone has written a guide to using qsf with KMail that can  be	 found
       at:
       http://www.softwaredesign.co.uk/Information.SpamFilters.html

LICENSE
       This is free software, distributed under the ARTISTIC license.

Linux				 February 2007				QSF(1)
