From: bortzmeyer@pasteur.fr (Stephane Bortzmeyer)
Subject: Small contribution: two scripts
Date: 14 Feb 1996 16:02:37 GMT
Organization: Institut Pasteur, Paris, France

I wrote a small bit of Perl code to determine the language of a
document, allowing you to add a new attribute "language" (really useful
for multi-language servers).

Many servers in non-English-speaking countries are bilingual. It is
often useful to ask only for French pages. This can be done with
post-summarizing. We use a post-summarizer which adds an attribute
"language" whose value is "fr" or "en" (ISO codes for languages). You
can then ask:

    Bourgogne AND (language : fr)

It seems quite easy to distinguish French from English documents
automatically. Here is our setup:

Gatherer config file (gatherername.cf):

    Post-Summarizing: lib/myrules

In lib/myrules:

    type == 'HTML'
            body,language ! find-language.pl

In find-language.pl:

-------------- cut here -----------------------
#!/usr/local/bin/perl
# Post-summarizer: guess whether the "body" attribute is French or
# English and store the answer in a new "language" attribute.

require "$ENV{'HARVEST_HOME'}/lib/soif.pl";

$debug = 0;

($ttype, $url, %SOIF) = &soif'parse;

$body = $SOIF{'body'};
$lines = $probably_french = $probably_english = 0;

# Look at the first 200 lines of the body, one line at a time.
# (Splitting on newlines also makes sure the last line is examined.)
foreach $line (split (/\n/, $body)) {
    $lines++;
    last if ($lines > 200);
    if ($debug > 1) {
        print STDERR "Line $lines \"$line\"", "\n";
    }
    # A line counts as French if it contains a common French word,
    # otherwise as English if it contains a common English word.
    # Note: \b treats accented letters as word characters only under
    # locale or Unicode matching rules.
    if ($line =~ /\b(le|la|ou|à|où)\b/i) {
        $probably_french++;
        if ($debug > 0) {
            print STDERR "MATCHED FRENCH: ", $&, "\n";
        }
    }
    elsif ($line =~ /\b(the|or|but|at)\b/i) {
        $probably_english++;
        if ($debug > 0) {
            print STDERR "MATCHED ENGLISH: ", $&, "\n";
        }
    }
}

# French must lead English by more than half the lines counted;
# everything else is tagged as English.
if ($probably_french > $probably_english + ($lines/2)) {
    $SOIF{'language'} = "fr"; # Does someone know where to find the ISO
                              # codes for human languages?
}
else {
    $SOIF{'language'} = "en";
}

&soif'print ($ttype, $url, %SOIF);
-------------------------------------------------

References: "4.6 Post-Summarizing: Rule-based tuning of object
summaries"
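
To try the decision rule without a running Gatherer, the counting logic of find-language.pl can be pulled out into a standalone function. The following is only a sketch under some assumptions: guess_language is a hypothetical name, not part of Harvest; it works on a plain string instead of a SOIF object; and the accented alternatives (à, où) are left out so that the result does not depend on the character encoding of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Harvest: applies the same
# line-by-line test as find-language.pl to a plain string and
# returns "fr" or "en".
sub guess_language {
    my ($text) = @_;
    my ($lines, $french, $english) = (0, 0, 0);

    foreach my $line (split /\n/, $text) {
        $lines++;
        last if $lines > 200;    # only the first 200 lines are examined
        if ($line =~ /\b(le|la|ou)\b/i) {
            $french++;           # line contains a common French word
        }
        elsif ($line =~ /\b(the|or|but|at)\b/i) {
            $english++;          # line contains a common English word
        }
    }

    # Same rule as the script: French must lead English by more than
    # half the lines counted, otherwise the text is tagged "en".
    return ($french > $english + $lines / 2) ? "fr" : "en";
}

print guess_language("Le chat dort sur la table.\nIl mange ou il boit.\n"), "\n";  # prints "fr"
print guess_language("The cat sleeps.\nIt eats or it drinks.\n"), "\n";           # prints "en"
print guess_language("Le chat dort.\nBonjour.\n"), "\n";                          # prints "en"
```

The third call shows how strongly the rule leans towards English: one French line out of two gives 1 > 0 + 2/2, which is false, so the text is tagged "en". A document is tagged "fr" only when French-looking lines outnumber English-looking ones by more than half of the lines examined.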