# FuzzyOCR help (tesseract, ocrad-decolorize, etc) [SOLVED]

## hanj

Hello All

I'm having trouble with spamassassin-fuzzyocr. I just emerged it because I've been getting my butt kicked by image spam lately. It is up and running, and it appears to be working and scoring spam images, but I'm seeing the following 'issues' in mail.log. I'm hoping someone can point me in the right direction on cleaning these up.

```
Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Unable to read output from "/var/amavis/tmp/.spamassassin20440b4y50jtmp/scanset.tesseract.out.txt" for scanset tesseract
Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "tesseract"
Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: Unable to load unicharset file /usr/share/tessdata/eng.unicharset
Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize-invert"
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8d8c - not a PAM, PPM, PGM, or PBM file
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize"
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8d8c - not a PAM, PPM, PGM, or PBM file
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Unable to read output from "/var/amavis/tmp/.spamassassin20440b4y50jtmp/scanset.tesseract.out.txt" for scanset tesseract
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "tesseract"
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: Unable to load unicharset file /usr/share/tessdata/eng.unicharset
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize-invert"
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8181 - not a PAM, PPM, PGM, or PBM file
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize"
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8181 - not a PAM, PPM, PGM, or PBM file
```

Here are the relevant packages and USE flags:

```
[ebuild   R   ] mail-mta/postfix-2.5.5  USE="mysql pam sasl ssl vda -cdb -dovecot-sasl -hardened -ipv6 -ldap -mailwrapper -mbox -nis -postgres (-selinux)" 3,097 kB
[ebuild   R   ] mail-filter/spamassassin-3.2.1-r1  USE="berkdb mysql ssl -doc -ipv6 -ldap -postgres -qmail -sqlite -tools" 959 kB
[ebuild   R   ] mail-filter/amavisd-new-2.6.1-r1  USE="mysql -courier -dkim -ldap -milter -postgres -qmail -razor -spamassassin" 891 kB
[ebuild   R   ] mail-filter/spamassassin-fuzzyocr-3.5.1-r1  USE="amavis dbm gocr logrotate mysql ocrad tesseract" 0 kB
```

Thanks!

hanji

----------

## hanj

Hello All

I have this all sorted out now. I'll post how I did it for others...

The ocrad and pamthreshold errors were solved by applying the following patches for tempdir handling, located here:

https://bugs.gentoo.org/attachment.cgi?id=175916&action=view

https://bugs.gentoo.org/attachment.cgi?id=175917&action=view

https://bugs.gentoo.org/show_bug.cgi?id=251687

```
diff -ur FuzzyOcr.orig/Deanimate.pm FuzzyOcr/Deanimate.pm
--- FuzzyOcr.orig/Deanimate.pm   Sun Jan  7 19:05:18 2007
+++ FuzzyOcr/Deanimate.pm   Thu Nov 15 13:19:00 2007
@@ -8,13 +8,14 @@
 use FuzzyOcr::Config qw(get_config set_config get_tmpdir);
 use FuzzyOcr::Misc qw(save_execute);
 use FuzzyOcr::Logging qw(errorlog warnlog infolog);
+use File::Basename qw(dirname);
 
 # Provide functions to deanimate gifs
 
 sub deanimate {
     my $conf = get_config();
-    my $imgdir = get_tmpdir();
     my $tfile = shift;
+    my $imgdir = dirname($tfile);
     my $efile = $tfile . ".err";
     my $tfile2 = $tfile;
     my $tfile3 = $tfile;
@@ -58,8 +59,8 @@
 
 sub gif_info {
     my $conf = get_config();
-    my $imgdir = get_tmpdir();
     my $giffile = $_[0];
+    my $imgdir = dirname($giffile);
     
     my $fd = new IO::Handle;
     
diff -ur FuzzyOcr.orig/Preprocessor.pm FuzzyOcr/Preprocessor.pm
--- FuzzyOcr.orig/Preprocessor.pm   Sun Jan  7 19:05:18 2007
+++ FuzzyOcr/Preprocessor.pm   Thu Nov 15 12:31:05 2007
@@ -1,5 +1,7 @@
 package FuzzyOcr::Preprocessor;
 
+use File::Basename qw(dirname);
+
 sub new {
     my ($class, $label, $command, $args) = @_;
 
@@ -12,7 +14,7 @@
 
 sub run {
     my ($self, $input) = @_;
-    my $tmpdir = FuzzyOcr::Config::get_tmpdir();
+    my $tmpdir = dirname($input);
     my $label = $self->{label};
     my $output = "$tmpdir/prep.$label.out";
     my $stderr = ">$tmpdir/prep.$label.err";
diff -ur FuzzyOcr.orig/Scanset.pm FuzzyOcr/Scanset.pm
--- FuzzyOcr.orig/Scanset.pm   Sun Jan  7 19:05:18 2007
+++ FuzzyOcr/Scanset.pm   Thu Nov 15 13:20:39 2007
@@ -2,6 +2,7 @@
 
 use lib qw(..);
 use FuzzyOcr::Logging qw(errorlog);
+use File::Basename qw(dirname);
 
 sub new {
     my ($class, $label, $preprocessors, $command, $args, $output_in) = @_;
@@ -19,7 +20,7 @@
 sub run {
     my ($self, $input) = @_;
     my $conf = FuzzyOcr::Config::get_config();
-    my $tmpdir = FuzzyOcr::Config::get_tmpdir();
+    my $tmpdir = dirname($input);
     my $label = $self->{label};
     my $output = "$tmpdir/scanset.$label.out";
     my $stderr = ">$tmpdir/scanset.$label.err";
```

and

```
diff -u -r FuzzyOcr-3.5.1-orig/FuzzyOcr.pm FuzzyOcr-3.5.1/FuzzyOcr.pm
--- FuzzyOcr-3.5.1-orig/FuzzyOcr.pm   2007-01-07 04:05:08.000000000 -0800
+++ FuzzyOcr-3.5.1/FuzzyOcr.pm   2007-04-17 14:21:25.000000000 -0700
@@ -146,7 +146,7 @@
             ){
             $fname = join('',@{$p->{'headers'}->{'content-id'}});
             $fname =~ s/[<>]//g;
-            $fname =~ tr/\@\$\%\&/_/s;
+            $fname =~ tr/\@\$\%\&\./_/s;
         }
 
         my $filename = $fname; $filename =~ tr{a-zA-Z0-9\-.}{_}cs;
```
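The core idea of the first patch is that intermediate files are written next to the input image (via `dirname`) instead of in the directory returned by `get_tmpdir()`. Here is a rough shell illustration of that path rule. It is not FuzzyOcr's actual Perl code, and the function name `scanset_output_path` and the example path are made up for this sketch:

```shell
# Hypothetical illustration of the patched Scanset.pm path rule:
# the output file lives in the same directory as the input image.
scanset_output_path() {
    input=$1   # full path of the image being scanned
    label=$2   # scanset label, e.g. "tesseract"
    printf '%s\n' "$(dirname "$input")/scanset.$label.out"
}

# Example call with an assumed input path
scanset_output_path /var/amavis/tmp/example/image.pnm tesseract
```

With this rule, each message's scan files stay inside that message's own temporary directory.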

The error about tesseract was interesting:

```
Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "tesseract"
Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: Unable to load unicharset file /usr/share/tessdata/eng.unicharset
```

The file existed but had zero length. After some more research I found that tesseract needs to be built with the proper LINGUAS setting. I added LINGUAS="en" to my make.conf, enabled the tiff USE flag, and rebuilt it:

```
[ebuild   R   ] app-text/tesseract-2.03  USE="tiff" LINGUAS="en -de -de_FR -es -fr -it -nl -pt -vi" 2,040 kB
```
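A quick way to spot this kind of problem is to look for zero-length files in the tessdata directory. This is a minimal sketch, assuming the path from the log above; `find_empty_tessdata` is a hypothetical helper name, not part of tesseract or FuzzyOcr:

```shell
# Hypothetical helper: list zero-length files in a tessdata directory.
# The default path is taken from the error message; adjust if yours differs.
find_empty_tessdata() {
    dir="${1:-/usr/share/tessdata}"
    # -size 0 matches files that occupy zero blocks, i.e. empty files
    find "$dir" -type f -size 0
}
```

Running `find_empty_tessdata` with no argument checks the default directory. If it prints nothing, none of the language data files are empty.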

After the rebuild, the eng.unicharset file was no longer empty. I thought my problems were solved, but I got another error from tesseract:

```
Apr 24 08:11:49 comp[16615]: (16615-16) (!)SA error: FuzzyOcr: Unable to read output from "/var/amavis/tmp/.spamassassin16615WjYr4Utmp/scanset.tesseract.out.txt" for scanset tesseract
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Errors in Scanset "tesseract"
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Return code: 7936, Error: Tesseract Open Source OCR Engine
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: name_to_image_type:Error:Unrecognized image type:/var/amavis/tmp/.spamassassin16615WjYr4Utmp/prep.maketiff.out
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: IMAGE::read_header:Error:Can't read this image type:/var/amavis/tmp/.spamassassin16615WjYr4Utmp/prep.maketiff.out
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: /usr/bin/tesseract:Error:Read of file failed:/var/amavis/tmp/.spamassassin16615WjYr4Utmp/prep.maketiff.out
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
```

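After some Googling I found this Debian bug report:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=481383

It has the following patch, which names the maketiff output with a .tif extension instead of .out (judging by the name_to_image_type error above, tesseract does not recognize the .out file as an image):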

```
--- Preprocessor.pm.ORIG   2008-05-15 18:24:22.000000000 +0200
+++ Preprocessor.pm   2008-05-15 18:51:03.000000000 +0200
@@ -15,6 +15,9 @@ sub run {
     my $tmpdir = FuzzyOcr::Config::get_tmpdir();
     my $label = $self->{label};
     my $output = "$tmpdir/prep.$label.out";
+    if ($label =~ /maketiff/) {
+        $output = "$tmpdir/prep.$label.tif";
+    }
     my $stderr = ">$tmpdir/prep.$label.err";
 
     my $stdin = undef;
--- Scanset.pm.ORIG   2008-05-15 18:56:11.000000000 +0200
+++ Scanset.pm   2008-05-15 19:03:26.000000000 +0200
@@ -63,7 +63,12 @@ sub run {
                 return ($retcode,@result);
             }
             # Input of next processor is output of last
-            $input = "$tmpdir/prep.$plabel.out";
+            # Output name of maketiff is special!
+            if ($plabel =~ /maketiff/) {
+                $input = "$tmpdir/prep.$plabel.tif";
+            } else {
+                $input = "$tmpdir/prep.$plabel.out";
+            }
         }
     }
```
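In shell terms, the naming rule introduced by this patch looks roughly like the following. This is only an illustration of the logic, not code from FuzzyOcr; `prep_output_path` is a hypothetical name:

```shell
# Hypothetical illustration of the patched naming rule: preprocessors whose
# label contains "maketiff" write a .tif file, all others keep the .out suffix.
prep_output_path() {
    tmpdir=$1   # directory holding the intermediate files
    label=$2    # preprocessor label, e.g. "maketiff" or "pamthreshold"
    case "$label" in
        *maketiff*) printf '%s\n' "$tmpdir/prep.$label.tif" ;;
        *)          printf '%s\n' "$tmpdir/prep.$label.out" ;;
    esac
}
```

Both Preprocessor.pm and Scanset.pm have to apply the same rule, because the scanset reads the file that the preprocessor wrote. That is why the patch touches both modules.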

After I made all of these changes, the errors were much, much better. I hope this helps.

hanji

----------

## volumen1

You rule.  This totally fixed me up as well!

----------

## VinnieNZ

There is also this bug, which has patches that fix the Exporter.pm error that shows up when spamassassin --lint is run:

https://bugs.gentoo.org/249668

----------

