# Installing FuzzyOcr with spamassassin

## Robert S

Has anyone managed to do this?  I upgraded spamassassin to 3.1.4 (this is currently masked) because previous versions do not have the secure_tmpdir() function.  FuzzyOcr requires the String-Approx perl module.  I have created an ebuild which I have submitted here.

The docs specify the following:

 *Quote:*   

> Notes for Gentoo users: All dependencies except the perl modules can be installed via portage. But because of the bugs in giftext and gocr you might need to write an ebuild which uses the two patches found on my download page. The perl modules can easily be installed with gcpan.

 

My question is - is there an ebuild out there that incorporates these patches?  I don't have any experience with patching in ebuilds.  Also - is it actually necessary to use these patches??

----------

## BeatJunkie

I found a link to one this evening:   https://secure.renaissoft.com/maia/wiki/FuzzyOCR23

 *Quote:*   

> Note for Gentoo users:
> 
> Tóth Csaba has provided an ebuild for Gentoo that covers the installation steps described in this document, including patched versions of gocr, giftext, and FuzzyOcr.pm. Download his package http://dev.davidnet.hu/gentoo-portage/fuzzyocr-gentoo-2.tar.bz2 and unpack it into /usr/local, then enable the overlay if necessary in /etc/make.conf:
> 
> PORTDIR_OVERLAY="/usr/local/portage"
> ...

 

That tar archive supplies portage overlays for spamassassin-fuzzyocr, giflib, gocr and String-Approx.  I already had the unpatched version of giflib installed, so I re-emerged it before installing fuzzyocr.  I had to rebuild the digests for some of the overlaid packages too.
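Rebuilding digests one ebuild at a time gets tedious with four packages in the overlay. Here is a minimal sketch, assuming the overlay lives in /usr/local/portage and that `ebuild <file> digest` is the right subcommand for your portage version (newer versions use `manifest` instead). The helper name `overlay_digest_cmds` is made up for illustration; it only prints the commands so they can be reviewed first.

```shell
# Print one "ebuild <file> digest" command per ebuild found in an overlay.
# Nothing is executed here; review the output, then pipe it to sh.
# overlay_digest_cmds is a hypothetical helper name, not part of portage.
overlay_digest_cmds() {
    overlay="${1:-/usr/local/portage}"
    find "$overlay" -type f -name '*.ebuild' | sort | while IFS= read -r eb; do
        printf "ebuild '%s' digest\n" "$eb"
    done
}
```

Something like `overlay_digest_cmds /usr/local/portage | sh` (as root) would then regenerate the digests for every package in the overlay.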

For me, the actual spamassassin-fuzzyocr ebuild failed in a lot of places.  I might consider using Tóth's overlays for the prerequisites, but maybe your ebuild for fuzzyocr itself.

----------

## Robert S

Tried it.  Everything emerged OK until I got up to the last ebuild - spamassassin-fuzzyocr.  Had various problems (e.g. emerge reported that the file had not downloaded, despite the fact that it did download).

I ended up installing the plugin manually.  Very easy.

I tried "spamassassin -t" on the files in the "samples" directory.  It correctly identified the animated-gif.eml and corrupted-gif.eml files as spam, but did not register with the other two files (jpeg.eml and png.eml).  Some of the other tests were positive.
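The same check can be run over the whole samples directory in one go. This is a rough sketch, assuming the sample messages end in `.eml` and that a hit shows up as the rule name `FUZZY_OCR` somewhere in the `spamassassin -t` output. `check_samples` and the `SA_CMD` override are names invented for this example.

```shell
# Report, for each sample message, whether the FUZZY_OCR rule fired.
# SA_CMD is an assumed override hook so the loop can be tried with a stub;
# by default it runs "spamassassin -t", as in the post above.
check_samples() {
    dir="${1:-samples}"
    for msg in "$dir"/*.eml; do
        [ -f "$msg" ] || continue
        if ${SA_CMD:-spamassassin -t} < "$msg" 2>/dev/null | grep -q 'FUZZY_OCR'; then
            printf '%s: FUZZY_OCR hit\n' "${msg##*/}"
        else
            printf '%s: no FUZZY_OCR hit\n' "${msg##*/}"
        fi
    done
}
```

Run it as `check_samples samples` from the directory where FuzzyOcr was unpacked.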

Anyway I've got my real mailserver set up with the gentoo unpatched versions of gocr and giflib, and I haven't had any spams containing graphics being missed since late last week.  I think I'll wait for the patched versions to come into portage and do the rest manually.

----------

## BeatJunkie

I managed to get Tóth's ebuild to work by commenting out the patches (they seem to be incorporated in the newest version of FuzzyOCR already), and then I changed the first line in the "src_install ()" section so that it reads:

```
cd ${WORKDIR}/FuzzyOcr-2.3b
```

The ebuild worked, and then I followed the instructions in the wiki page I had mentioned in my previous post to set it up with SpamAssassin...and it's working!

I tend to prefer using ebuilds whenever possible so that all the prerequisites are covered, but for those who would rather not figure out a broken ebuild, FuzzyOCR itself is a pretty simple package involving only three files to be deployed.

----------

## el*Loco

Quick note: http://dev.davidnet.hu/gentoo-portage/fuzzyocr-gentoo-3.tar.bz2 exists and fixes (at least my) problems with digests etc.

----------

## warthog

FYI, there is a bug in bugzilla to have the ebuild added to portage:

https://bugs.gentoo.org/show_bug.cgi?id=154392

----------

## fabio-c

I wonder if anyone knows when the ebuilds will actually be added to the tree?

I'm kinda waiting for them. I installed the unofficial ebuilds on a few machines, but adding overlays all the time isn't the easiest way, so when does this thing hit the tree?

FuzzyOCR rocks btw, it's fishing out 2000+ mails/day here.

Haven't experienced any segfaults/crashes.

----------

## tomk

 *fabio-c wrote:*   

> I wonder if anyone knows when the ebuilds will actually be added to the tree?

 

I'm going to be maintaining FuzzyOCR and I've been keeping in touch with the guy who writes it and he's just released a new version. Once I get a new ebuild ready and I've had a look at the segfaulting problems (he's supplied me with images that will cause segfaults in gocr and giftext) I'll add it to the tree, especially now that the String::Approx perl module has been added to the tree.

Seeing as the patches needed to prevent the segfaults will have to be applied upstream or to the relevant ebuilds it may be a while before those patches are added, but I'll probably add the FuzzyOCR ebuild before then (with an appropriate warning).

I should hopefully get enough time to look at it this week, if you CC yourself on the bug then you'll know when it's sorted (and I'll post here when I've added the ebuild to the tree).

----------

## fabio-c

 *tomk wrote:*   

>  *fabio-c wrote:*   
> 
> > I wonder if anyone knows when the ebuilds will actually be added to the tree?
> 
> I'm going to be maintaining FuzzyOCR and I've been keeping in touch with the guy who writes it and he's just released a new version. Once I get a new ebuild ready and I've had a look at the segfaulting problems (he's supplied me with images that will cause segfaults in gocr and giftext) I'll add it to the tree, especially now that the String::Approx perl module has been added to the tree.
> 
> Seeing as the patches needed to prevent the segfaults will have to be applied upstream or to the relevant ebuilds it may be a while before those patches are added, but I'll probably add the FuzzyOCR ebuild before then (with an appropriate warning).
> ...

 

That would be great (adding it to the tree without the patches, I mean)! I'm willing to take the risk of a segfault for a few thousand fewer spams.

5 days of a FuzzyOCR enabled SpamAssassin's life:

```
[0:44]fabio@black[/home/fabio]# grep -c FUZZY_OCR /var/log/exim/exim_rejectlog*
/var/log/exim/exim_rejectlog:4
/var/log/exim/exim_rejectlog.0:514
/var/log/exim/exim_rejectlog.1:707
/var/log/exim/exim_rejectlog.2:794
/var/log/exim/exim_rejectlog.3:410
/var/log/exim/exim_rejectlog.4:376
```
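To get a single total instead of one count per log file, the `grep -c` output can be summed with awk. A small sketch; `total_hits` is a hypothetical helper name, and the log paths are whatever your MTA uses.

```shell
# Sum the per-file counts that "grep -c" prints as "file:count".
# /dev/null is added so grep always prefixes the file name, even when
# only one log file is given. total_hits is a made-up helper name.
total_hits() {
    pattern="$1"
    shift
    grep -c -- "$pattern" "$@" /dev/null |
        awk -F: '{ sum += $NF } END { print sum + 0 }'
}
```

For the logs above, `total_hits FUZZY_OCR /var/log/exim/exim_rejectlog*` would add up the six counts shown, which come to 2805.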

----------

## figueroa

 *tomk wrote:*   

> I'm going to be maintaining FuzzyOCR and I've been keeping in touch with the guy who writes it and he's just released a new version. Once I get a new ebuild ready and I've had a look at the segfaulting problems (he's supplied me with images that will cause segfaults in gocr and giftext) I'll add it to the tree, especially now that the String::Approx perl module has been added to the tree.
> 
> Seeing as the patches needed to prevent the segfaults will have to be applied upstream or to the relevant ebuilds it may be a while before those patches are added, but I'll probably add the FuzzyOCR ebuild before then (with an appropriate warning).
> 
> I should hopefully get enough time to look at it this week, if you CC yourself on the bug then you'll know when it's sorted (and I'll post here when I've added the ebuild to the tree).

 

I'm not sure what the status is with the FuzzyOCR ebuild, but I've just installed it on my home mailserver, and I'm in the process now of installing it on the school's mail server.  In my own limited testing of SpamAssassin w/ FuzzyOCR it nailed my test emails very adequately.  Didn't notice any warnings when installing FuzzyOCR other than a reminder to restart spamd.  So far, so good, and I'm happy!

(Added:  Oops!  My mistake.  There was/is a warning about FuzzyOCR segfaulting on some images, which brings me to a question.  If FuzzyOCR segfaults on an image, will the plugin continue to work for subsequent messages, or do I need to do something to re-jump-start it?)

----------

## fabio-c

 *figueroa wrote:*   

> If FuzzyOCR segfaults on an image, will the plugin continue to work for subsequent messages, or do I need to do something to re-jump-start it?

 

It just seems to kill the spamd child when that happens, so no problems there.

----------

## CptanPanic

Any updates?

----------

## CptanPanic

What does this mean? When I search for fuzzyocr in portage it shows up, but when I try to emerge it, it says there are no ebuilds for it:

```
mail mail-filter # ACCEPT_KEYWORDS="~amd64" emerge  -s fuzzyocr
Searching...
[ Results for search key : fuzzyocr ]
[ Applications found : 1 ]
*  mail-filter/spamassassin-fuzzyocr
      Latest version available: 2.3b
      Latest version installed: [ Not Installed ]
      Size of files: 74 kB
      Homepage:      http://fuzzyocr.own-hero.net/
      Description:   SpamAssassin plugin for performing Optical Character Recognition (OCR) on attached images
      License:       Apache-2.0

mail mail-filter # ACCEPT_KEYWORDS="~amd64" emerge  fuzzyocr
Calculating dependencies
emerge: there are no ebuilds to satisfy "fuzzyocr".
```

----------

## Caffeine

The package is called spamassassin-fuzzyocr, not fuzzyocr.  Try:

```
# ACCEPT_KEYWORDS="~amd64"  emerge spamassassin-fuzzyocr 
```
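If you'd rather not pass ACCEPT_KEYWORDS on every emerge, the keyword can be recorded per package instead. A minimal sketch, assuming your portage reads /etc/portage/package.keywords as a plain file (newer portage versions call it package.accept_keywords, and it may be a directory on some systems). `add_keyword` and `KEYWORDS_FILE` are names invented for this example.

```shell
# Append a keyword entry for one package unless it is already listed.
# add_keyword is a hypothetical helper; KEYWORDS_FILE is an assumed override
# for trying it out without touching /etc/portage.
add_keyword() {
    atom="$1"
    keyword="$2"
    file="${KEYWORDS_FILE:-/etc/portage/package.keywords}"
    entry="$atom $keyword"
    if [ -f "$file" ] && grep -qxF -- "$entry" "$file"; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$file"
}
```

After `add_keyword mail-filter/spamassassin-fuzzyocr "~amd64"` as root, a plain `emerge spamassassin-fuzzyocr` should pick up the testing version.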

----------

## CptanPanic

duh!!!!

Thanks.

----------

## serotonin

As for installing gocr, I found it a lot easier to just install the newest source, gocr-0.43.tar.gz, rather than emerging an older version with patches.

----------

## strugarevic

Is FuzzyOCR (from portage) ready for production? I haven't tried it yet, but I'm hoping to soon.

----------

## prymitive

 *strugarevic wrote:*   

> Is FuzzyOCR (from portage) ready for production? I haven't tried it yet, but I'm hoping to soon.

 

I use it on 2 boxes and it works without problems, it catches some additional spam.

----------

## strugarevic

I'm not very familiar with the ebuild. I suppose it has all the files/patches needed for fuzzyocr and its dependencies to work fine?

Thanks!

----------

## serotonin

http://fuzzyocr.own-hero.net/wiki/Downloads

I tried the ebuild and the manual install.  I prefer to use gentoo ebuilds to install the dependencies listed on the URL above, manually install the new source for gocr, and follow the install instructions on the URL above to install fuzzyocr manually.  If you download the new source for gocr, there is no need to patch.
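The manual gocr build can be sketched roughly as below. This assumes the tarball unpacks into a directory named after itself and uses the usual configure/make sequence; the /usr/local prefix, the `build_gocr` and `run` helpers, and the `DRY_RUN` switch are illustrative choices, not something taken from the gocr docs.

```shell
# Build gocr from an upstream tarball. With DRY_RUN=1 the commands are only
# printed, so the steps can be checked before anything is installed.
# run, build_gocr and DRY_RUN are illustrative names, and the
# configure/make steps are the usual autotools sequence, assumed here.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# The body runs in a subshell so the cd does not leak into the caller.
build_gocr() (
    tarball="${1:-gocr-0.43.tar.gz}"
    srcdir=$(basename "$tarball" .tar.gz)
    run tar xzf "$tarball" &&
    run cd "$srcdir" &&
    run ./configure --prefix=/usr/local &&
    run make &&
    run make install
)
```

Setting `DRY_RUN=1` before calling `build_gocr gocr-0.43.tar.gz` prints the five steps without running them. Keep in mind that anything installed this way is invisible to portage.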

As for the effectiveness of fuzzyocr, it's amazing.

----------

## Robert S

I installed this package on an x86 box using the ebuild at https://bugs.gentoo.org/show_bug.cgi?id=158445 and it works great.  When I try to run it on my "real" mailserver, which is amd64, I get this ugliness:

 *Quote:*   

> $ spamassassin --lint
> 
> [1305] warn: config: cannot open "/etc/mail/spamassassin/secrets.cf": Permission denied
> 
> Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/lib64/perl5/5.8.8/Exporter.pm line 65.
> ...

 

Looks like it can't find the modules in /etc/mail/spamassassin/FuzzyOcr???

I might try to manually download the package from the original FuzzyOcr website.  This worked best last time.

[EDIT]

Yes - the installation recommended at http://fuzzyocr.own-hero.net/wiki/Installation-3.5.x works fine with a few modifications.  The errors are now down to this:

 *Quote:*   

> $ spamassassin --lint
> 
> Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/lib64/perl5/5.8.8/Exporter.pm line 65.
> 
>  at /usr/lib64/perl5/5.8.8/x86_64-linux/POSIX.pm line 19
> ...

 

How can I get rid of these?  It seems to be impossible to stop it looking for ocrad without hacking the code.

[EDIT]

I installed ocrad.  It's a tiny package and it works fine, except for the --lint error message.

Last edited by Robert S on Thu Mar 08, 2007 8:27 am; edited 1 time in total
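No fix for the "Subroutine ... redefined" message turns up in this thread. As a stopgap, the known warning can be filtered out of the lint output so that any other problems stand out. This only hides the message, it does not address the cause. `lint_filter` is a made-up name, and the patterns assume the warning looks like the lines quoted above.

```shell
# Drop the "Subroutine ... redefined at ..." warning and its indented
# " at ... line N" continuation; pass every other line through unchanged.
# lint_filter is a hypothetical helper name.
lint_filter() {
    awk '
        /^Subroutine .* redefined at / { skip = 1; next }
        skip && /^ +at .* line [0-9]+/ { next }
        { skip = 0; print }
    '
}
```

Usage would be along the lines of `spamassassin --lint 2>&1 | lint_filter`.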

----------

## fusel

Because I wanted to install FuzzyOCR right now, I was looking for some info on why the package is masked. That's how I found your post.

Maybe you could get rid of the problem with it looking for ocrad by just emerging it:

```
emerge app-text/ocrad
```

Looks like FuzzyOCR is based on this. Just an idea.

[Edit:]

Mh, strange... I found this one on Chris' website:

 *Quote:*   

> The OCR program gocr will be invoked to get the final text from the PNM.

 

Any ideas? Btw, I was able to just install it with

```
gizmo mail # ACCEPT_KEYWORDS="~x86"  emerge spamassassin-fuzzyocr
```

without any errors on a P4HT machine.

[Edit2:]

After updating gocr to the latest masked version (0.43) as suggested by serotonin (thanks), I can confirm proper operation of fuzzyocr. GIFs are scanned and points are added to SpamAssassin's report:

```
Content analysis details:   (16.2 points, 7.5 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 HTML_TAG_EXIST_TBODY   BODY: HTML has "tbody" tag
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.0 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 2.0 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                             [85.XXX.XXX.XXX listed in dnsbl.sorbs.net]
 1.7 RCVD_IN_NJABL_DUL      RBL: NJABL: dialup sender did non-local SMTP
                             [85.XXX.XXX.XXX listed in combined.njabl.org]
 25 FUZZY_OCR               BODY: Mail contains an image with common spam text inside
                             Words found:
                             "discount" in 1 lines
                             "buy" in 1 lines
                             "sale" in 3 lines
                             "sell" in 2 lines
                             "cheap" in 1 lines
                             "cheapest" in 1 lines
                             "presciption" in 1 lines
                             "guarantee" in 1 lines
                             ...
                             (NN word occurrences found)
 -13 AWL                    AWL: From: address is in the auto white-list
```

Thanks for the tips

fusel

----------

