# Looking for a webcrawler

## dE_logics

I'm looking for a crawler for simple but long-running tasks, e.g. downloading only certain MIME types from a website (images such as jpg, gif, png, tiff, or documents such as od*, doc, ppt). It should have the following features -

1) Maximum depth (both internal and external, i.e. for links pointing to other websites).

2) Most important and rarest -- a stop/resume feature. By stopping I mean even rebooting the box.

Lastly -- it shouldn't be very hard to use.

I've tried a few -

httrack -- the stop/resume feature is broken, its behavior is unpredictable, and it's too slow.

heritrix -- too difficult to use, about as hard as mastering Apache. Stop/resume does work, but resuming a job requires making a copy of the cancelled one (often several GB in size), which is cumbersome.

I know the wiki article; what I'm after is people's opinions.

----------

## avx

Well, there's pavuk (in portage), the best I've found so far, but it isn't that easy to use - from the screenshots you should be able to tell whether it's too much for your needs.

----------

## keenblade

The latest pavuk is 0.9.35, from 2007-02-21. It seems dead to me. Portage has 0.9.34-r2; still, it may work.

I didn't know about heritrix, but it's not in portage. Is there an overlay for it?

Anyway, I couldn't find anything better than httrack.

Maybe using wget with some custom scripts will do the job.
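The wget route can be sketched roughly like this; the URL, depth, and extension list are placeholders, and `--mirror` turns on timestamping, so re-running the same command after a stop or reboot should skip files that are already complete:

```shell
# Sketch: recursively fetch only the wanted extensions, up to a fixed depth.
# --mirror implies recursion plus timestamping (-N); --level overrides the
# depth; --no-parent keeps the crawl from wandering upward.
# The URL, depth, and extension list below are placeholders -- adjust them.
mirror_types() {
    wget --mirror --level=5 --no-parent \
         --accept 'jpg,jpeg,gif,png,tiff,odt,odg,doc,ppt' \
         --wait=1 \
         "$1"
}

# usage: mirror_types http://example.com/gallery/
```

This won't limit external-link depth separately the way the original request asks for, but for a single site it covers the MIME filtering and the resume-after-reboot part.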

I'm looking forward to seeing suggestions as well. Very interesting topic.

----------

## avx

Sure, pavuk is pretty much dead, but the inner workings of webpages haven't changed that much since then - though you might need to write a script to extract the real URLs from pages with AJAX/modern stuff.

Offline Explorer is a good Win32 tool which runs quite OK in WINE, though it's pricey - but maybe a Win32 solution would also be OK?

----------

## dE_logics

Yeah, I'm trying that out. Unfortunately starting the GUI gives a segfault... so I'm doing it the hard way. Then again, I was always doing it the hard way with httrack too.

I'll be glad if there are more new, similar projects.

Thanks.

----------

## dE_logics

pavuk's resume does not work well... I don't know why, but it just finishes downloading the partially downloaded files and then stops.

----------

## dE_logics

On second thought... no, it works, and it's very fast.

----------

## dE_logics

Well, on third thought, it does run fast but hits too many accidents that end in a segfault.

----------

## dE_logics

Or, at least, do we have something that extracts all links of certain MIME types (e.g. odt, odg, jpg, pnm, etc.) and puts them in a file?

That way the html/php-only crawl would be very fast; it would complete in hours, and then I could use wget to download the actual files.
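That extraction step can be sketched with plain grep/sed over already-fetched HTML; a tiny sample page stands in for the real mirror here, and the `mirror/` path and extension list are placeholders:

```shell
# Sketch: collect links with the wanted extensions from mirrored HTML into
# a file, one URL per line, which wget can later consume with -i.
# A tiny sample page stands in for the real mirrored pages.
mkdir -p mirror
cat > mirror/index.html <<'EOF'
<a href="notes.odt">notes</a>
<a href="pics/photo.jpg">photo</a>
<a href="about.html">about</a>
EOF

grep -RhoE 'href="[^"]*\.(odt|odg|jpg|pnm)"' mirror/ \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sort -u > links.txt

cat links.txt    # notes.odt and pics/photo.jpg, but not about.html

# afterwards, downloading the actual files resumably:
# wget --continue --input-file=links.txt
```

This only catches plain `href="..."` attributes with relative or absolute URLs as written; links built by JavaScript would still need something smarter.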

----------

