DeDuplicator Overview
Getting started
Building an index
- A functional installation of Heritrix is required for this
software to work. While Heritrix can be deployed on non-Linux
operating systems, doing so requires some extra work because the
bundled scripts are written for Linux. The same applies to this
software, and the following instructions assume that Heritrix is
installed on a Linux machine under $HERITRIX_HOME.
- Install the DeDuplicator software. The JAR files should be placed
in $HERITRIX_HOME/lib/ while the dedupdigest script should be added
to $HERITRIX_HOME/bin/. If you've downloaded a .tar.gz bundle,
extracting it into $HERITRIX_HOME deploys all the files to the
correct locations, as sketched below. NOTE: Heritrix must not be
running while the DeDuplicator software is being run.
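For example, assuming the bundle is named deduplicator-[version].tar.gz
and was downloaded to your home directory (the file name and location
here are only illustrative), deployment might look like:

    # extract the bundle in place; its lib/ and bin/ entries land in
    # the matching Heritrix directories
    cd $HERITRIX_HOME
    tar xzf ~/deduplicator-[version].tar.gz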
- Make the dedupdigest script executable with
    chmod u+x $HERITRIX_HOME/bin/dedupdigest
- Run
    $HERITRIX_HOME/bin/dedupdigest --help
This will display the usage information for the indexing.
The program takes two arguments: the source data (usually a
crawl.log) and the target directory where the index will be written
(it will be created if not present). Several options are provided to
tailor the type of index.
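Based on the description above, the general shape of an indexing
invocation is as follows (the exact options come from --help; the
paths shown are only placeholders):

    # dedupdigest [options] <source> <target>
    $HERITRIX_HOME/bin/dedupdigest [options] /path/to/crawl.log /path/to/index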
- Create an index. A typical index can be built with
    $HERITRIX_HOME/bin/dedupdigest -o URL -s -t <location of crawl.log> <index output directory>
This will create an index that is indexed by URL only (not by the
content digest) and includes equivalent URLs and timestamps.
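Concretely, if the crawl log of an earlier job were located at
/heritrix/jobs/weekly/logs/crawl.log (both paths below are only
illustrative), the same command might read:

    $HERITRIX_HOME/bin/dedupdigest -o URL -s -t \
        /heritrix/jobs/weekly/logs/crawl.log \
        /heritrix/indexes/weekly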
Using the index
- Having built an appropriate index, launch Heritrix. Make sure that
the Heritrix installation you launch contains the two JARs that come
with the DeDuplicator (deduplicator-[version].jar and
lucene-[version].jar), especially if it is not the same installation
that was used to create the index.
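If the crawling installation is a separate one, copying the JARs over
might look like the following (the $INDEXING_HERITRIX variable and the
[version] placeholders are illustrative; use your actual paths and
version numbers):

    # copy the DeDuplicator and Lucene JARs into the crawling installation
    cp $INDEXING_HERITRIX/lib/deduplicator-[version].jar \
       $INDEXING_HERITRIX/lib/lucene-[version].jar \
       $HERITRIX_HOME/lib/
    # verify that both JARs are present before launching Heritrix
    ls $HERITRIX_HOME/lib/deduplicator-*.jar $HERITRIX_HOME/lib/lucene-*.jar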
- Configure a crawl job as normal, except add the DeDuplicator
processor to the processing chain at some point after the
HTTPFetcher processor and before any processor that should be
skipped when a duplicate is detected.
When the DeDuplicator finds a duplicate, processing moves straight
to the PostProcessing chain. So if you insert it at the top of the
Extractor chain, you can skip both link extraction and writing to
disk. If you do not wish to skip link extraction, you can insert the
processor at the end of the link extraction chain instead, and so on.
- The DeDuplicator processor has several configurable parameters.
  - enabled: Standard Heritrix property for processors. Should be
    true. Setting it to false disables the processor.
  - index-location: The most important setting. A full path to the
    directory that contains the index (the output directory of the
    indexing).
  - matching-method: Whether to look up URLs or content digests
    first when looking for matches. This setting depends on how the
    index was built (indexing mode). If the index was built with
    BOTH, either setting will work; otherwise it must be set
    according to the indexing mode.
  - try-equivalent: Whether equivalent URLs should be tried if an
    exact URL and content digest match is not found. Using equivalent
    matches means that duplicate documents whose URLs differ only in
    the parameter list or in a www[0-9]* prefix are still detected.
  - mime-filter: Controls which documents (by MIME type) are
    processed.
  - filter-mode: Controls how the mime-filter is applied.
  - analysis-mode: Enables analysis of the usefulness and accuracy
    of header information in predicting change and non-change in
    documents. For statistics-gathering purposes only.
  - log-level: Controls how much the processor logs.
  - stats-per-host: Maintains statistics per host in addition to
    the crawl-wide statistics.
- Once the processor has been configured, the crawl can be started
and run normally. Information about the processor is available via
the Processor report in the Heritrix GUI (this is saved to
processors-report.txt at the end of a crawl).
Duplicate URLs will still show up in the crawl log, but with the
note 'duplicate' in the annotation field at the end of the log line.
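As a rough check after (or during) a crawl, the duplicate annotations
can be counted directly from the crawl log. The log path below is only
illustrative, and the count is approximate since it matches any line
containing the word:

    # approximate count of log lines annotated as duplicates
    grep -c duplicate /path/to/job/logs/crawl.log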