DigestIndexer (DeDuplicator3 (Heritrix 3 add-on module) 3.0.0 API)

java.lang.Object
- is.landsbokasafn.deduplicator.DigestIndexer

```
public class DigestIndexer
extends Object
```
A class for building a de-duplication index.
Indexing can be done from the command line (run with the --help parameter to print usage information) or by embedding this class in other applications.
This class also defines string constants for the Lucene field names.

Author:

Kristinn Sigurðsson

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`FIELD_DIGEST` The content digest as String
`static String`	`FIELD_ETAG` The document's etag
`static String`	`FIELD_ORIGIN` A field containing meta-data on where the original version of a document is stored.
`static String`	`FIELD_ORIGINAL_DATE` The date of the original payload capture.
`static String`	`FIELD_TIMESTAMP` The URL's timestamp (time of fetch).
`static String`	`FIELD_URL` The URL. This value is suitable for use in warc/revisit records as the WARC-Refers-To-Target-URI.
`static String`	`FIELD_URL_NORMALIZED` A stripped (normalized) version of the URL
`static String`	`MODE_BOTH` Both URL and hash are indexed
`static String`	`MODE_HASH` Index HASH enabling lookups by hash (content digest)
`static String`	`MODE_URL` Index URL enabling lookups by URL.

Constructor Summary

Constructors
Constructor and Description
`DigestIndexer(String indexLocation, String indexingMode, boolean includeNormalizedURL, boolean includeTimestamp, boolean includeEtag, boolean addToExistingIndex)` Each instance of this class wraps one Lucene index for writing deduplication information to it.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close(boolean optimize)` Close the index.
`static void`	`main(String[] args)`
`static String`	`stripURL(String url)` An aggressive URL normalizer.
`long`	`writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose)` Writes the contents of a `CrawlDataIterator` to this index.
`long`	`writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose, boolean skipDuplicates, long minSize)` Writes the contents of a `CrawlDataIterator` to this index.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - FIELD_URL
```
public static final String FIELD_URL
```
    The URL. This value is suitable for use in warc/revisit records as the WARC-Refers-To-Target-URI.
    
    See Also:
    Constant Field Values
  - FIELD_DIGEST
```
public static final String FIELD_DIGEST
```
    The content digest as String
    
    See Also:
    Constant Field Values
  - FIELD_TIMESTAMP
```
public static final String FIELD_TIMESTAMP
```
    The URL's timestamp (time of fetch). The exact nature of this time may vary slightly depending on the source (e.g. crawl.log and ARCs contain slightly different times, but both indicate roughly when the document was obtained). The time is encoded as a String with the Java date format yyyyMMddHHmmssSSS.
    
    See Also:
    Constant Field Values
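    The timestamp encoding described above can be illustrated with a small, self-contained sketch. The class and method names (`TimestampSketch`, `format`, `parse`) are hypothetical and not part of the DeDuplicator API, and the use of UTC is an assumption; the source only specifies the pattern yyyyMMddHHmmssSSS.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper, not part of DeDuplicator: encodes and decodes
// timestamps using the documented pattern yyyyMMddHHmmssSSS.
final class TimestampSketch {
    static final String PATTERN = "yyyyMMddHHmmssSSS";

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat f = new SimpleDateFormat(PATTERN);
        f.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: time zone is not stated in the source
        f.setLenient(false);
        return f;
    }

    // Encode a Date as a 17-digit timestamp string.
    static String format(Date date) {
        return newFormat().format(date);
    }

    // Decode a 17-digit timestamp string back into a Date.
    static Date parse(String timestamp) {
        try {
            return newFormat().parse(timestamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a " + PATTERN + " timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        String encoded = format(new Date());
        System.out.println(encoded + " -> " + parse(encoded));
    }
}
```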
  - FIELD_ETAG
```
public static final String FIELD_ETAG
```
    The document's etag
    
    See Also:
    Constant Field Values
  - FIELD_URL_NORMALIZED
```
public static final String FIELD_URL_NORMALIZED
```
    A stripped (normalized) version of the URL
    
    See Also:
    Constant Field Values
  - FIELD_ORIGIN
```
public static final String FIELD_ORIGIN
```
    A field containing meta-data on where the original version of a document is stored.
    
    See Also:
    Constant Field Values
  - FIELD_ORIGINAL_DATE
```
public static final String FIELD_ORIGINAL_DATE
```
    The date of the original payload capture. Suitable for WARC-Refers-To-Date in warc/revisit records
    
    See Also:
    Constant Field Values
  - MODE_URL
```
public static final String MODE_URL
```
    Index URL enabling lookups by URL. If normalized URLs are included in the index they will also be indexed and searchable.
    
    See Also:
    Constant Field Values
  - MODE_HASH
```
public static final String MODE_HASH
```
    Index HASH enabling lookups by hash (content digest)
    
    See Also:
    Constant Field Values
  - MODE_BOTH
```
public static final String MODE_BOTH
```
    Both URL and hash are indexed
    
    See Also:
    Constant Field Values
- Constructor Detail
  - DigestIndexer
```
public DigestIndexer(String indexLocation,
             String indexingMode,
             boolean includeNormalizedURL,
             boolean includeTimestamp,
             boolean includeEtag,
             boolean addToExistingIndex)
              throws IOException
```
    Each instance of this class wraps one Lucene index for writing deduplication information to it.
    
    Parameters:
    indexLocation - The location of the index (path).
    indexingMode - Index MODE_URL, MODE_HASH or MODE_BOTH.
    includeNormalizedURL - Whether a normalized version of the URL should be added to the index. See stripURL(String).
    includeTimestamp - Whether a timestamp should be included in the index.
    includeEtag - Whether an ETag should be included in the index.
    addToExistingIndex - Whether to add to an existing index. Setting this to false will cause any index at indexLocation to be overwritten.
    
    Throws:
    
    IOException - If an error occurs opening the index.
- Method Detail
  - writeToIndex
```
public long writeToIndex(CrawlDataIterator dataIt,
                String mimefilter,
                boolean blacklist,
                String defaultOrigin,
                boolean verbose)
                  throws IOException
```
    Writes the contents of a CrawlDataIterator to this index.
    This method may be invoked multiple times with different CrawlDataIterators until close(boolean) has been called.
    
    Parameters:
    dataIt - The CrawlDataIterator that provides the data to index.
    mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
    blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
    defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
    verbose - If true then progress information will be sent to System.out.
    
    Returns:
    The number of items added to the index.
    
    Throws:
    
    IOException - If an error occurs writing the index.
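    The interaction between mimefilter and blacklist can be sketched as a single predicate. This is an illustration only: `MimeFilterSketch` and `accept` are hypothetical names, and the assumption that the regular expression must match the whole MIME type (rather than a substring) is not confirmed by the text above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the mimefilter/blacklist semantics;
// not the DeDuplicator implementation.
final class MimeFilterSketch {

    // Returns true if an item with the given MIME type would be indexed.
    // Assumption: the filter must match the entire MIME type string.
    static boolean accept(String mimetype, String mimefilter, boolean blacklist) {
        boolean matches = Pattern.matches(mimefilter, mimetype);
        // Blacklist: matching types are rejected. Whitelist: only matching types pass.
        return blacklist ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*";
        System.out.println("blacklist, text/html: " + accept("text/html", filter, true));
        System.out.println("blacklist, image/png: " + accept("image/png", filter, true));
        System.out.println("whitelist, text/html: " + accept("text/html", filter, false));
        System.out.println("whitelist, image/png: " + accept("image/png", filter, false));
    }
}
```

    With a filter such as `^text/.*`, blacklist mode would index everything except text documents, while whitelist mode would index text documents only.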
  - writeToIndex
```
public long writeToIndex(CrawlDataIterator dataIt,
                String mimefilter,
                boolean blacklist,
                String defaultOrigin,
                boolean verbose,
                boolean skipDuplicates,
                long minSize)
                  throws IOException
```
    Writes the contents of a CrawlDataIterator to this index.
    This method may be invoked multiple times with different CrawlDataIterators until close(boolean) has been called.
    
    Parameters:
    dataIt - The CrawlDataIterator that provides the data to index.
    mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
    blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
    defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
    verbose - If true then progress information will be sent to System.out.
    skipDuplicates - If true, URLs that are marked as duplicates are not added to the index.
    minSize - The minimum size of documents added to the index. Documents smaller than this are ignored. Documents with unknown size (CrawlDataItem size set to -1) are not subject to this limit. A value less than or equal to zero disables this feature.
    
    Returns:
    The number of items added to the index.
    
    Throws:
    
    IOException - If an error occurs writing the index.
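    The skipDuplicates and minSize rules described above can be summarized in a short sketch. `SizeFilterSketch` and `accept` are hypothetical names, and the order in which the two checks are applied is an assumption; only the rules themselves come from the parameter descriptions.

```java
// Hypothetical illustration of the skipDuplicates/minSize rules;
// not the DeDuplicator implementation.
final class SizeFilterSketch {

    // size: document size in bytes, or -1 if unknown.
    // duplicate: whether the item is marked as a duplicate.
    static boolean accept(long size, boolean duplicate, boolean skipDuplicates, long minSize) {
        if (skipDuplicates && duplicate) {
            return false; // duplicates are left out when requested
        }
        boolean limitEnabled = minSize > 0;  // a value <= 0 disables the size limit
        boolean sizeKnown = size != -1;      // unknown sizes are exempt from the limit
        if (limitEnabled && sizeKnown && size < minSize) {
            return false; // too small to index
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("500 bytes, minSize 1000: " + accept(500, false, false, 1000));
        System.out.println("unknown size, minSize 1000: " + accept(-1, false, false, 1000));
        System.out.println("500 bytes, limit disabled: " + accept(500, false, false, 0));
    }
}
```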
  - close
```
public void close(boolean optimize)
           throws IOException
```
    Close the index.
    
    Parameters:
    optimize - If true then the index will be optimized before it is closed.
    
    Throws:
    
    IOException - If an error occurs optimizing or closing the index.
  - stripURL
```
public static String stripURL(String url)
```
    An aggressive URL normalizer. This method removes any www[0-9]. segments from a URL, along with any trailing slashes and all parameters.
    Example: http://www.bok.hi.is/?lang=ice would become http://bok.hi.is
    
    Parameters:
    url - The url to strip
    
    Returns:
    A normalized URL.
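    The normalization described above can be approximated with three regular-expression replacements. This is a sketch based only on the description and the example; `StripUrlSketch` is a hypothetical class, the exact patterns are assumptions, and the real method may differ in detail (for instance in how many digits may follow `www`).

```java
// Hypothetical re-creation of the documented behavior of stripURL;
// not the DeDuplicator implementation.
final class StripUrlSketch {

    static String stripURL(String url) {
        String stripped = url.replaceAll("www[0-9]*\\.", ""); // drop www., www2. and similar segments
        stripped = stripped.replaceAll("\\?.*$", "");         // drop the query string (all parameters)
        stripped = stripped.replaceAll("/+$", "");            // drop trailing slashes
        return stripped;
    }

    public static void main(String[] args) {
        // prints "http://bok.hi.is"
        System.out.println(stripURL("http://www.bok.hi.is/?lang=ice"));
    }
}
```

    Because the normalization discards parameters, distinct pages that differ only in their query string collapse to the same normalized URL, which is why the method is described as aggressive.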
  - main
```
public static void main(String[] args)
                 throws Exception
```
    Throws:
    
    Exception
