public class DigestIndexer extends Object
Indexing can be run from the command line (run with the --help parameter to print usage information) or embedded natively in other applications.
This class also defines string constants for the Lucene field names.
Modifier and Type | Field and Description |
---|---|
static String | FIELD_DIGEST: The content digest as a String. |
static String | FIELD_ETAG: The document's ETag. |
static String | FIELD_ORIGIN: A field containing metadata on where the original version of a document is stored. |
static String | FIELD_ORIGINAL_DATE: The date of the original payload capture. |
static String | FIELD_TIMESTAMP: The URL's timestamp (time of fetch). |
static String | FIELD_URL: The URL. This value is suitable for use in WARC revisit records as the WARC-Refers-To-Target-URI. |
static String | FIELD_URL_NORMALIZED: A stripped (normalized) version of the URL. |
static String | MODE_BOTH: Both URL and hash are indexed. |
static String | MODE_HASH: Index by hash, enabling lookups by hash (content digest). |
static String | MODE_URL: Index by URL, enabling lookups by URL. |
Constructor and Description |
---|
DigestIndexer(String indexLocation, String indexingMode, boolean includeNormalizedURL, boolean includeTimestamp, boolean includeEtag, boolean addToExistingIndex): Each instance of this class wraps one Lucene index for writing deduplication information to it. |
Modifier and Type | Method and Description |
---|---|
void | close(boolean optimize): Close the index. |
static void | main(String[] args) |
static String | stripURL(String url): An aggressive URL normalizer. |
long | writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose): Writes the contents of a CrawlDataIterator to this index. |
long | writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose, boolean skipDuplicates, long minSize): Writes the contents of a CrawlDataIterator to this index. |
public static final String FIELD_URL
public static final String FIELD_DIGEST
public static final String FIELD_TIMESTAMP
public static final String FIELD_ETAG
public static final String FIELD_URL_NORMALIZED
public static final String FIELD_ORIGIN
public static final String FIELD_ORIGINAL_DATE
public static final String MODE_URL
public static final String MODE_HASH
public static final String MODE_BOTH

public DigestIndexer(String indexLocation, String indexingMode, boolean includeNormalizedURL, boolean includeTimestamp, boolean includeEtag, boolean addToExistingIndex) throws IOException
Each instance of this class wraps one Lucene index for writing deduplication information to it.
indexLocation - The location of the index (path).
indexingMode - The indexing mode: MODE_URL, MODE_HASH or MODE_BOTH.
includeNormalizedURL - Whether a normalized version of the URL should be added to the index. See stripURL(String).
includeTimestamp - Whether a timestamp should be included in the index.
includeEtag - Whether an ETag should be included in the index.
addToExistingIndex - Whether an existing index is being opened. Setting this to false will cause any index at indexLocation to be overwritten.
IOException - If an error occurs opening the index.

public long writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose) throws IOException
Writes the contents of a CrawlDataIterator to this index.
This method may be invoked multiple times with different CrawlDataIterators until close(boolean) has been called.
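The interaction between the mimefilter and blacklist arguments of writeToIndex can be illustrated with a small stand-alone predicate. This is a minimal sketch, not the DigestIndexer implementation: the class MimeFilterSketch and its accept method are hypothetical names, and it assumes the regular expression must match the whole mimetype string, which the documentation does not state.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of DigestIndexer. */
final class MimeFilterSketch {

    /**
     * Decides whether an item with the given mimetype would be indexed.
     * Assumption: the filter is applied as a whole-string regex match.
     */
    static boolean accept(String mimeType, String mimefilter, boolean blacklist) {
        boolean matches = Pattern.matches(mimefilter, mimeType);
        // Blacklist: matching mimetypes are excluded.
        // Whitelist: only matching mimetypes are included.
        return blacklist ? !matches : matches;
    }

    public static void main(String[] args) {
        String filter = "^text/.*";
        System.out.println(accept("text/html", filter, true));   // blacklist mode
        System.out.println(accept("image/png", filter, true));   // blacklist mode
        System.out.println(accept("text/html", filter, false));  // whitelist mode
        System.out.println(accept("image/png", filter, false));  // whitelist mode
    }
}
```

With the filter ^text/.* in blacklist mode, text documents are left out and everything else is indexed; in whitelist mode the outcome is reversed.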
dataIt - The CrawlDataIterator that provides the data to index.
mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
verbose - If true then progress information will be sent to System.out.
IOException - If an error occurs writing the index.

public long writeToIndex(CrawlDataIterator dataIt, String mimefilter, boolean blacklist, String defaultOrigin, boolean verbose, boolean skipDuplicates, long minSize) throws IOException
Writes the contents of a CrawlDataIterator to this index.
This method may be invoked multiple times with different CrawlDataIterators until close(boolean) has been called.
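The two extra arguments of this overload, skipDuplicates and minSize, add further conditions on which items are indexed. The sketch below restates the documented rules as a stand-alone predicate. It is an illustration under assumptions, not the library's code: the class SizeAndDuplicateSketch, the accept method and the markedDuplicate flag are hypothetical names, and the order in which the checks run is a guess.

```java
/** Hypothetical helper for illustration only; not part of DigestIndexer. */
final class SizeAndDuplicateSketch {

    /** Size value that, per the documentation, marks an unknown size. */
    static final long UNKNOWN_SIZE = -1;

    /**
     * Decides whether an item passes the duplicate and size checks.
     *
     * @param size            document size, or -1 if unknown
     * @param markedDuplicate whether the item is marked as a duplicate
     * @param skipDuplicates  if true, items marked as duplicates are not added
     * @param minSize         minimum size; values <= 0 disable the size check
     */
    static boolean accept(long size, boolean markedDuplicate,
                          boolean skipDuplicates, long minSize) {
        if (skipDuplicates && markedDuplicate) {
            return false; // duplicates are skipped on request
        }
        boolean sizeCheckEnabled = minSize > 0;
        boolean sizeKnown = size != UNKNOWN_SIZE;
        if (sizeCheckEnabled && sizeKnown && size < minSize) {
            return false; // smaller than the minimum, so ignored
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept(10, false, false, 100));  // too small
        System.out.println(accept(-1, false, false, 100));  // unknown size is exempt
        System.out.println(accept(10, false, false, 0));    // size check disabled
        System.out.println(accept(500, true, true, 0));     // duplicate skipped
    }
}
```

An item of exactly minSize is accepted here, since only documents smaller than the minimum are described as ignored.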
dataIt - The CrawlDataIterator that provides the data to index.
mimefilter - A regular expression that is used as a filter on the mimetypes to include in the index.
blacklist - If true then the mimefilter is used as a blacklist for mimetypes. If false then the mimefilter is treated as a whitelist.
defaultOrigin - If an item is missing an origin, this default value will be assigned to it. Can be null if no default origin value should be assigned.
verbose - If true then progress information will be sent to System.out.
skipDuplicates - If true, URLs that are marked as duplicates are not added to the index.
minSize - The minimum size of documents added to the index. Documents smaller than this are ignored. Documents with unknown size (CrawlDataItem size set to -1) are not subject to this limit. A value less than or equal to zero disables this feature.
IOException - If an error occurs writing the index.

public void close(boolean optimize) throws IOException
Close the index.
optimize - If true then the index will be optimized before it is closed.
IOException - If an error occurs optimizing or closing the index.

public static String stripURL(String url)
An aggressive URL normalizer.
Example:
http://www.bok.hi.is/?lang=ice
would become
http://bok.hi.is
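A normalizer that reproduces this example could look roughly like the sketch below. It is not the DigestIndexer.stripURL implementation: the class StripUrlSketch and its strip method are hypothetical, and the three rules it applies (drop the query string, drop a leading "www." host label, drop trailing slashes) are assumptions inferred from the single example above. The real method may apply more or different rules.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; NOT DigestIndexer.stripURL. */
final class StripUrlSketch {

    // Assumption: only a leading "www." directly after the scheme is removed.
    private static final Pattern LEADING_WWW =
            Pattern.compile("^(https?://)www\\.", Pattern.CASE_INSENSITIVE);

    static String strip(String url) {
        String stripped = url;

        // Drop the query string, if any.
        int queryStart = stripped.indexOf('?');
        if (queryStart >= 0) {
            stripped = stripped.substring(0, queryStart);
        }

        // Drop a leading "www." host label, keeping the scheme.
        stripped = LEADING_WWW.matcher(stripped).replaceFirst("$1");

        // Drop trailing slashes.
        while (stripped.endsWith("/")) {
            stripped = stripped.substring(0, stripped.length() - 1);
        }
        return stripped;
    }

    public static void main(String[] args) {
        // prints "http://bok.hi.is"
        System.out.println(strip("http://www.bok.hi.is/?lang=ice"));
    }
}
```

Because normalization of this kind discards information, two distinct URLs can map to the same stripped value; that is presumably why the class documents the normalized URL as a separate field, FIELD_URL_NORMALIZED, alongside FIELD_URL.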
url - The URL to strip.

Copyright © 2014 National and University Library of Iceland. All Rights Reserved.