A C D E F G H I L M N O P Q R S T U V W 

A

A_CONTENT_STATE_KEY - Static variable in interface is.landsbokasafn.deduplicator.DedupAttributeConstants
Key to use when getting the content state of a URI from the CrawlURI data.
afterPropertiesSet() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_ANALYZE_TIMESTAMP - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_EQUIVALENT - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_FILTER_MODE - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_JUMP_TO - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_MIME_FILTER - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_ORIGIN - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_ORIGIN_HANDLING - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_STATS_PER_HOST - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
ATTR_USE_SPARSE_RANGE_FILTER - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 

C

close() - Method in class is.landsbokasafn.deduplicator.CrawlDataIterator
Close any resources held open to read the crawl data.
close() - Method in class is.landsbokasafn.deduplicator.CrawlLogIterator
Closes the crawl.log file.
close(boolean) - Method in class is.landsbokasafn.deduplicator.DigestIndexer
Close the index.
CommandLineParser - Class in is.landsbokasafn.deduplicator
Print DigestIndexer command-line usage message.
CommandLineParser(String[], PrintWriter) - Constructor for class is.landsbokasafn.deduplicator.CommandLineParser
Constructor.
CommandLineParser.DigestHelpFormatter - Class in is.landsbokasafn.deduplicator
Overridden so that the usage output can be customized.
CommandLineParser.DigestHelpFormatter() - Constructor for class is.landsbokasafn.deduplicator.CommandLineParser.DigestHelpFormatter
 
CONTENT_CHANGED - Static variable in interface is.landsbokasafn.deduplicator.DedupAttributeConstants
URI content has changed between the two latest, successfully completed fetches.
CONTENT_UNCHANGED - Static variable in interface is.landsbokasafn.deduplicator.DedupAttributeConstants
URI content has not changed between the two latest, successfully completed fetches.
CONTENT_UNKNOWN - Static variable in interface is.landsbokasafn.deduplicator.DedupAttributeConstants
No knowledge of URI content.
contentDigest - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
CrawlDataItem - Class in is.landsbokasafn.deduplicator
A base class for individual items of crawl data that should be added to the index.
CrawlDataItem() - Constructor for class is.landsbokasafn.deduplicator.CrawlDataItem
Constructor.
CrawlDataItem(String, String, String, String, String, String, boolean, long) - Constructor for class is.landsbokasafn.deduplicator.CrawlDataItem
Constructor.
crawlDataItemFormat - Variable in class is.landsbokasafn.deduplicator.CrawlLogIterator
The date format specified by the CrawlDataItem for dates entered into it (and eventually into the index)
CrawlDataIterator - Class in is.landsbokasafn.deduplicator
An abstract base class for iterators that iterate over different sets of crawl data.
CrawlDataIterator(String) - Constructor for class is.landsbokasafn.deduplicator.CrawlDataIterator
Constructor.
crawlDateFormat - Variable in class is.landsbokasafn.deduplicator.CrawlLogIterator
The date format used in crawl.log files.
CrawlLogIterator - Class in is.landsbokasafn.deduplicator
An implementation of is.hi.bok.deduplicator.CrawlDataIterator capable of iterating over a Heritrix-style crawl.log.
CrawlLogIterator(String) - Constructor for class is.landsbokasafn.deduplicator.CrawlLogIterator
Create a new CrawlLogIterator that reads items from a Heritrix crawl.log

D

dateFormat - Static variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
DedupAttributeConstants - Interface in is.landsbokasafn.deduplicator
Lifted from H1 AdaptiveRevisitAttributeConstants and limited to what DeDuplicator was using.
DeDupFetchHTTP - Class in is.landsbokasafn.deduplicator
An extension of Heritrix's org.archive.crawler.fetcher.FetchHTTP processor for downloading HTTP documents.
DeDupFetchHTTP() - Constructor for class is.landsbokasafn.deduplicator.DeDupFetchHTTP
 
DeDuplicator - Class in is.landsbokasafn.deduplicator
Heritrix-compatible processor.
DeDuplicator() - Constructor for class is.landsbokasafn.deduplicator.DeDuplicator
 
DEFAULT_MIME_FILTER - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
DEFAULT_ORIGIN_HANDLING - Static variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
DigestIndexer - Class in is.landsbokasafn.deduplicator
A class for building a de-duplication index.
DigestIndexer(String, String, boolean, boolean, boolean, boolean) - Constructor for class is.landsbokasafn.deduplicator.DigestIndexer
Each instance of this class wraps one Lucene index for writing deduplication information to it.
doAnalysis(CrawlURI, Statistics, boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
doTimestampAnalysis(CrawlURI, Document, Statistics, boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
duplicate - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 

E

etag - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 

F

FIELD_DIGEST - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
The content digest as String
FIELD_ETAG - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
The document's etag
FIELD_ORIGIN - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
A field containing meta-data on where the original version of a document is stored.
FIELD_ORIGINAL_DATE - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
The date of the original payload capture.
FIELD_TIMESTAMP - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
The URL's timestamp (time of fetch).
FIELD_URL - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
The URL. This value is suitable for use in WARC revisit records as the WARC-Refers-To-Target-URI.
FIELD_URL_NORMALIZED - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
A stripped (normalized) version of the URL
finalTasks() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 

G

getAnalyzeTimestamp() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getBlacklist() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getCommandLine() - Method in class is.landsbokasafn.deduplicator.CommandLineParser
 
getCommandLineArguments() - Method in class is.landsbokasafn.deduplicator.CommandLineParser
 
getCommandLineOptions() - Method in class is.landsbokasafn.deduplicator.CommandLineParser
 
getContentDigest() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns the document's content digest.
getEtag() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns the etag that was associated with the document.
getIndexLocation() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getJumpTo() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getMatchingMethod() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getMimeFilter() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getMimeType() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns the mimetype that was associated with the document.
getOrigin() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns the "origin" that was associated with the document.
getOrigin() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getOriginHandling() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getPercentage(double, double) - Static method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getServerCache() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getSize() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Get the size of the CrawlDataItem.
getSourceType() - Method in class is.landsbokasafn.deduplicator.CrawlDataIterator
A short, human-readable string describing the source this iterator uses.
getSourceType() - Method in class is.landsbokasafn.deduplicator.CrawlLogIterator
 
getStatsPerHost() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getTimestamp() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns a timestamp for when the URL was fetched in the format: yyyyMMddHHmmssSSS
getTryEquivalent() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
getURL() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns the URL
getUseSparseRengeFilter() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 

H

hasNext() - Method in class is.landsbokasafn.deduplicator.CrawlDataIterator
Are there more elements?
hasNext() - Method in class is.landsbokasafn.deduplicator.CrawlLogIterator
Returns true if there are more items available.

I

in - Variable in class is.landsbokasafn.deduplicator.CrawlLogIterator
A reader for the crawl.log file being processed
innerProcess(CrawlURI) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
innerProcessResult(CrawlURI) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
is.landsbokasafn.deduplicator - package is.landsbokasafn.deduplicator
 
isDuplicate() - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Returns whether the CrawlDataItem was marked as duplicate.

L

lookupByDigest(CrawlURI, Statistics) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
Process a CrawlURI looking up in the index by content digest
lookupByURL - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
lookupByURL(CrawlURI, Statistics) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
Process a CrawlURI looking up in the index by URL

M

main(String[]) - Static method in class is.landsbokasafn.deduplicator.DigestIndexer
 
message(String, int) - Method in class is.landsbokasafn.deduplicator.CommandLineParser
Print message and then exit.
mimetype - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
MODE_BOTH - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
Both URL and hash are indexed
MODE_HASH - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
Index HASH enabling lookups by hash (content digest)
MODE_URL - Static variable in class is.landsbokasafn.deduplicator.DigestIndexer
Index URL enabling lookups by URL.
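
The three MODE_* constants above describe which lookups an index supports: by URL, by hash (content digest), or both. The following is a minimal sketch of that idea only. It uses plain in-memory maps instead of Lucene, and the class ToyDigestIndex, its Mode enum, and its methods are hypothetical names invented for illustration; they are not part of DigestIndexer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an index; NOT the library's DigestIndexer. */
public class ToyDigestIndex {
    /** Mirrors the idea of MODE_URL, MODE_HASH and MODE_BOTH (names are assumptions). */
    public enum Mode { URL, HASH, BOTH }

    private final Mode mode;
    private final Map<String, String> byUrl = new HashMap<>();    // URL -> digest
    private final Map<String, String> byDigest = new HashMap<>(); // digest -> URL

    public ToyDigestIndex(Mode mode) {
        this.mode = mode;
    }

    /** Stores the record only under the keys that the chosen mode indexes. */
    public void add(String url, String digest) {
        if (mode != Mode.HASH) {
            byUrl.put(url, digest);
        }
        if (mode != Mode.URL) {
            byDigest.put(digest, url);
        }
    }

    /** Returns the digest recorded for the URL, or null if URLs are not indexed. */
    public String lookupByUrl(String url) {
        return byUrl.get(url);
    }

    /** Returns the URL recorded for the digest, or null if digests are not indexed. */
    public String lookupByDigest(String digest) {
        return byDigest.get(digest);
    }
}
```

In this sketch an index built in URL mode cannot answer a digest lookup, which is the trade-off the mode constants express: indexing both keys supports both lookups at the cost of storing more.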

N

next() - Method in class is.landsbokasafn.deduplicator.CrawlDataIterator
Get the next CrawlDataItem.
next - Variable in class is.landsbokasafn.deduplicator.CrawlLogIterator
The next item to be issued (if ready) or null if the next item has not been prepared or there are no more elements
next() - Method in class is.landsbokasafn.deduplicator.CrawlLogIterator
Returns the next valid item from the crawl log.
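
The CrawlLogIterator entries in this index (the next variable, prepareNext(), hasNext(), next() and close()) suggest a look-ahead iterator: an item is read ahead and held until it is issued. The sketch below shows that general pattern on plain text lines. LookAheadLineIterator and readAll are hypothetical names, and treating blank lines as the "invalid" items to skip is an assumption made for illustration; the real class parses crawl.log lines into CrawlDataItem objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical look-ahead iterator over lines; NOT the library's CrawlLogIterator. */
public class LookAheadLineIterator implements AutoCloseable {
    private final BufferedReader in;
    // The next item to be issued, or null if it has not been prepared
    // or there are no more elements.
    private String next;

    public LookAheadLineIterator(BufferedReader in) {
        this.in = in;
    }

    /** Reads ahead until a valid line is found or the input is exhausted. */
    private void prepareNext() throws IOException {
        String line;
        while (next == null && (line = in.readLine()) != null) {
            if (!line.trim().isEmpty()) { // assumption: blank lines count as invalid
                next = line;
            }
        }
    }

    public boolean hasNext() throws IOException {
        if (next == null) {
            prepareNext();
        }
        return next != null;
    }

    public String next() throws IOException {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String item = next;
        next = null; // force a fresh read-ahead on the following call
        return item;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Usage helper: collects every valid line of the given text. */
    public static List<String> readAll(String text) {
        List<String> items = new ArrayList<>();
        try (LookAheadLineIterator it =
                new LookAheadLineIterator(new BufferedReader(new StringReader(text)))) {
            while (it.hasNext()) {
                items.add(it.next());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }
}
```

Because hasNext() does the read-ahead, calling it repeatedly without next() does not consume input, and next() can skip invalid entries without the caller noticing.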

O

origin - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
OriginHandling - Enum in is.landsbokasafn.deduplicator
 

P

parseLine(String) - Method in class is.landsbokasafn.deduplicator.CrawlLogIterator
Parse a line in the crawl log.
perHostStats - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
prepareNext() - Method in class is.landsbokasafn.deduplicator.CrawlLogIterator
Ready the next item.
printUsage(PrintWriter, int, String) - Method in class is.landsbokasafn.deduplicator.CommandLineParser.DigestHelpFormatter
 
printUsage(PrintWriter, int, String, Options) - Method in class is.landsbokasafn.deduplicator.CommandLineParser.DigestHelpFormatter
 

Q

queryField(String, String) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
Run a simple Lucene query for a single term in a single field.

R

report() - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 

S

searcher - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
serverCache - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
setAnalyzeTimestamp(boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setBlacklist(boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setContentDigest(String) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set the content digest
setDuplicate(boolean) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set whether duplicate or not.
setEtag(String) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set a new Etag
setIndexLocation(String) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setJumpTo(String) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setMatchingMethod(MatchingMethod) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setMimeFilter(String) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setMimeType(String) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set new MIME type.
setOrigin(String) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set new origin
setOrigin(String) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setOriginHandling(OriginHandling) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setServerCache(ServerCache) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setSize(long) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set the size of the CrawlDataItem
setStatsPerHost(boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setTimestamp(String) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set a new timestamp.
setTryEquivalent(boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
setURL(String) - Method in class is.landsbokasafn.deduplicator.CrawlDataItem
Set the URL
setUseSparseRengeFilter(boolean) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
shouldProcess(CrawlURI) - Method in class is.landsbokasafn.deduplicator.DeDuplicator
 
size - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
stats - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
statsPerHost - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
statusCode - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
stripURL(String) - Static method in class is.landsbokasafn.deduplicator.DigestIndexer
An aggressive URL normalizer.
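
Several CrawlDataItem entries in this index deal with fetch timestamps (getTimestamp(), setTimestamp(String)), and getTimestamp() is documented as using the format yyyyMMddHHmmssSSS. As a hedged illustration of that pattern, the sketch below formats and parses such strings with the standard java.text.SimpleDateFormat. The class TimestampFormatSketch and its methods are hypothetical, and the use of UTC is an assumption; this index does not state which time zone the library uses.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper for 17-digit timestamps; not part of the deduplicator API. */
public class TimestampFormatSketch {
    // Pattern taken from the getTimestamp() description in this index.
    static final String PATTERN = "yyyyMMddHHmmssSSS";

    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC
        format.setLenient(false); // reject out-of-range fields such as month 13
        return format;
    }

    /** Formats milliseconds since the epoch as a 17-digit timestamp. */
    public static String format(long epochMillis) {
        return newFormat().format(new Date(epochMillis));
    }

    /** Parses a 17-digit timestamp back to milliseconds since the epoch. */
    public static long parse(String timestamp) {
        try {
            return newFormat().parse(timestamp).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a " + PATTERN + " timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(0L)); // prints 19700101000000000
    }
}
```

Fixed-width, zero-padded timestamps of this kind sort chronologically when compared as plain strings, which makes them convenient as index field values.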

T

timestamp - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 

U

URL - Variable in class is.landsbokasafn.deduplicator.CrawlDataItem
 
usage() - Method in class is.landsbokasafn.deduplicator.CommandLineParser
Print usage then exit.
usage(int) - Method in class is.landsbokasafn.deduplicator.CommandLineParser
Print usage then exit.
usage(String, int) - Method in class is.landsbokasafn.deduplicator.CommandLineParser
Print message then usage then exit.
useOrigin - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 
useOriginFromIndex - Variable in class is.landsbokasafn.deduplicator.DeDuplicator
 

V

valueOf(String) - Static method in enum is.landsbokasafn.deduplicator.OriginHandling
Returns the enum constant of this type with the specified name.
values() - Static method in enum is.landsbokasafn.deduplicator.OriginHandling
Returns an array containing the constants of this enum type, in the order they are declared.

W

writeToIndex(CrawlDataIterator, String, boolean, String, boolean) - Method in class is.landsbokasafn.deduplicator.DigestIndexer
Writes the contents of a CrawlDataIterator to this index.
writeToIndex(CrawlDataIterator, String, boolean, String, boolean, boolean, long) - Method in class is.landsbokasafn.deduplicator.DigestIndexer
Writes the contents of a CrawlDataIterator to this index.

Copyright © 2014 National and University Library of Iceland. All Rights Reserved.