Commits : Listings

Analyzed about 8 hours ago, based on code collected about 12 hours ago.
Commit Message | Date
Use nutch:tstamp as date in wera Re: [Archive-access-discuss] Nutchwax0.10 and WERA0.4.2: Date field missing from documentLocator->Resultset | over 16 years ago
*** empty log message *** | about 17 years ago
* src/java/org/archive/access/nutch/ImportArcs.java Refactor reporter. Add new methods that will report if a long time has elapsed since last report, otherwise, will stay silent. Also fix a bug where we didn't always log the ARC name just opened if happened in same millisecond as Reporter construction. Made an ImportArcReporter out of an anonymous Reporter. | about 17 years ago
BUGFIX: 1639135 [wayback] ArcProxy does not close connections If output stream to machine running HTTP11ResourceStore closes, as is the usual case, when reading a range from an ARC file, and the end of record is encountered, the inputstream from the ARC source was not being closed. On installations where the HTTP11ResourceStore and the ArcProxy are running on the same host, the connections time out more quickly, but in distributed installations, this was quickly becoming a problem. | about 17 years ago
* project.xml Move version past release. | about 17 years ago
* src/articles/releasenotes.xml Add in 0.10.0 bugs. | over 17 years ago
* bin/importArcsLogReporter.py * bin/nutchwaxLogReporter.py * bin/util.py These don't currently work. Remove for now. | over 17 years ago
* xdocs/index.xml Add link to release notes. | over 17 years ago
Readying for 0.10.0 release. * conf/hadoop-site.template.xml Edit. * conf/wax-default.xml Add to wax.index.all description. * src/articles/releasenotes.xml Ready release notes for 0.10.0 release. * src/java/overview.html Edit to match 0.10.0. * xdocs/index.xml News of 0.10.0. * project.xml Set version to 0.10.0. | over 17 years ago
Implement '[ 1288990 ] Configurable collection name in search.jsp': * conf/hadoop-site.xml.template Add note on how to override collection from search result. * conf/wax-default.xml (wax.host): Added. * src/plugin/parse-waxext/src/java/org/apache/nutch/parse/ext/WaxExtParser.java Remove debugging statement. * src/web/search.jsp If path on wax.host, don't add archiveCollection to result URL path. | over 17 years ago
Implement '[ 1503045 ] PDFs have URL for title' * src/plugin/parse-waxext/src/java/org/apache/nutch/parse/ext/WaxExtParser.java Added looking for title at head of returned text from xpdf. (main): Added. | over 17 years ago
* src/articles/releasenotes.xml Add TODO. | over 17 years ago
Trying pdfinfo getting title from pdf. * src/plugin/parse-waxext/bin/parse-pdf.sh Get pdf metainfo. * src/plugin/parse-waxext/src/java/org/apache/nutch/parse/ext/WaxExtParser.java Try parsing a title from returned parse stream. | over 17 years ago
removed -- replaced by index-client | over 17 years ago
* src/java/org/archive/access/nutch/Nutchwax.java Fix classcastexception. * src/java/org/archive/access/nutch/NutchwaxCrawlDb.java Output the list of segment directories we're using to update. | over 17 years ago
* src/java/org/archive/access/nutch/ImportArcs.java Add set status every 5 minutes if reading is taking a long time (We seem to be blocking on S3). | over 17 years ago
Part of [ 1632531 ] [nutchwax] Use parse-pdf in place of xpdf * conf/wax-default.xml * conf/wax-parse-plugins.xml Use parse-pdf in place of parse-waxext. It finds the title which is an advantage (rather than use the URL) and it doesn't spawn an external process. Downsides are it seems to take longer to complete parse and it used to hang. Lets try it for a while to see if it works (Max Schoeffman tried it and its working for him). | over 17 years ago
Apply '[ 1636313 ] [nutchwax-wayback] If exact date passed, use it' Contributed by Max Schoeffman. Reviewed by St.Ack * src/java/org/archive/wayback/resourceindex/NutchResourceIndex.java From Max: I looked a bit closer at waybacks ReplayFilter today and noticed that the current NutchResourceIndex (as a result of my last patch) doesn't behave absolutely correct. The ReplayFilter sets the date supplied with the URL as EXACT_DATE and the current timestamp as END_DATE. The NutchResourceIndex now constructs the date range for nutch from START_DATE to END_DATE, which seems wrong as this will always just return a version between 1996 and "today". The attached patch changes this to give precedence to EXACT_DATE over END_DATE if the former is specified. | over 17 years ago
TWEAK: Upped revision to 0.9.0 | over 17 years ago
* xdocs/user_manual.xml Point explicitly at the nutchwax-wayback bridge doc. | over 17 years ago
RELEASE: 0.8.0 | over 17 years ago
TWEAK: removed link to status line. | over 17 years ago
TWEAK: changes in this derived file reflect changes in src/config/*.xml | over 17 years ago
FEATURE: added two new regex removals for .NET session IDs embedded in URLs. | over 17 years ago
FEATURE: added bdb-client and bin-search to command line tool exports | over 17 years ago
TWEAK: changed PipelineFilter name to RemoteSubmitFilter. changed filter path from /pipeline/ to /index-incoming/ -- this filter no longer provides status on the pipeline and configuration. | over 17 years ago
BUGFIX: proxy.redirect needs to be an absolute URL -- otherwise it ends up being relative to the server being viewed. | over 17 years ago
TWEAK: changed default ArcProxy context from locationdb to arc-proxy, and filter directory to proxied arcs from /arc-proxy/ to /arcs/. | over 17 years ago
TWEAK: changed filter export from /arc-proxy/ to /arcs/ | over 17 years ago
TWEAK: whitespace and documentation | over 17 years ago