Inactive
Analyzed about 18 hours ago, based on code collected about 22 hours ago.

Project Summary

An efficient tool for processing (mainly) ARC files produced by Heritrix. It combines an HTML stripper, an entity converter, a boilerplate remover, and a language filter, and converts all text to UTF-8 using ICU. It also includes a conservative w-shingling implementation for near-duplicate detection. Multi-threading is used for parallelization on a single machine.
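
For readers unfamiliar with w-shingling, the sketch below illustrates the general technique: each document is reduced to its set of overlapping runs of w tokens, and two documents are compared by the Jaccard coefficient of those sets. This is not texrex's code. The function names `shingles` and `resemblance`, whitespace tokenization, and the default `w=5` are all assumptions made for illustration, and the summary does not say what makes texrex's variant "conservative".

```python
def shingles(text, w=5):
    """Return the set of w-token shingles for a text.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption, not texrex's actual tokenizer.
    """
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=5):
    """Jaccard coefficient of the two shingle sets, in [0.0, 1.0]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        # Two empty documents are considered identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over the sleepy dog"
    print(resemblance(doc1, doc2, w=3))
```

Documents whose resemblance exceeds a chosen threshold would be flagged as near-duplicates. Comparing full shingle sets pairwise is expensive at corpus scale, so practical systems typically compare compact sketches of the sets instead.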

In a Nutshell, texrex...

GNU General Public License v3.0 or later

Permitted

  • Commercial Use
  • Modify
  • Distribute
  • Place Warranty
  • Use Patent Claims

Forbidden

  • Sub-License
  • Hold Liable

Required

  • Distribute Original
  • Disclose Source
  • Include Copyright
  • State Changes
  • Include License
  • Include Install Instructions

These details are provided for information only. Nothing here is legal advice, and it should not be used as such.

This project has no vulnerabilities reported against it.

30 Day Summary

Mar 25 2024 — Apr 24 2024

12 Month Summary

Apr 24 2023 — Apr 24 2024
