Commit Message | Date |
---|---|
filter out pages without text; these occur in Wikinews dumps | about 11 years ago |
decouple extract_pages from LXML | over 11 years ago |
improved markup stripping | over 11 years ago |
document parameter to extract_pages | almost 12 years ago |
update to XML dump format 0.6 + refactor extract_pages | almost 12 years ago |
implemented: parsing category dumps and page dumps, cleaning text | about 12 years ago |