A simple CLI application that generates an HTML file listing the updates detected on a set of web pages.
Suppresses false detections caused by dynamic page content (e.g. ads) without relying on any blacklisted words.
Extracts only the updated parts of each page.
Reduces network traffic by sending requests with If-Modified-Since and Accept-Encoding: gzip headers, and by checking the Last-Modified and Content-Length headers before downloading content.
Retrieves N pages in parallel.
Depends only on Python. Cross-platform.
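The traffic-saving and parallel-retrieval features can be illustrated with the standard library alone. The following is a minimal sketch, not the tool's actual code: the function names (build_request, is_unchanged, decode_body, fetch_all), the default worker count, and the shape of the cached header values are all assumptions made for illustration.

```python
import gzip
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from email.utils import formatdate


def build_request(url, last_checked=None):
    """Build a GET request that accepts gzip and skips unchanged pages.

    `last_checked` is a POSIX timestamp of the previous check (hypothetical
    bookkeeping; how the real tool stores it is not documented here).
    """
    headers = {"Accept-Encoding": "gzip"}
    if last_checked is not None:
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = formatdate(last_checked, usegmt=True)
    return urllib.request.Request(url, headers=headers)


def is_unchanged(headers, cached):
    """Return True if Last-Modified / Content-Length match the cached values.

    Only headers present on both sides are compared; if none are
    comparable, the page is treated as possibly changed.
    """
    keys = ("Last-Modified", "Content-Length")
    seen = [k for k in keys
            if headers.get(k) is not None and cached.get(k) is not None]
    return bool(seen) and all(headers[k] == cached[k] for k in seen)


def decode_body(body, content_encoding):
    """Decompress the response body if the server answered with gzip."""
    return gzip.decompress(body) if content_encoding == "gzip" else body


def fetch_all(urls, fetch, workers=4):
    """Call `fetch(url)` for every URL using `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

A server that honours If-Modified-Since replies with status 304 and no body when the page is unchanged; urllib.request.urlopen reports that status as an HTTPError, so a real fetch function would need to catch it.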
Method
Flatten the HTML DOM tree into a sequence of paragraphs.
Apply a diff algorithm to detect inserted and deleted paragraphs.
Filter out irrelevant changes using a linear combination of the standard scores of (#[anchored text] / #[whole text]) and (log #[whole text]), computed per page, where "#[X]" means "the length of X".
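The three steps above can be sketched roughly as follows. This is an illustrative approximation, not the tool's implementation: the set of tags treated as paragraph boundaries, the use of difflib as the diff algorithm, the weights w_anchor and w_length, and all function names are assumptions. In particular, the negative weight on the anchor ratio encodes the guess that link-heavy changes (typical of ads and navigation) are less relevant. The sketch also makes no attempt to skip script or style content.

```python
import difflib
import math
from html.parser import HTMLParser
from statistics import mean, pstdev

# Assumption: these tags are treated as paragraph boundaries.
BLOCK_TAGS = {"p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class Flattener(HTMLParser):
    """Flatten a DOM tree into a list of (text, anchored_length) paragraphs."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._text = []
        self._anchored = 0
        self._anchor_depth = 0

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.paragraphs.append((text, self._anchored))
        self._text, self._anchored = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._anchor_depth:
            # Count text that appears inside <a> ... </a>.
            self._anchored += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def flatten(html):
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def inserted_paragraphs(old, new):
    """Return the paragraphs of `new` that the diff marks as inserted."""
    old_text = [text for text, _ in old]
    new_text = [text for text, _ in new]
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
    added = []
    for op, _, _, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            added.extend(new[j1:j2])
    return added


def features(paragraphs):
    """Return (anchored / whole, log whole) for one page's non-empty changes."""
    whole = sum(len(text) for text, _ in paragraphs)
    anchored = sum(a for _, a in paragraphs)
    return anchored / whole, math.log(whole)


def zscores(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def relevance(pages, w_anchor=-1.0, w_length=1.0):
    """Score each page by a linear combination of the two standard scores."""
    ratios, logs = zip(*(features(p) for p in pages))
    return [w_anchor * a + w_length * b
            for a, b in zip(zscores(ratios), zscores(logs))]
```

Pages whose score falls below some cut-off would then be dropped from the report; the cut-off itself is not specified in this document, so none is assumed here.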
Usage
Write the URIs to check, one per line, in the ~/.www-list file.
python /path/to/wwwChecker
A web browser starts automatically when the check finishes. If it does not, open ~/.www-check.html manually.
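For readers who want to script around these two files, the sketch below shows one way to read the URI list and open the report from Python. The function names are hypothetical, and skipping blank lines is an assumption; the tool's own parsing rules are not described here.

```python
import webbrowser
from pathlib import Path


def parse_uri_list(text):
    """Return the non-empty lines of a ~/.www-list style file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_uri_list(path="~/.www-list"):
    """Read the URI list from disk, expanding ~ to the home directory."""
    return parse_uri_list(Path(path).expanduser().read_text(encoding="utf-8"))


def open_report(path="~/.www-check.html"):
    """Open the generated report in the default web browser."""
    return webbrowser.open(Path(path).expanduser().resolve().as_uri())
```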