wwwChecker
Download
What is this?
Web update checker.
Features
- Simple CLI application that generates an HTML file listing update information.
- Suppresses false detections caused by dynamic page content (e.g. ads) without relying on a blacklist of words.
- Extracts the updated parts of the content.
- Reduces network traffic by sending requests with If-Modified-Since and Accept-Encoding: gzip headers, and by checking the Last-Modified and Content-Length headers before downloading content.
- Retrieves pages in parallel (N at a time).
- Depends only on Python. Cross-platform.
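The traffic-saving idea in the list above can be sketched with the Python standard library. This is only an illustration under stated assumptions, not wwwChecker's actual code: the function names and the `cached` dictionary (header values remembered from the previous run) are hypothetical, and treating a page as unchanged only when both Last-Modified and Content-Length match is a conservative choice made for this sketch.

```python
import gzip
import urllib.error
import urllib.request


def conditional_headers(last_modified=None):
    """Build request headers that let the server skip or compress the body."""
    headers = {"Accept-Encoding": "gzip"}
    if last_modified:
        # The server may answer 304 Not Modified if nothing changed since then.
        headers["If-Modified-Since"] = last_modified
    return headers


def looks_unchanged(response_headers, cached):
    """True only if Last-Modified and Content-Length both match the cache."""
    for name in ("Last-Modified", "Content-Length"):
        new, old = response_headers.get(name), cached.get(name)
        if new is None or old is None or new != old:
            return False
    return True


def fetch_if_changed(url, cached):
    """Return the page body as bytes, or None if it appears unchanged.

    `cached` is a hypothetical dict of header values from the previous run.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(cached.get("Last-Modified"))
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            if looks_unchanged(response.headers, cached):
                return None  # headers match: skip reading the body
            body = response.read()
            if response.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            return body
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib reports 304 Not Modified as an error
            return None
        raise
```

Servers that ignore If-Modified-Since simply return the full page, so the header comparison acts as a second, cheaper check before the body is read.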
Method
- Flatten the HTML DOM tree into a sequence of paragraphs.
- Apply a diff algorithm to detect inserted and deleted paragraphs.
- Filter out irrelevant changes using a linear combination of the standard scores of (#[anchored text] / #[whole text]) and (log #[whole text]), computed per page ("#[X]" means "the length of X").
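The three steps above might look roughly like the following sketch. It is not the author's implementation: the set of block-level tags, the weights (-1.0 for the anchor ratio, +1.0 for the log length), the threshold of 0, and all function names are assumptions chosen for illustration, and the standard scores are assumed to be computed over the paragraphs of a single page.

```python
import difflib
import math
import statistics
from html.parser import HTMLParser

# Assumed set of tags that start or end a paragraph.
BLOCK_TAGS = {"p", "div", "li", "tr", "td", "th", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class ParagraphExtractor(HTMLParser):
    """Flatten HTML into (text, anchored_length) paragraphs."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunks, self._anchored = [], 0
        self._anchor_depth = self._skip_depth = 0

    def flush(self):
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append((text, self._anchored))
        self._chunks, self._anchored = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1
        elif tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._chunks.append(data)
        if self._anchor_depth:
            # Count characters of link text, with whitespace normalized.
            self._anchored += len(" ".join(data.split()))


def flatten(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.paragraphs


def inserted_indices(old, new):
    """Indices of paragraphs in `new` that the diff marks as added."""
    matcher = difflib.SequenceMatcher(
        None, [t for t, _ in old], [t for t, _ in new], autojunk=False)
    return [j for op, _, _, j1, j2 in matcher.get_opcodes()
            if op in ("insert", "replace") for j in range(j1, j2)]


def z_scores(values):
    mean, spread = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / spread if spread else 0.0 for v in values]


def relevance(paragraphs, w_anchor=-1.0, w_length=1.0):
    """Linear combination of the two standard scores (weights are guesses)."""
    if not paragraphs:
        return []
    ratios = z_scores([anchored / len(text) for text, anchored in paragraphs])
    lengths = z_scores([math.log(len(text)) for text, _ in paragraphs])
    return [w_anchor * r + w_length * n for r, n in zip(ratios, lengths)]


def relevant_updates(old_html, new_html, threshold=0.0):
    old, new = flatten(old_html), flatten(new_html)
    scores = relevance(new)
    return [new[i][0] for i in inserted_indices(old, new)
            if scores[i] > threshold]
```

With these weights, short paragraphs that consist mostly of link text (typical of ads and navigation) score low, while long paragraphs with little link text score high, so no word blacklist is needed.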
Usage
- Write the URIs to check, one per line, in the ~/.www-list file.
python /path/to/wwwChecker
- A Web browser starts automatically when the check finishes. If it does not, open ~/.www-check.html manually.
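As a rough illustration of how a one-URI-per-line list could be read and processed in parallel, here is a minimal sketch. The function names, the default of four worker threads, and the `check` callback are hypothetical, and skipping blank lines is an assumption of this sketch; it does not describe how wwwChecker parses ~/.www-list internally.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_uri_list(path):
    """Return the non-empty lines of a list file, one URI per line."""
    lines = Path(path).expanduser().read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def check_all(uris, check, workers=4):
    """Apply check(uri) to every URI using up to `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input URIs.
        return dict(zip(uris, pool.map(check, uris)))
```

For example, `check_all(read_uri_list("~/.www-list"), some_check)` would call a user-supplied `some_check` function on each listed URI and collect the results in a dictionary keyed by URI.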
Yasuhiro Fujii <y-fujii at mimosa-pudica.net>