WEB SECRETARY Version 1.11
1. OVERVIEW
Web Secretary is a web page monitoring tool. However, it goes beyond the
functionality normally offered by such software. Not only does it detect
changes based on content analysis (instead of date/time stamps or simple
textual comparison), it will also email the changed page to you WITH THE NEW
CONTENT HIGHLIGHTED!
Web Secretary is actually a suite of two Perl scripts called websec and
webdiff. websec retrieves web pages and emails them to you based on a URL
list that you provide. webdiff compares two web pages (current and archived)
and creates a new page based on the current page, with all the differences
highlighted in a predefined color.
If, like me, you are a Web junkie who regularly monitors a large number of
web pages, you should find Web Secretary very useful.
2. DEPENDENCIES
Web Secretary should be able to run on all Unix systems. At present, it has
only been tested on Linux.
Web Secretary requires a Perl interpreter on your system to run the scripts.
It also relies on 'lynx' to retrieve web pages, 'metasend' to send the web
pages, and 'mail' to send warning messages when a web page cannot be
retrieved.
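One quick way to verify that these programs are available on your system is:
   which perl lynx metasend mail
('metasend' is usually part of the metamail package.)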
3. INSTALLATION AND CONFIGURATION
Installing Web Secretary is easy:
- Un-tar the distribution. The files will be uncompressed into a directory
called websec/.
- Change directory to websec/.
- Edit the first lines in websec and webdiff to reflect the actual location
of the Perl interpreter on your system.
- Edit the URL list called url.list. Please refer to SECTION 5 for more
information on this.
- Edit the ignore keyword file general.ignore. Please refer to SECTION 6
for more information on this.
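For example, a typical installation session might look like this (the
archive filename is an assumption; use the one supplied with your
distribution):
   tar xvf websec-1.11.tar      # unpacks into websec/
   cd websec
   which perl                   # locate your Perl interpreter
   vi websec webdiff            # fix the #! lines if necessary
   vi url.list general.ignore   # set up your URLs and ignore keywords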
4. USAGE
You can run Web Secretary whenever you want to check the pages in your
URL list for changes by typing 'websec <URL list>'.
Alternatively, you can add Web Secretary to your crontab and run it on a
regular basis (e.g. daily).
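For example, a crontab entry like the following (the paths are assumptions;
adjust them to match your installation) checks your URL list every morning
at 8am:
   0 8 * * * /home/user/websec/websec /home/user/websec/url.list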
5. URL LIST
The URL list consists of one or more sections separated by blank lines.
The following keywords are recognized in each section:
URL     - URL of web page to monitor.
Auth    - Authentication information in userid:passwd format.
          Put "none" if no authentication is needed.
Name    - Name of web site. Pages delivered to you will have the
          following format: "Name - Date (Day)", e.g. "PC Magazine -
          4 Sep 98 (Fri)".
Prefix  - Prefix of the filenames for the archive files of web pages
          created by Web Secretary.
Diff    - Put "none" if you want Web Secretary to always mail this
          page to you instead of checking for and highlighting
          changes in the page. Put "webdiff" if you want Web
          Secretary to check for changes.
Hicolor - Color used to highlight new or changed content. Currently,
          four colors are defined: blue, pink, yellow and grey.
Ignore  - Comma-delimited list of files containing ignore keywords.
          There must be NO SPACES between the delimiters and the
          filenames.
Email   - Email address to send changed pages to.
Any line which begins with a '#' is treated as a comment and ignored.
If a section does not contain a URL entry, the values provided will be
treated as the default for the following sections.
For example,
# Defaults
Auth = none
Diff = webdiff
Hicolor = blue
Ignore = general.ignore,months.ignore
Email = vchew@pos1.com
# Web page to monitor which does not require authentication
URL = http://browserwatch.iworld.com/news.html
Name = Browser Watch
Prefix = browsewatch
# New defaults with authentication information
Auth = user:password
# More web pages to monitor which require authentication
URL = http://www.infoworld.com
Name = Infoworld
Prefix = infoworld
URL = http://developer.javasoft.com/
Name = Java Developer Central
Prefix = jdc
6. IGNORE KEYWORDS
Ignore keywords are useful when you want to ignore sections which contain
certain words when determining whether a particular section is new or has
changed.
For example, pages like InfoWorld, PC Magazine and PC Week contain the
current date regardless of whether there is new or changed content. In such
cases, you might want to ignore any section which contains month
information.
You can also use ignore keywords to skip sections which contain online ads
and other irrelevant information.
To use ignore keywords, prepare a text file containing all the ignore
keywords, one per line. Remember that keyword matching is performed at
word boundaries, so substrings will not match.
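For example, a hypothetical ads.ignore file for skipping advertising
sections could contain:
   advertisement
   sponsor
   banner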
Then, in the appropriate section in the URL list, insert a line:
Ignore = mykeywords.ignore
If you want to use multiple ignore keyword files, use:
Ignore = mykeywords1.ignore,mykeywords2.ignore ... etc
If you use certain ignore keyword files regularly, you might want to add
them to a defaults section in the URL list.
Three ignore keyword files are supplied by default. months.ignore contains
all the months of the year and their short forms. days.ignore contains
similar information for days of the week. general.ignore contains some
general ignore keywords which you may want to use.
7. HISTORY
1.11 - Released on 10 Oct 1998
* Minor modification to the comparison algorithm so that it won't be fooled
by extra spaces in the tokens.
1.1 - Released on 25 Sep 1998
* Improved the detection algorithm for multiple consecutive mangled HTML
tags so that they will not be incorrectly highlighted.
* Support for Javascript and stylesheet tags so that they will not be
incorrectly highlighted.
1.0 - First released on 4 Sep 1998.
The idea for this tool originated from a software package called Tierra
Highlights for the PC (http://www.tierra.com). I tried it out for a while
and found it to be extremely useful. However, like most PC tools, it was
closely tied to the PC that you installed the software on. If you are
working on some other computer, you will be unable to access the pages
being monitored. At that time, I was already convinced that email is the
best "push" platform the world has ever seen, so why not deliver the changed
pages via email?
I bounced the idea around for a while amongst friends and colleagues, and
when I could not find any sucker to write this for me :-), I wrote the first
version in a crazy moment of unrest as a shell script. However, this first
version was not very configurable, so I quickly wrote the second version in
Perl.
At that point, however, the program did nothing but retrieve pages and
email them to you. I then added a quick hack to diff an archived page
against the current page before deciding whether to email it, but the
scheme proved too brittle for detecting changes in most cases.
I lived with this scheme for a while. Finally, lunacy got the better of me.
I figured out a quick and dirty way of doing what Tierra Highlights does,
and actually thought I could implement the whole idea in one day. It took
two days instead, and the initial version sucked like hell and failed
miserably on many pages. However, you should have seen the grin on my face
when it highlighted PC Magazine and PC Week properly. :-)
Like most programmers who are crazy enough to think that they can do "this
thing" in one day, I spent the next two weeks feverishly debugging the
project. Every day, I would add new pages to the URL list and debug those
which failed to be highlighted. Finally, I had something which I could use
every day and was prepared to share with the rest of the world.
8. ACKNOWLEDGEMENT
I would like to thank the GNU people. I don't know them personally, but they
have blessed us with great free tools such as Linux, gcc, emacs, Perl,
fetchmail, etc., which I now use on a daily basis. Following the trail of
their selfless spirit, I would like to share Web Secretary in the same way,
and I hope many people besides me find it useful.
I would also like to thank Chng Tiak Jung, a friend and mentor who inspires
me to learn at least one new thing every day. I am sure that if he continues
at his current pace, I will never be able to catch up with him!