WEB SECRETARY Version 1.11
1. OVERVIEW
Web Secretary is a web page monitoring tool. However, it goes beyond the
functionality normally offered by such software. Not only does it detect
changes based on content analysis (instead of date/time stamps or simple
textual comparison), it will also email the changed page to you WITH THE NEW
CONTENT HIGHLIGHTED!
Web Secretary is actually a suite of two Perl scripts called websec and
webdiff. websec retrieves web pages and emails them to you based on a URL
list that you provide. webdiff compares two web pages (current and archived)
and creates a new page based on the current page, with all the differences
highlighted in a predefined color.
If, like me, you are a Web junkie who regularly monitors a large number of
web pages, you should find Web Secretary very useful.
2. DEPENDENCIES
Web Secretary should be able to run on all Unix systems. At present, it has
only been tested on Linux.
Web Secretary requires a Perl interpreter on your system to run the scripts.
It also relies on 'lynx' to retrieve web pages, 'metasend' to send the web
pages, and 'mail' to send warning messages when a web page cannot be
retrieved.
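One quick way to verify that these programs are available on your system is:
   which perl lynx metasend mail
('metasend' is usually part of the metamail package.)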
3. INSTALLATION AND CONFIGURATION
Installing Web Secretary is easy:
- Un-tar the distribution. The files will be uncompressed into a directory
called websec/.
- Change directory to websec/.
- Edit the first lines in websec and webdiff to reflect the actual location
of the Perl interpreter on your system.
- Edit the URL list called url.list. Please refer to SECTION 5 for more
information on this.
- Edit the ignore keyword file general.ignore. Please refer to SECTION 6
for more information on this.
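For example, a typical installation session might look like this (the
archive filename is an assumption; use the one supplied with your
distribution):
   tar xvf websec-1.11.tar      # unpacks into websec/
   cd websec
   which perl                   # locate your Perl interpreter
   vi websec webdiff            # fix the #! lines if necessary
   vi url.list general.ignore   # set up your URLs and ignore keywords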
4. USAGE
You can run Web Secretary whenever you want to check the pages in your
URL list for changes by typing 'websec <URL list>'.
Alternatively, you can add Web Secretary to your crontab and run it on a
regular basis (e.g. daily).
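For example, a crontab entry like the following (the paths are assumptions;
adjust them to match your installation) checks your URL list every morning
at 8am:
   0 8 * * * /home/user/websec/websec /home/user/websec/url.list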
5. URL LIST
The URL list consists of one or more sections separated by blank lines.
The following keywords are recognized in each section:
URL     - URL of web page to monitor.
Auth    - Authentication information in userid:passwd format.
          Put "none" if no authentication is needed.
Name    - Name of web site. Pages delivered to you will have the
          following format: "Name - Date (Day)", e.g. "PC Magazine -
          4 Sep 98 (Fri)".
Prefix  - Prefix of the filenames for the archive files of web pages
          created by Web Secretary.
Diff    - Put "none" if you want Web Secretary to always mail this
          page to you instead of checking for and highlighting
          changes in the page. Put "webdiff" if you want Web
          Secretary to check for changes.
Hicolor - Color used to highlight new or changed content. Currently,
          four colors are defined: blue, pink, yellow and grey.
Ignore  - Comma-delimited list of files containing ignore keywords.
          There must be NO SPACES between the delimiters and the
          filenames.
Email   - Email address to send changed pages to.
Any line which begins with a '#' is treated as a comment and ignored.
If a section does not contain a URL entry, the values provided will be
treated as the default for the following sections.
For example,
# Defaults
Auth = none
Diff = webdiff
Hicolor = blue
Ignore = general.ignore,months.ignore
Email = vchew@pos1.com
# Web page to monitor which does not require authentication
URL = http://browserwatch.iworld.com/news.html
Name = Browser Watch
Prefix = browsewatch
# New defaults with authentication information
Auth = user:password
# More web pages to monitor which require authentication
URL = http://www.infoworld.com
Name = Infoworld
Prefix = infoworld
URL = http://developer.javasoft.com/
Name = Java Developer Central
Prefix = jdc
6. IGNORE KEYWORDS
Ignore keywords are useful when you want to ignore sections which contain
certain words when determining whether a particular section is new or has
changed.
For example, pages like InfoWorld, PC Magazine and PC Week contain the
current date regardless of whether there is new or changed content. In such
cases, you might want to ignore any section which contains month
information.
You can also use ignore keywords to skip sections which contain online ads
and other irrelevant information.
To use ignore keywords, prepare a text file containing all the ignore
keywords, one per line. Remember that keyword matching is performed at
word boundaries, so substrings will not match.
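For example, a hypothetical ads.ignore file for skipping advertising
sections could contain:
   advertisement
   sponsor
   banner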
Then, in the appropriate section in the URL list, insert a line:
Ignore = mykeywords.ignore
If you want to use multiple ignore keyword files, use:
Ignore = mykeywords1.ignore,mykeywords2.ignore ... etc
If you use certain ignore keyword files regularly, you might want to add
them to a defaults section in the URL list.
Three ignore keyword files are supplied by default. months.ignore contains
all the months of the year and their short forms. days.ignore contains
similar information for days of the week. general.ignore contains some
general ignore keywords which you may want to use.
7. HISTORY
1.11 - Released on 10 Oct 1998
* Minor modification to the comparison algorithm so that it won't be fooled
by extra spaces in the tokens.
1.1 - Released on 25 Sep 1998
* Improved the detection algorithm for multiple consecutive mangled HTML
tags so that they will not be incorrectly highlighted.
* Support for Javascript and stylesheet tags so that they will not be
incorrectly highlighted.
1.0 - First released on 4 Sep 1998.
The idea for this tool originated from a software package called Tierra
Highlights for the PC (http://www.tierra.com). I tried it out for a while
and found it to be extremely useful. However, like most PC tools, it was
closely tied to the PC that you installed the software on. If you are
working on some other computer, you will be unable to access the pages
being monitored. At that time, I was already convinced that email is the
best "push" platform the world has ever seen, so why not deliver the changed
pages via email?
I bounced the idea around for a while amongst friends and colleagues, and
when I could not find any sucker to write this for me :-), I wrote the first
version in a crazy moment of unrest as a shell script. However, this first
version was not very configurable, so I quickly wrote the second version in
Perl.
At that point, however, the program did nothing but retrieve pages and
email them to you. I then added a quick hack to diff an archived page
against the current page before deciding whether to email it, but the
scheme proved too brittle for detecting changes in most cases.
I lived with this scheme for a while. Finally, lunacy got the better of me.
I figured out a quick and dirty way of doing what Tierra Highlights does,
and actually thought I could implement the whole idea in one day. It took
two days instead, and the initial version sucked like hell and failed
miserably on many pages. However, you should have seen the grin on my face
when it highlighted PC Magazine and PC Week properly. :-)
Like most programmers who are crazy enough to think that they can do "this
thing" in one day, I spent the next two weeks feverishly debugging the
project. Every day, I would add new pages to the URL list and debug those
which failed to be highlighted. Finally, I had something which I could use
every day and was prepared to share with the rest of the world.
8. ACKNOWLEDGEMENT
I would like to thank the GNU people. I don't know them personally, but they
have blessed us with great free tools such as Linux, gcc, emacs, Perl,
fetchmail, etc., which I now use on a daily basis. Following the trail of
their selfless spirit, I would like to share Web Secretary in the same way,
and I hope many people besides me find it useful.
I would also like to thank Chng Tiak Jung, a friend and mentor who inspires
me to learn at least one new thing every day. I am sure that if he continues
at his current pace, I will never be able to catch up with him!