Abstract
When people access the web, their activities fall into two broad categories: they are either searching for specific information or browsing (Marko & Yoav 1995). There have been several efforts to support both browsing and searching. Searching, though narrower than browsing, can be time-consuming given the current exponential growth in the volume of information accessible over the Internet. This is a proposal to develop a system that helps narrow down the domain of the search. The system “learns” to identify words that the user will find interesting. It first presents the user with a list of keywords from the document in the master window, which the user evaluates as RELEVANT or NOT-RELEVANT. This information serves as the standard for the learning system, which maintains a “user preferences” file. The system then creates a clickable list of “interesting words” in a slave window. The user can open the documents one after another and read the context relevant to him by clicking on a keyword (chosen by the trained system) in the slave window. The goal of this system, with its summarizer and indexer, is to improve browsing capabilities within a document that is itself the result of a search. It thus increases the value of the HTML document as a whole.
Definitions:
- KEYWORD: A word that the system evaluates as high priority, using position, frequency and other factors.
- INTERESTING WORD: A word from a narrower category: the words that the system, after training, presents as a clickable index in the slave window.
A Real World Analogy Of The Problem:
X is taking a course in Networking and wants to read about OSI layering. He goes to the library and gets the results of the subject search “Networks”. He now has a long list of call numbers, and it could take a long time to scan through all the books to find the few that actually contain a detailed explanation of the OSI layers. It would be easier if he could go directly to a list of books containing information on the specific “interesting word”. Internet search presents a similar situation. A keyword search can be too broad for a user, and getting to the documents of immediate interest to him can be tedious.
To alleviate the above situation, this is a proposal to discuss an intelligent agent, hereafter called the INDEXER.
Architectural Components:
- Netscape 1.1N
- Perl
- A custom proxy server: It acts as an intermediate server which provides a path through which all browser requests pass.
- An INDEXER program
- A SUMMARIZER program by Joe Felder, which will be included in the package.
- A master client (Netscape window in which the document will be displayed).
- An INDEXER slave client (Netscape window with clickable keywords).
- A SUMMARIZER slave client (Netscape window in which the TOC will be displayed).
Solution Approach:
The approach is based on the vector space information retrieval paradigm. Word weights will be computed using a specific scheme, and an “interestingness” value will be assigned to each word. The learning algorithm will be supplied with the number of keywords and the number of documents. Each document will be represented as a vector D, with each element being a keyword. The document also has a vector V, where each element is the weight of the corresponding word in the document. Word weights will be computed using an appropriate weighting formula. The words are then presented to the user on the menu screen so that they can be rated on a scale of -5 to +5. The system trains itself using an appropriate learning algorithm to identify the appropriate words.
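To make the two vectors concrete, here is a minimal sketch in Python. It is not part of the proposed system (which lists Perl as a component). The function name `document_vectors` is hypothetical, and raw term frequency is used only as a stand-in weight, since the weighting formula is introduced later in the proposal.

```python
from collections import Counter

def document_vectors(words):
    """Return (D, V): D lists the distinct keywords, V the parallel weights.

    Assumption: raw term frequency stands in for the real weighting formula.
    """
    counts = Counter(words)
    D = sorted(counts)            # keyword vector, sorted for a stable order
    V = [counts[w] for w in D]    # weight vector aligned element by element with D
    return D, V

D, V = document_vectors(["osi", "layer", "network", "osi", "layer", "osi"])
print(D)
print(V)
```

Keeping D and V as parallel lists mirrors the description above; a dictionary from keyword to weight would carry the same information.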
The Blue Print For Indexing:
1. Obtain training document (Display in master window).
2. Identify individual text words.
3. Use stop list to delete common words.
4. Use suffix stripping algorithms.
5. Have the user mark the retrieved words as relevant (score=1) or non-relevant (score=0).
6. Compute term weights of relevant words using prescribed formula.
7. Place words and weights in user preferences file.
8. Obtain new document and repeat steps 2-4.
9. Place the keywords in a document vector.
10. Find the similarity coefficient of the user preferences file (u.p.f.) and the document vector.
11. Find weights of newly found terms.
12. Reformulate contents of u.p.f. using relevance feedback formula.
13. Create a clickable index of “interesting words” which are the contents of u.p.f. (Display in slave window)
14. Return to step 8
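Steps 2 to 4 of the blueprint can be sketched as follows. This is an illustration under stated assumptions, not the INDEXER itself: the stop list is a tiny hypothetical sample, and the suffix stripper is a deliberately crude stand-in rather than the heuristics of any particular stemmer. The names `STOP_WORDS`, `SUFFIXES`, `strip_suffix` and `extract_terms` are made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Crude suffix list for illustration only; not a published stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    words = re.findall(r"[a-z]+", text.lower())          # step 2: individual words
    words = [w for w in words if w not in STOP_WORDS]    # step 3: apply stop list
    return [strip_suffix(w) for w in words]              # step 4: suffix stripping

print(extract_terms("The layers of the OSI model are used in networking"))
```

The minimum stem length keeps short words such as "used" from being cut down to fragments; a production stemmer would use more careful rules.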
Learning and Relevance Feedback:
The assumption is that the user has a specific subject in mind and already has a broad list of documents containing references to the subject, so all documents are related to the subject. The user may introduce one or more training documents; otherwise, the contents of the user preferences file are used by default.
If the user chooses to provide a training document, preferably the one with the highest hit rate in the keyword search, it is opened in the master window. The INDEXER identifies the keywords in this document and presents the list to the user, who scores each word 0 (not relevant) or 1 (relevant). This continues until the user is satisfied with the training data, which is stored in the user preferences file along with the corresponding weights. The INDEXER computes the weights using the following formula:
w(i,j) = t(i,j) * log(N/d(i))
where
- w(i,j) = weight of the ith word in the jth document
- t(i,j) = frequency of the ith term in the jth document
- N = number of documents evaluated
- d(i) = number of documents in which word i appears
The preference vector P is the vector whose elements are the weights of the relevant words, placed in order:
P = w(i,j) for all i, j
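The weighting formula and the preference vector can be illustrated with a short Python sketch. The function names `term_weights` and `preference_vector` are hypothetical, the natural logarithm is an assumption (the proposal does not state a base), and vectors are stored as dictionaries from term to weight for convenience.

```python
import math
from collections import Counter

def term_weights(documents):
    """documents: list of term lists. Returns one {term: weight} dict per document,
    using w(i,j) = t(i,j) * log(N / d(i))."""
    n_docs = len(documents)                 # N
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))         # d(i): each document counts once per term
    weights = []
    for terms in documents:
        tf = Counter(terms)                 # t(i,j)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

def preference_vector(doc_weights, relevant):
    """Keep only the weights of terms the user scored 1 (relevant)."""
    return {t: w for t, w in doc_weights.items() if t in relevant}

docs = [["osi", "layer", "osi"], ["layer", "router"]]
w = term_weights(docs)
P = preference_vector(w[0], relevant={"osi"})
print(w[0])
print(P)
```

Note that a term appearing in every evaluated document (here "layer") gets weight 0, because log(N/d(i)) is log(1).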
Once the system is trained to satisfaction, a new document is retrieved and its keywords are determined. The weights of the new words, d(i,j), computed using the above formula, are placed in a document vector:
D = d(i,j) for all i, j
The similarity coefficient allows the system to adjust its parameters and produce a more precise list of “interesting” words. The system continues training until it can identify the “interesting” words on its own for the rest of the documents. It then displays a clickable index of “interesting” words for each document.
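The proposal names neither the similarity coefficient nor the relevance feedback formula. The sketch below assumes cosine similarity and a Rocchio-style positive update (P' = alpha*P + beta*D), both common choices in vector space retrieval. The function names and the `alpha` and `beta` defaults are hypothetical.

```python
import math

def cosine_similarity(p, d):
    """p, d: {term: weight} dicts. Returns 0.0 if either vector has zero length."""
    dot = sum(w * d.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_p * norm_d)

def update_preferences(p, d, alpha=1.0, beta=0.5):
    """Assumed Rocchio-style feedback: blend document weights into the preferences."""
    terms = set(p) | set(d)
    return {t: alpha * p.get(t, 0.0) + beta * d.get(t, 0.0) for t in terms}

P = {"osi": 2.0, "layer": 1.0}       # preference vector (example values)
D = {"osi": 1.0, "router": 2.0}      # document vector (example values)
sim = cosine_similarity(P, D)
P2 = update_preferences(P, D)
print(sim)
print(P2)
```

In a loop over new documents, the update would typically be applied only when the similarity or the user's feedback marks the document as relevant; that policy is left open here, as it is in the proposal.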
Future Work
- Integrating learning capacity into the INDEXER.
- “Stemming of words” will be included as part of INDEXER and will be based on the heuristics used by the ‘SMART’ system.
- It would be nice to have a pop-up window to set up selection heuristics and display learning, but for now this will be done transparently.
- More advanced visual techniques for choosing keywords (TAU system, Swaminathan 1993) can be added as a further development but may not be included in this project.
- “Synonym analysis” can also be added as a feature of the system.