Tutorials Quicklink
- Converting a SMART collection
- Adding a new collection conversion utility
- Parsing a collection
- Doing Single-query Vector-space Retrieval
- Doing Multi-query Vector-space Retrieval
Tutorial - Converting a SMART collection
- With TextMOLE open, click on the Convert tab.
- Press the Open File button, and browse to the SMART file you wish to convert to XML.
- Verify that SMART is selected as the conversion type, and then press the Convert button.
- The conversion should take only a few seconds. A black console window pops up and then disappears when the conversion is done. The new XML file is written to the same directory as the original, with a .xml extension.
- Use this procedure for both the query file and the collection file.
- The SMART query results files, of the format "[querynum] [relevantdoc] 0 0", are already in the proper format and do not need to be converted.
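The results-file format described above is simple enough to read with a few lines of code. The following is a minimal sketch in portable C++, not TextMOLE's own reader; the names RelevanceMap and readRelevance are hypothetical. It keeps only the first two integers on each line and ignores whatever follows them.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical type: maps a query number to the document numbers
// judged relevant to that query.
using RelevanceMap = std::map<int, std::set<int>>;

// Reads lines of the form "[querynum] [relevantdoc] 0 0".
// Only the first two integers on a line are used; trailing fields are
// ignored. Lines that do not start with two integers are skipped.
RelevanceMap readRelevance(std::istream& in) {
    RelevanceMap relevant;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int query = 0;
        int doc = 0;
        if (fields >> query >> doc) {
            relevant[query].insert(doc);
        }
    }
    return relevant;
}
```

Passing an std::ifstream opened on the results file to readRelevance gives one set of relevant document numbers per query.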
Tutorial - Adding a new collection conversion utility to TextMOLE
- With the TextMOLE source code open, open up DataMine.cpp.
- Most of this file is the WndProc function, the window procedure through which Windows delivers messages to the program. In this function, locate the WM_CREATE case. WM_CREATE is sent once, when the window is first created, so this case is where the initialization code goes.
- Inside the WM_CREATE case, the first batch of controls being created are for tab 0 - the convert tab. After the block of code that creates the Conversion type combobox, we now wish to add another item to its display:
strncpy(buffer, "<COLLECTION_NAME>", <NAME_LENGTH + 1>);
dwIndex = SendMessage(w, CB_ADDSTRING, 1, (LPARAM) (LPCSTR) buffer);
SendMessage(w, CB_SETITEMDATA, dwIndex, (LPARAM)(DWORD)1);
Note that the index values change from 0 (in the original SMART code) to 1. Increment them again for each further conversion tool you add.
- Now that we have added a new option to the control, we must tell the WndProc function to handle this new case when it is selected. Search for BtnDoConv, or locate the case under WM_COMMAND/BN_CLICKED.
- Look for the switch statement. It should currently have only one case besides the default (the one for the SMART conversion utility). Add a new case as follows:
case 1: // this case is for a new collection
    s = "<COLLECTION_NAME> ";
    // First argument must always be the file name location
    s.append(sText);
    s.append(" <DELIMITER>"); // Second argument is the document delimiter...
    if (system(s.c_str()) == -1)
        MessageBox(hWnd, "ERROR IN SYSTEM CALL", "ERROR", 0);
    break;
Note the space after the collection name and the space before the delimiter. The delimiter is the tag in the collection file that marks the start of each new document. In some collections, this will just be the first field of each document.
- Assuming you have put your conversion executable in the same folder as TextMOLE, you are now finished.
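The command string assembled in the new case can also be factored into a small helper. This is a sketch, not code from DataMine.cpp: buildConvertCommand is a hypothetical name, and quoting the file path is an added assumption (the case above appends sText unquoted) so that a path containing spaces is still passed as a single argument.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: assembles the command line handed to system().
// Layout follows the tutorial: "<tool> <input file> <delimiter>".
std::string buildConvertCommand(const std::string& tool,
                                const std::string& inputPath,
                                const std::string& delimiter) {
    assert(!tool.empty() && !inputPath.empty());
    std::string s = tool;
    s.append(" ");
    // Assumption: wrap the path in double quotes so spaces do not split it.
    s.append("\"").append(inputPath).append("\"");
    s.append(" ");
    s.append(delimiter);
    return s;
}
```

Inside the switch, the result would be used as system(buildConvertCommand("<COLLECTION_NAME>", sText, "<DELIMITER>").c_str()), with the same error check as before.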
Tutorial - Parsing a collection
NOTE: As of 03/05/06, you can only parse one collection per run (it does not yet deallocate memory nicely). If you want to parse another collection, please shut down TextMOLE and rerun it.
- With TextMOLE open, verify you are at the "Setup" tab (it should be the default one when opened).
- Click on the Open File button and choose a collection that is in proper XML format.
- TextMOLE will do an initial read-through of the collection to gather tag information. This should be barely noticeable with small collections.
- When it is done, the tag information will be populated in the "Document tag" and "All Fields" controls.
- First, verify that the tag displayed in "Document tag" is the proper document delimiter tag. TextMOLE assumes that the first tag it finds is the one that delimits documents. If this is not the case, change this combobox to the proper delimiter tag.
- Second, choose the tags in "All Fields" whose data you want TextMOLE to parse. You can choose a tag by double-clicking on it or by highlighting it and pressing the "=>" button.
There may be only one tag that can be parsed. For example, in MED, there is only the W (abstract) tag. In CISI, there are W, A (author), T (title), and other tags that could be useful.
- Select whether or not you want Porter stemming applied. Stemming is generally recommended, although we kept the option to turn it off for classroom demonstrations or research tasks.
- Deselect the Use Default Stop List checkbox if you do not want to use the default stop list (defaultstop.txt). A stop list is a list of words that the parsing algorithm throws out when it reads them in. The default stop list contains many of the most common words in the English language.
- Select whether you want to use your own stop list in the "Use A Custom Stop List" checkbox. This can be used in addition to or instead of the default stop list. If selected, you must press the Browse button to tell the parser the location of the custom stop file.
- Press the Parse button and wait for the progress bar to finish. Information about the parse appears in the pane on the left when it is done. The number of documents in the collection and the number of global terms appear in the status bar.
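To make the stop-list step concrete, here is a minimal sketch of stop-word filtering in portable C++. It is not TextMOLE's parser. The names StopList, addStopWords, and tokenize are hypothetical, and the sketch assumes that a stop file is a plain list of whitespace-separated words, that terms are lowercased, and that any non-alphanumeric character ends a term. Stemming is left out.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

using StopList = std::unordered_set<std::string>;

// Adds every whitespace-separated word from a stop file to the list.
// Call it once per file to combine a default and a custom stop list.
void addStopWords(std::istream& in, StopList& stops) {
    std::string word;
    while (in >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        stops.insert(word);
    }
}

// Splits text into lowercase alphanumeric terms and drops stop words.
std::vector<std::string> tokenize(const std::string& text, const StopList& stops) {
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            if (stops.count(current) == 0) terms.push_back(current);
            current.clear();
        }
    }
    // Flush the last term if the text does not end with a separator.
    if (!current.empty() && stops.count(current) == 0) terms.push_back(current);
    return terms;
}
```

Calling addStopWords for only one of the two files corresponds to using the custom list instead of the default one.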
Tutorial - Single-query Vector-space Retrieval
NOTE: You must have first parsed a collection in the "Setup" tab
- After parsing a collection, click on the "Search" tab.
- Choose which weighting parameters you want under Document Weighting and Query Weighting. To do vector-space retrieval without any special weighting, leave these untouched.
- At the top, below the tabs, there is a text box with a Search button next to it. Type your query here and press the Search button.
- The search time should be barely noticeable. When the search is done, TextMOLE switches to the "Results" tab. The number of documents returned appears in the status bar. The query results appear on the left-hand side, along with their relevance scores.
- Click on a result to get the full text extracted from the collection. Go back to the "Search" tab to run a new query.
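The tutorial does not state how the relevance score is computed. As an illustration only, the sketch below uses cosine similarity over raw term counts, which is one common way to score documents in vector-space retrieval without special weighting. The names TermVector, termFrequencies, and cosineSimilarity are hypothetical and are not taken from the TextMOLE source.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Sparse vector: term -> weight (here, the raw count of the term).
using TermVector = std::map<std::string, double>;

// Counts how often each term occurs in a parsed document or query.
TermVector termFrequencies(const std::vector<std::string>& terms) {
    TermVector tf;
    for (const std::string& t : terms) {
        tf[t] += 1.0;
    }
    return tf;
}

// Cosine of the angle between two term vectors.
// Returns 0 when either vector is empty.
double cosineSimilarity(const TermVector& a, const TermVector& b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (const auto& kv : a) {
        normA += kv.second * kv.second;
        auto it = b.find(kv.first);
        if (it != b.end()) dot += kv.second * it->second;
    }
    for (const auto& kv : b) {
        normB += kv.second * kv.second;
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Scoring every document against the query vector and sorting by descending score gives a ranked list like the one on the "Results" tab.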
Tutorial - Multi-query Vector-space Retrieval
NOTE: You must have first parsed a collection in the "Setup" tab
- After parsing a collection, click on the "Search" tab.
- Choose which weighting parameters you want under Document Weighting and Query Weighting. To do vector-space retrieval without any special weighting, leave these untouched.
- Click on the Load Query File button to load a file containing the multiple queries.
Important: These queries must be in XML format.
In addition, they must be delimited by the <W> tag. This is a bug that will be fixed soon (03/05/06).
- Click on the Load Results File button to load the file describing which documents were relevant to which queries.
Important: Each line of this results file MUST begin with [querynum] [docrelevant]. Anything after those two items on a line (such as extra 0 fields) is discarded.
- Next press the Configure button and select the retrieval parameters you want.
Important: In this version, the File Information textbox is disabled for debugging purposes, and the field is hardcoded to the "W" tag.
- Finally, press the Run Analysis button.
- The progress bar tracks the analysis. When it is done, TextMOLE switches to the "Results" tab.
It shows the results of the queries in the form you selected in the Configure pane.
- Go back to the "Search" tab to run a new search.
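A results file makes it possible to compare what was retrieved for a query with what was judged relevant. The tutorial does not say which measures TextMOLE reports, so the following is only a sketch of two standard ones, precision and recall, in portable C++. The names PrecisionRecall and evaluate are hypothetical, and the sketch assumes that each retrieved document number appears once.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct PrecisionRecall {
    double precision;  // fraction of retrieved documents that are relevant
    double recall;     // fraction of relevant documents that were retrieved
};

// Compares the documents retrieved for one query with the documents
// listed as relevant to that query in the results file.
PrecisionRecall evaluate(const std::vector<int>& retrieved,
                         const std::set<int>& relevant) {
    std::size_t hits = 0;
    for (int doc : retrieved) {
        if (relevant.count(doc) != 0) ++hits;
    }
    PrecisionRecall pr;
    pr.precision = retrieved.empty()
        ? 0.0 : static_cast<double>(hits) / retrieved.size();
    pr.recall = relevant.empty()
        ? 0.0 : static_cast<double>(hits) / relevant.size();
    return pr;
}
```

Running evaluate once per query and averaging the values gives a single summary for a whole query file.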
For questions, comments, or corrections, please email Dan Waegel