WORDSTAT TEXT QUANTIFICATION PROCESSES
Text mining, as performed by WordStat, involves some form of quantification of text data. Such quantification is achieved by applying natural language processing techniques (stemming, lemmatization, removal of stop words, etc.), statistical selection criteria, and the grouping of words and phrases into concepts using either taxonomies or custom content dictionaries. All these procedures can be combined to extract numbers representing the presence or frequency of important keywords or key concepts. We call this the categorization process. WordStat also supports another form of quantification: automatic document classification, by which each document is assigned to one of several mutually exclusive classes using some form of machine learning.
WHY A SOFTWARE DEVELOPMENT KIT?
The categorization and classification processes are performed by WordStat, which offers a graphical user interface that allows a user to create, validate and refine those processes, apply them to various text collections, perform comparisons, explore and relate results, and create graphical and tabular reports. While categorization and classification models can be saved to disk and reapplied to a different set of documents, a human operator is still required to perform those analyses, which limits the ability to fully automate text analysis and reporting operations.
The WordStat software development kit (SDK) provides a solution, allowing models developed with the WordStat desktop tool to be used in applications written in other computer languages such as C++, Delphi, C#, VB.NET and so on.
An example of such integration would be applying a categorization model within a company's system for collecting customer feedback, in order to automatically measure references to specific topics and to classify each piece of feedback as positive, negative or neutral.
© All rights reserved Provalis Research 2024
APPLYING THE SDK
All the analysis and text transformation settings defined in WordStat (stemming, lemmatization, categorization rules, selection criteria, etc.) are stored on disk in the model files. This greatly simplifies integrating such text processing into other applications, reducing the text analysis process to four easy steps:
1. Load the categorization or classification model file
2. Retrieve the text to categorize or classify
3. Apply the model to the text
4. Retrieve relevant information (frequencies, probabilities, predicted classes, etc.)
A model only needs to be loaded once, while steps #2 to #4 may be repeated as often as needed.
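The four steps can be sketched as follows. This is only an illustration in Python, not the SDK's API: the functions exported by the DLL are not described here, so `ToyModel`, its JSON file layout, and its `load` and `apply` methods are hypothetical stand-ins that merely mimic the load-once, apply-many pattern.

```python
import json
import os
import re
import tempfile
from collections import Counter


class ToyModel:
    """Hypothetical stand-in for a categorization model (NOT the WordStat SDK)."""

    def __init__(self, categories):
        # Map every keyword to the category it belongs to.
        self.keyword_to_category = {
            keyword.lower(): category
            for category, keywords in categories.items()
            for keyword in keywords
        }

    @classmethod
    def load(cls, path):
        # Step 1: load the model file (an assumed JSON layout, for illustration only).
        with open(path, encoding="utf-8") as handle:
            return cls(json.load(handle))

    def apply(self, text):
        # Step 3: apply the model; here, count keyword hits per category.
        counts = Counter()
        for word in re.findall(r"[a-z']+", text.lower()):
            category = self.keyword_to_category.get(word)
            if category is not None:
                counts[category] += 1
        return counts


def categorize_all(model_path, texts):
    model = ToyModel.load(model_path)        # step 1, performed once
    results = []
    for text in texts:                       # step 2: retrieve each text
        frequencies = model.apply(text)      # step 3: apply the model
        results.append(dict(frequencies))    # step 4: retrieve the frequencies
    return results


def demo():
    # Write a tiny hypothetical model file, then run the four steps on two texts.
    categories = {
        "PRICE": ["price", "cost", "expensive"],
        "SERVICE": ["staff", "support"],
    }
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "toy_model.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(categories, handle)
        return categorize_all(path, [
            "The price is too high and support was slow.",
            "Friendly staff, helpful support.",
        ])


if __name__ == "__main__":
    print(demo())
```

In a real integration, the body of `ToyModel` would be replaced by calls into the DLL, as shown in the sample project shipped with the SDK; the surrounding loop would keep the same shape.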
The DLL currently offers no reporting or graphing functions, so it is up to the programmer to further process the information obtained. Typically, numerical values would be either stored in a database or accumulated to create reports, dashboards, etc.
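As one possible way to handle this post-processing, the sketch below stores per-document category frequencies in a SQLite database and aggregates them for a report. It assumes the frequencies have already been obtained as a plain dictionary; the table name `category_counts`, its columns, and both function names are illustrative choices, not part of the SDK.

```python
import sqlite3


def store_frequencies(connection, document_id, frequencies):
    # Create the (hypothetical) results table on first use.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS category_counts ("
        "document_id TEXT, category TEXT, frequency INTEGER)"
    )
    # Insert one row per category found in the document.
    connection.executemany(
        "INSERT INTO category_counts VALUES (?, ?, ?)",
        [(document_id, category, count) for category, count in frequencies.items()],
    )
    connection.commit()


def totals_by_category(connection):
    # Sum the stored frequencies across all documents, e.g. to feed a dashboard.
    rows = connection.execute(
        "SELECT category, SUM(frequency) FROM category_counts "
        "GROUP BY category ORDER BY category"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    store_frequencies(db, "doc-1", {"PRICE": 1, "SERVICE": 1})
    store_frequencies(db, "doc-2", {"SERVICE": 2})
    print(totals_by_category(db))
    db.close()
```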
TECHNICAL DETAILS
The SDK consists of a Windows DLL available in both 32-bit and 64-bit versions. The DLL is thread-safe, allowing multiple documents to be quantified concurrently. It also supports applying multiple categorization and classification models simultaneously, so the same documents can be quantified in several ways.
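The concurrent use described above might be organized along the following lines. This is a sketch under stated assumptions: `count_keywords` is a hypothetical pure-Python stand-in for a call into the DLL, and each "model" is reduced to a list of keywords. Only the structure is the point, namely several models applied to the same documents from a pool of worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def count_keywords(keywords, text):
    # Hypothetical stand-in for an SDK call: count occurrences of each keyword.
    words = text.lower().split()
    return {keyword: words.count(keyword) for keyword in keywords}


def quantify_concurrently(models, documents, max_workers=4):
    # Apply every model to every document; the shared models are only read.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            (name, index): pool.submit(count_keywords, keywords, text)
            for name, keywords in models.items()
            for index, text in enumerate(documents)
        }
        # Collect results keyed by (model name, document index).
        return {key: future.result() for key, future in futures.items()}


if __name__ == "__main__":
    models = {"topics": ["price", "staff"], "sentiment": ["good", "bad"]}
    documents = ["good price good staff", "bad price"]
    for key, counts in sorted(quantify_concurrently(models, documents).items()):
        print(key, counts)
```

Keying the results by model name and document index keeps the output independent of the order in which the worker threads finish.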
The SDK comes with a sample project with source files illustrating how integration can be achieved. This sample project is currently available in Delphi, C# and VB.NET. Please contact us if you need assistance on how to use the SDK with other computer languages.