Article
· Jan 12, 2024 · 7 min read

Textual Similarity Comparison using IRIS, Python, and Sentence Transformers

With the advent of Embedded Python, a myriad of use cases are now possible from within IRIS directly using Python libraries for more complex operations. One such operation is the use of natural language processing tools such as textual similarity comparison.

 

Setting up Embedded Python to Use the Sentence Transformers Library

Note: For this article, I will be using a Linux system with IRIS installed. Some of the processes for using Embedded Python, such as installing libraries, differ between Linux and Windows, so please refer to the IRIS documentation for the proper procedures when using Windows. Once the libraries are installed, however, the IRIS code and setup should be the same on both platforms. This article uses Red Hat Linux 9 and IRIS 2023.1.

The Python library used for this article will be Sentence Transformers (https://github.com/UKPLab/sentence-transformers). To be able to use any Python library from within IRIS, it must be installed in such a way that IRIS can make use of it. On Linux, this can be done with the standard Python library installation command, but with the target set to the IRIS instance's Python directory. Note that for Sentence Transformers, your system must also already have a Rust compiler installed as a prerequisite for one of the library dependencies installed alongside sentence-transformers (https://www.rust-lang.org/tools/install).

sudo python -m pip install sentence-transformers --target [IRIS install directory]/sys/mgr/python

Once installed, Sentence Transformers can be used directly within the IRIS Python shell.

 

Text Comparison Using the Sentence Transformers Library

In order to carry out a comparison of two separate texts, we first need to generate the embeddings for each text block. An embedding is a numerical representation of the text based on its linguistic construction under the given language model. For Sentence Transformers, we can use several different pre-trained language models (https://www.sbert.net/docs/pretrained_models.html). In this case, we'll use the "all-MiniLM-L6-v2" model, which is listed as an "All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs." and should be sufficient for this article. There are many other pre-trained models to choose from, however, including some multilingual models that can detect and make use of many different languages, so choose the model that works best for your specific use case.

In the IRIS Python shell, we can generate the embeddings, which converts the given string of text into a Python vector that can then be used in the comparison function.

Once the embeddings for each text block have been created, these can then be used in the cosine_similarity function to arrive at a score representing the relative similarity between the given texts. (Cosine similarity ranges from -1 to 1, but for natural-language embeddings the score typically falls between 0 and 1.)
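For intuition, cosine similarity measures the angle between two embedding vectors rather than their magnitudes. A rough plain-Python sketch of the underlying formula (not the library's actual implementation) looks like this:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Two embeddings of similar texts point in similar directions in the model's vector space, which is why their cosine score is high.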

In my test run, the two text strings given were roughly 64% similar based on their linguistic structure.

 

Creating and Storing Text Embeddings using IRIS Embedded Python and Sentence Transformers

We can incorporate this same process into an IRIS class and store the embeddings for future use, such as comparing an existing text to a newly entered one.

Create a new class in IRIS that extends %Persistent so we can store data against the subsequent table.

Create two string properties on the class, one for the text itself and another to store the embedding data. It is important to include MAXLEN = 100000 on the Embedding property; otherwise the embedding data will not fit in the field and the comparison will fail. The MAXLEN on the Text property will depend on how much text your use case may store.

Class TextSimilarity.TextSimilarity Extends %Persistent
{
    Property Text As %String(MAXLEN = 100000);
    Property Embedding As %String(MAXLEN = 100000);
}

Note: In the current version of IRIS as of writing this article (IRIS 2023), the embedding data must be converted from the tensor object, a form of Python vector, to a string for storage since no current native datatypes are compatible with the Python vector. In the near future (expected IRIS 2024.1 as of writing this article), a native vector datatype is being added to IRIS that will allow storage of Python vectors directly within IRIS without the need for string conversion. 

Create a new ClassMethod that takes in a string, creates the embedding, converts the embedding to a string value, and then returns the resulting string. Use the method decorator as shown below to indicate to IRIS that this method will be a native Python method rather than ObjectScript.

ClassMethod GetTextEmbedding(text As %String) As %String [ Language = python ]
{
    from sentence_transformers import SentenceTransformer
    from pickle import dumps

    model = SentenceTransformer('all-MiniLM-L6-v2')
    embedding = model.encode(text, convert_to_tensor=True)
    embeddingList = embedding.tolist()
    pickledObj = dumps(embeddingList, 0)
    pickledStr = pickledObj.decode('utf-8') 
    return pickledStr
}

In the method above, the first part of the code is similar to what was done in the Python shell to create the embedding. Once the tensor object has been created from model.encode, the conversion to a string proceeds as:

  1. Convert the tensor object obtained from model.encode to a Python list object
  2. Convert the list object to a Python pickled object using the dumps function from the built-in pickle library
    1. Note: Make sure to include the 0 parameter as shown above as it forces the dumps function to use an older method of encoding that works properly with UTF-8 encoding
  3. Decode the pickled object to a UTF-8 encoded string
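The round trip above can be sketched with the standard library alone, using a plain list as a stand-in for the tensor's tolist() output:

```python
from pickle import dumps, loads

embedding_list = [0.042, -0.317, 0.128]  # stand-in for embedding.tolist()

# Protocol 0 produces ASCII output, so the bytes decode cleanly as UTF-8.
pickled_str = dumps(embedding_list, 0).decode('utf-8')

# The reverse trip (used later when comparing): string -> bytes -> list.
restored = loads(bytes(pickled_str, 'utf-8'))

print(restored == embedding_list)  # True
```

The newer binary pickle protocols are not valid UTF-8, which is why the 0 argument matters for storing the result in a %String property.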

Now, the tensor object has been fully converted to a string and can be returned to the calling function.

Next, create a ClassMethod that will use ObjectScript to bring in the text, set up the new object for database storage, call the GetTextEmbedding method, and then store the text and embedding to the database.

ClassMethod NewText(text As %String)
{
    set newTextObj = ..%New()
    set newTextObj.Text = text
    set newTextObj.Embedding = ..GetTextEmbedding(text)
    set status = newTextObj.%Save()
    write status
}

Now we have a way to send a text string into the class, have it create the embeddings, and store both to the database.

 

Comparing Text Using IRIS Embedded Python and Sentence Transformers

Moving on to the comparison process, I will again create a Python class method to handle converting the embedding strings back to tensors and running the comparison, along with an ObjectScript class method to retrieve the data from the database and call the Python similarity method.

First, the Python similarity method.

ClassMethod GetSimilarity(embed1 As %String, embed2 As %String) As %String [ Language = python ]
{
    from sentence_transformers import util
    from pickle import loads
    from torch import tensor

    embed1Bytes = bytes(embed1, 'utf-8')
    embed1List = loads(embed1Bytes)

    embed2Bytes = bytes(embed2, 'utf-8')
    embed2List = loads(embed2Bytes)

    tensor1 = tensor(embed1List)
    tensor2 = tensor(embed2List)

    cosineScores = util.cos_sim(tensor1, tensor2)

    return str(abs(cosineScores.tolist()[0][0]))
}

In the above method, each embedding is converted to a byte string using the bytes function (using UTF-8 again since that is what was used to decode them in the encoding conversion process), and then the loads function from the pickle library will take the byte string and convert it to a Python list object. Finally, the tensor function from the torch library will complete the conversion process by converting the Python list into a proper tensor object, ready to be used in the cosine similarity comparison.

Then, the cos_sim function from the util library of Sentence Transformers is called to compare the tensors and return the similarity score. This return, however, will also be formatted as a tensor, so it needs to be converted to a list and dereferenced to the first and only element in that list. Then, due to the geometric nature of the cosine similarity, take the absolute value and finally convert the resulting decimal value to a string to pass back to the calling function.
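The unwrapping step can be illustrated with a plain nested list standing in for the cos_sim result (the 0.6374 value here is hypothetical):

```python
# util.cos_sim on two single embeddings returns a 1x1 tensor;
# .tolist() turns it into a nested list shaped like this:
cosine_scores = [[0.6374]]            # hypothetical similarity score

score = abs(cosine_scores[0][0])      # first (and only) row and column
result = str(score)                   # string form to return to ObjectScript
print(result)                         # 0.6374
```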

From here, we can then create the ObjectScript function to handle the database operations to retrieve the specific text objects and call the similarity function.

ClassMethod Compare(id1 As %Integer, id2 As %Integer)
{
    set obj1 = ..%OpenId(id1)
    set text1 = obj1.Text
    set embedding1 = obj1.Embedding

    set obj2 = ..%OpenId(id2)
    set text2 = obj2.Text
    set embedding2 = obj2.Embedding

    set sim = ..GetSimilarity(embedding1, embedding2)

    write !,"Text 1:",!,text1,!
    write !,"Text 2:",!,text2,!
    write !,"Similarity: ",sim
}

This method will open each object using the provided row ID, get the text and embedding data, pass the embeddings to the GetSimilarity method to get the similarity score, and then write out each text and the similarity.

 

Testing

To test, I will use the synopsis of the movie Hot Fuzz from two different sources (IMDB and Wikipedia) to see how similar they are. If everything is working as expected, the similarity score should be sufficiently high.

First, I'll store each text object to the database using the NewText function:

Once stored, the data should look like this (ignore the IDs starting at 3; I had a bug in my initial code and had to redo the save).

Now that our text data has been stored along with its associated embedding, we can call the Compare function with IDs 3 and 4 to get the final comparison score:

As a result, we see that the IMDB and Wikipedia synopses for Hot Fuzz are about 75% similar, which I would consider accurate given how much more text there is in the IMDB version.

 

Closing

This is just one of the many functions available in Sentence Transformers, plus there are numerous other NLP toolkits out there that can carry out various functions relating to text and language analysis. The point of this article is to show that if you can do it in Python, you can do it in IRIS as well using Embedded Python.

4 Comments
Question
· Jan 11, 2024

$ZF question - Calling scripts and redirecting output from script

I am trying to write a ZMIRROR routine that makes a shell script call using $ZF

     Set cmd = "/usr/local/sbin/failover-intengtest-vip"
     Do $ZF(-100,"/ASYNC /SHELL",cmd)

The script I am calling returns output to the screen. How do I get around this using $ZF without having to rewrite the scripts?

Thanks

Scott

2 Comments
Article
· Jan 10, 2024 · 5 min read

Enjoy checking InterSystems IRIS performance with a useful tool ^mypButtons

[Background]

The InterSystems IRIS family has a nice utility, ^SystemPerformance (also known as ^pButtons in Caché and Ensemble), which outputs database performance information into a readable HTML file. When you run ^SystemPerformance on IRIS for Windows, an HTML file is created that includes both our own performance log, mgstat, and the Windows performance log.

^SystemPerformance generates a great report; however, you need to extract the log sections manually from the HTML file and paste them into a spreadsheet editor like Excel to create a performance graph. Many developers have already shared useful tips and utilities for doing this here (see this great Developer Community article by @Murray.Oldfield).

Now I introduce a new utility: ^mypButtons!

 

[What's new compared to other tools]

Download mypButtons.mac from OpenExchange.

  • ^mypButtons combines mgstat and Windows perfmon logs on one line. For instance, you can create a graph that includes both "PhyWrs" (mgstat) and "Disk Writes/sec" (Windows perfmon) in the same time frame.
  • ^mypButtons reads multiple HTML files at once and generates a single combined CSV file.
  • ^mypButtons generates a single CSV file on your laptop, so it's much easier to create your graph as you like.
  • The generated mypButtons.csv includes the columns I strongly recommend checking as the first step in assessing the performance of an InterSystems product, so everyone can easily enjoy a performance graph with this utility!

Please note! If you want to use mypButtons.csv, please load SystemPerformance HTML files generated with an "every 1 second" profile.

 

[How to run]

do readone^mypButtons("C:\temp\dir\myserver_IRIS_20230522_130000_8hours.html","^||naka")

This reads one SystemPerformance HTML file and stores the information in the given global. In this sample, it reads myserver_IRIS_20230522_130000_8hours.html and stores it in ^||naka.

do readdir^mypButtons("C:\temp\dir","^||naka")

This reads all of the SystemPerformance HTML files under the given folder and stores the information in the given global. In this sample, it reads all HTML files under C:\temp\dir and stores them in ^||naka.

do writecsv^mypButtons("C:\temp\csv","^||naka")

This generates the following three CSV files under the given folder from the given global.

  • mgstat.csv
  • perfmon.csv
  • mypButtons.csv

Here, mypButtons.csv includes the following columns by default, which I strongly recommend checking first to assess performance:

  • mgstat: Glorefs, PhyRds, Gloupds, PhyWrs, WDQsz, WDphase
  • perfmon: Available MBytes, Disk Reads/sec, Disk Writes/sec, % Processor Time
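Because the mgstat and perfmon values land on the same CSV row for each second, the combined file is easy to post-process with ordinary tools. Here is a small sketch using Python's csv module; the column subset and sample values are hypothetical, based on the defaults listed above:

```python
import csv
from io import StringIO

# Hypothetical rows in the mypButtons.csv layout described above.
sample = StringIO(
    "Time,Glorefs,PhyWrs,Disk Writes/sec\n"
    "13:00:01,102345,34,35.2\n"
    "13:00:02,98800,31,29.8\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    # mgstat ("Glorefs", "PhyWrs") and perfmon ("Disk Writes/sec")
    # values for the same second sit side by side on each row.
    print(row["Time"], row["Glorefs"], row["Disk Writes/sec"])
```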

This utility works for InterSystems IRIS, InterSystems IRIS for Health, Caché and Ensemble for Windows.

 

[Example steps to create your IRIS server's performance graph with ^mypButtons]

(1) First, run ^SystemPerformance to record both our own performance tool mgstat and the Windows performance monitor perfmon. By default, InterSystems IRIS comes with several profiles, so you can try it right away. Try this from the IRIS terminal.

%SYS>do ^SystemPerformance
Current log directory: c:\intersystems\iris\mgr\
Windows Perfmon data will be left in raw format.
Available profiles:
  1 12hours - 12-hour run sampling every 10 seconds
  2 24hours - 24-hour run sampling every 10 seconds
  3 30mins - 30-minute run sampling every 1 second
  4 4hours - 4-hour run sampling every 5 seconds
  5 8hours - 8-hour run sampling every 10 seconds
  6 test - 5-minute TEST run sampling every 30 seconds
select profile number to run: 3

Please note! If you want to use mypButtons.csv, please use an "every 1 second" profile. The default "30mins" profile shown above samples every 1 second. If you want to create other profiles, see our documentation for more details.

(2) After sampling, one HTML file will be generated under irisdir\mgr, with a name like JP7320NAKAHASH_IRIS_20231115_100708_30mins.html. Open the generated HTML file and you will see a lot of comma-separated performance data under the mgstat and perfmon sections.

 

(3) Load it with ^mypButtons as below.

USER> do readone^mypButtons("C:\InterSystems\IRIS\mgr\JP7320NAKAHASH_IRIS_20231115_100708_30mins.html","^||naka")

This will load the HTML file given in the first parameter and save the performance data into the global given in the second parameter.

(4) Generate the CSV with ^mypButtons as below.

USER> do writecsv^mypButtons("C:\temp","^||naka")

This will output three CSV files under the folder given in the first parameter from the global given in the second parameter. Open mypButtons.csv in Excel, and you can see that mgstat and perfmon are on the same line for every second. See this screenshot: yellow highlighted columns are mgstat and blue highlighted columns are perfmon.

 

(5) Let's create a simple graph from this CSV. It's easy: choose column B (Time) and column C (Glorefs), then select the Insert menu and a 2-D Line graph as below.

 

This graph will show you "Global References per second" information. Sorry, there was very little activity in my IRIS instance, so my sample graph is not very exciting, but I do believe this graph from a production server will tell you a lot of useful information!

 

(6) mypButtons.csv includes selected columns which I think you should check first. Murray's article series explains why these columns are important for assessing performance.

 

[Edit ^mypButtons for reporting columns]

If you want to change which columns are reported in mypButtons.csv, please modify the writecsv label manually; the reported columns are defined in that section of the code.

 

 

I hope my article and utility will encourage you to check the performance of InterSystems IRIS. Happy SystemPerformance 😆

2 Comments
Question
· Jan 6, 2024

ERROR #7602: how can I import an exported project?

Dear experts,

I'm trying to import a project exported from version '2012.5' into '2018.1.4', but it is returning ERROR #7602.
Please, could you help me? How can I do it?
Below are the steps that I did:

ACB>D $System.OBJ.Load("C:\GlobalPatch.xml")
 
Load started on 01/07/2024 00:06:51
Loading file C:\GlobalPatch.xml as xml
Imported global: ^GlobalPatch
Load finished successfully.
 
ACB>Set sc = ##class(%Studio.Project).InstallFromGbl("^GlobalPatch","fv")
ERROR #7602: Exported on version '2012.5' but this machine on version '2018.1.4' so unable to import

Thank you.

3 Comments
Announcement
· Jan 5, 2024

[Video] North West London Integrated Care System Go Live with Health Connect Cloud

Hi Developers,

Start watching our latest video on InterSystems Developers YouTube:

⏯ North West London Integrated Care System Go Live with Health Connect Cloud @ Global Summit 2023

Learn about HealthShare Health Connect Cloud (HCC), a new service from InterSystems, and how it helps customers move integration to the cloud, fully managed by InterSystems. You'll also learn how a large customer in the United Kingdom has gone live with Health Connect Cloud, the benefits it has realized, and the lessons it has learned.

Presenters:
🗣 Matt Kybert, Deputy CIO, North West London Integrated Care System, National Health Service
🗣 @Mark Massias, Senior Sales Engineer, InterSystems

Enjoy watching it and look out for more videos! 👍

0 Comments