Article · Sep 21, 2023 · 1 min read

Getting started with the Jupyter Server Proxy extension for VS Code

Earlier this year I announced the availability of a VS Code extension for coding in ObjectScript, Embedded Python, or SQL using the notebook paradigm popularized by Jupyter. Today I published a maintenance release that corrects a "getting started" problem.

Here's a video of the installation steps from the extension's README:

(video superseded by an update in a later comment)

Why not try it for yourself?

Article · Sep 21, 2023 · 7 min read

Step-by-step guide to creating a customized chatbot using spaCy (Python NLP library)


Hi Community,
In this article, I will demonstrate the steps below to create your own chatbot using spaCy (an open-source library for advanced natural language processing, written in Python and Cython):

  • Step 1: Install required libraries

  • Step 2: Create patterns and responses file

  • Step 3: Train the model

  • Step 4: Create the chatbot application based on the trained model

So let us start.

Step 1: Install required libraries

First of all, we need to install the required Python libraries, which we can do by running the commands below:

pip3 install spacy nltk
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
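
Note: nltk's word_tokenize (used in nltk_utils.py below) relies on the "punkt" tokenizer data, which is not downloaded automatically with the library. If tokenization later fails with a LookupError, download the data once (newer NLTK versions may ask for 'punkt_tab' instead):

import nltk
nltk.download('punkt')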

Step 2: Create patterns and responses file

We need to create an intents.json file containing question patterns and responses. Below is an example with a few patterns and responses:

{
  "intents": [
    {
      "tag": "greeting",
      "patterns": [
        "Hi",
        "Hey",
        "How are you",
        "Is anyone there?",
        "Hello",
        "Good day"
      ],
      "responses": [
        "Hey :-)",
        "Hello, thanks for visiting",
        "Hi there, what can I do for you?",
        "Hi there, how can I help?"
      ]
    },
    {
      "tag": "goodbye",
      "patterns": ["Bye", "See you later", "Goodbye"],
      "responses": [
        "See you later, thanks for visiting",
        "Have a nice day",
        "Bye! Come back again soon."
      ]
    },
    {
      "tag": "thanks",
      "patterns": ["Thanks", "Thank you", "That's helpful", "Thank's a lot!"],
      "responses": ["Happy to help!", "Any time!", "My pleasure"]
    },
    {
      "tag": "items",
      "patterns": [
        "tell me about this app",
        "What kinds of technology used?",
        "What do you have?"
      ],
      "responses": [
        "Write something about the app."
      ]
    },        
    {
      "tag": "funny",
      "patterns": [
        "Tell me a joke!",
        "Tell me something funny!",
        "Do you know a joke?"
      ],
      "responses": [
        "Why did the hipster burn his mouth? He drank the coffee before it was cool.",
        "What did the buffalo say when his son left for college? Bison."
      ]
    }
  ]
}

Step 3: Train the model

3.1 - Create a model.py file defining the NeuralNet class, which will be used to train the model:

#model.py file
import torch.nn as nn

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.l1 = nn.Linear(input_size, hidden_size) 
        self.l2 = nn.Linear(hidden_size, hidden_size) 
        self.l3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        out = self.l1(x)
        out = self.relu(out)
        out = self.l2(out)
        out = self.relu(out)
        out = self.l3(out)
        # no activation and no softmax at the end
        return out

3.2 - We need an nltk_utils.py file, which uses the "nltk" Python library (a platform for building Python programs that work with human language data, used here for statistical natural language processing (NLP)).

#nltk_utils.py
#nltk library used for building Python programs that work with human language data for applying in 
#statistical natural language processing (NLP)
import numpy as np
import nltk
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def tokenize(sentence):
    """
    split sentence into array of words/tokens
    a token can be a word or punctuation character, or number
    """
    return nltk.word_tokenize(sentence)


def stem(word):
    """
    stemming = find the root form of the word
    examples:
    words = ["organize", "organizes", "organizing"]
    words = [stem(w) for w in words]
    -> ["organ", "organ", "organ"]
    """
    return stemmer.stem(word.lower())


def bag_of_words(tokenized_sentence, words):
    """
    return bag of words array:
    1 for each known word that exists in the sentence, 0 otherwise
    example:
    sentence = ["hello", "how", "are", "you"]
    words = ["hi", "hello", "I", "you", "bye", "thank", "cool"]
    bog   = [  0 ,    1 ,    0 ,   1 ,    0 ,    0 ,      0]
    """
    # stem each word
    sentence_words = [stem(word) for word in tokenized_sentence]
    # initialize bag with 0 for each word
    bag = np.zeros(len(words), dtype=np.float32)
    for idx, w in enumerate(words):
        if w in sentence_words: 
            bag[idx] = 1
    return bag
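
To get a feel for these helpers, here is a short illustrative session (the outputs follow the docstring examples above):

>>> from nltk_utils import tokenize, stem, bag_of_words
>>> tokenize("Is anyone there?")
['Is', 'anyone', 'there', '?']
>>> stem("organizing")
'organ'
>>> bag_of_words(["hello", "how", "are", "you"], ["hi", "hello", "I", "you", "bye", "thank", "cool"])
array([0., 1., 0., 1., 0., 0., 0.], dtype=float32)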

3.3 - Now we need to train the model. We will use a train.py file to create our model:

#train.py
import numpy as np
import json,torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from nltk_utils import bag_of_words, tokenize, stem
from model import NeuralNet
 
with open('intents.json', 'r') as f:
    intents = json.load(f)

all_words = []
tags = []
xy = []
# loop through each sentence in our intents patterns
for intent in intents['intents']:
    tag = intent['tag']
    # add to tag list
    tags.append(tag)
    for pattern in intent['patterns']:
        # tokenize each word in the sentence
        w = tokenize(pattern)
        # add to our words list
        all_words.extend(w)
        # add to xy pair
        xy.append((w, tag))

# stem and lower each word
ignore_words = ['?', '.', '!']
all_words = [stem(w) for w in all_words if w not in ignore_words]
# remove duplicates and sort
all_words = sorted(set(all_words))
tags = sorted(set(tags))

print(len(xy), "patterns")
print(len(tags), "tags:", tags)
print(len(all_words), "unique stemmed words:", all_words)

# create training data
X_train = []
y_train = []
for (pattern_sentence, tag) in xy:
    # X: bag of words for each pattern_sentence
    bag = bag_of_words(pattern_sentence, all_words)
    X_train.append(bag)
    # y: PyTorch CrossEntropyLoss needs only class labels, not one-hot
    label = tags.index(tag)
    y_train.append(label)

X_train = np.array(X_train)
y_train = np.array(y_train)

# Hyper-parameters 
num_epochs = 1000
batch_size = 8
learning_rate = 0.001
input_size = len(X_train[0])
hidden_size = 8
output_size = len(tags)
print(input_size, output_size)

class ChatDataset(Dataset):
    def __init__(self):
        self.n_samples = len(X_train)
        self.x_data = X_train
        self.y_data = y_train

    # support indexing such that dataset[i] can be used to get i-th sample
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    # we can call len(dataset) to return the size
    def __len__(self):
        return self.n_samples

dataset = ChatDataset()
train_loader = DataLoader(dataset=dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=0)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = NeuralNet(input_size, hidden_size, output_size).to(device)
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Train the model
for epoch in range(num_epochs):
    for (words, labels) in train_loader:
        words = words.to(device)
        labels = labels.to(dtype=torch.long).to(device)        
        # Forward pass
        outputs = model(words)
        # if y would be one-hot, we must apply
        # labels = torch.max(labels, 1)[1]
        loss = criterion(outputs, labels)        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()        
    if (epoch+1) % 100 == 0:
        print (f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print(f'final loss: {loss.item():.4f}')
data = {
    "model_state": model.state_dict(),
    "input_size": input_size,
    "hidden_size": hidden_size,
    "output_size": output_size,
    "all_words": all_words,
    "tags": tags
}
FILE = "chatdata.pth"
torch.save(data, FILE)
print(f'training complete. file saved to {FILE}')

3.4 - Run the train.py file above to train the model.

Running train.py creates a chatdata.pth file, which will be used by our chat.py file.
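
To quickly sanity-check the checkpoint, you can load it and inspect its contents (a minimal sketch; the keys are the ones saved in the data dictionary above):

#inspect_checkpoint.py
import torch

data = torch.load("chatdata.pth")
print("tags:", data["tags"])
print("vocabulary size:", len(data["all_words"]))
print("sizes:", data["input_size"], data["hidden_size"], data["output_size"])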

Step 4: Create the chatbot application based on the trained model

Our model is now created. In the final step, we will create a chat.py file that we can use for our chatbot.

#chat.py
import os,random,json,torch
from model import NeuralNet
from nltk_utils import bag_of_words, tokenize

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with open("intents.json") as json_data:
    intents = json.load(json_data)

data_dir = os.path.join(os.path.dirname(__file__))
FILE = os.path.join(data_dir, 'chatdata.pth')
data = torch.load(FILE)

input_size = data["input_size"]
hidden_size = data["hidden_size"]
output_size = data["output_size"]
all_words = data['all_words']
tags = data['tags']
model_state = data["model_state"]

model = NeuralNet(input_size, hidden_size, output_size).to(device)
model.load_state_dict(model_state)
model.eval()

bot_name = "iris-NLP"
def get_response(msg):
    sentence = tokenize(msg)
    X = bag_of_words(sentence, all_words)
    X = X.reshape(1, X.shape[0])
    X = torch.from_numpy(X).to(device)

    output = model(X)
    _, predicted = torch.max(output, dim=1)

    tag = tags[predicted.item()]

    probs = torch.softmax(output, dim=1)
    prob = probs[0][predicted.item()]
    if prob.item() > 0.75:
        for intent in intents['intents']:
            if tag == intent["tag"]:
                return random.choice(intent['responses'])
    return "I do not understand..."
if __name__ == "__main__":
    print("Let's chat! (type 'quit' to exit)")
    while True:
        sentence = input("You: ")
        if sentence == "quit":
            break
        resp = get_response(sentence)
        print(resp)

Run the chat.py file and ask some of the questions already defined above in our intents.json file.
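
For example, a session might look like this (responses are chosen at random from intents.json, so yours may differ):

Let's chat! (type 'quit' to exit)
You: Hi
Hi there, how can I help?
You: Tell me a joke!
Why did the hipster burn his mouth? He drank the coffee before it was cool.
You: quit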


For more details, please visit the IRIS-GenLab Open Exchange application page.

Thanks

Article · Sep 21, 2023 · 1 min read

New application iris-size-django on OpenExchange

Why use it

This app offers an easy interface to analyze storage:

  • Filter by database (namespace), global name, used size, or allocated size;
  • View a sum of the used and allocated sizes for the filters applied;
  • Export the table to JSON, CSV, or XML.

 

How to use it

Follow the instructions in the README file of the GitHub repository, and configure the settings to connect to your instance.

 

How to adapt it to your needs

Since it was built with Python and Django, you can easily add specific analysis methods in api/methods.py and use the views in views.py to display them.

Also, if you want to save the analyzed data in a local instance while analyzing an external one, you can point settings.py at the local instance and use iris.connect("host:port/NAMESPACE", "username", "password") in api/methods.py to reach the external one.
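
For example, a minimal sketch of such an addition might look like the following (open_external_connection and storage_summary are hypothetical names for illustration; only the iris.connect call is taken from the paragraph above):

#api/methods.py -- hypothetical addition
import iris

def open_external_connection():
    # connection string style follows the paragraph above
    return iris.connect("host:port/NAMESPACE", "username", "password")

#views.py -- hypothetical addition
from django.http import JsonResponse
from api import methods

def storage_summary(request):
    # run a custom analysis against the external instance and return JSON
    conn = methods.open_external_connection()
    try:
        return JsonResponse({"status": "connected"})
    finally:
        conn.close()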

 

More information

You can check my article series A portal to manage storage made with Django for a complete walkthrough of the development process of this application.

Also, feel free to contact me!

InterSystems Official · Sep 21, 2023

Deprecation of InterSystems IRIS NLP, formerly known as iKnow

InterSystems has decided to stop further development of the InterSystems IRIS Natural Language Processing, formerly known as iKnow, technology and label it as deprecated as of the 2023.3 release of InterSystems IRIS. InterSystems will continue to support existing customers using the technology, but does not recommend starting new development projects outside of the core text exploration use cases it was originally designed for. Other use cases involving natural language are increasingly well-served using novel techniques based on Large Language Models, an area InterSystems is also investigating in the context of specific applications.

Customers with questions on their current or planned use of InterSystems IRIS NLP are invited to reach out to their account team, or get in touch with @Benjamin De Boe.

The open-source version of the core iKnow engine, packaged as a Python module, can be used independently of InterSystems IRIS and will continue to be available.

Please note the InterSystems IRIS SQL Search feature, also known as iFind, is only partially affected. Only the Semantic and Analytic index types make use of the iKnow engine and therefore are deprecated. All other functionality and index types are not affected by this decision, and continue to be the recommended choice for applications requiring a flexible and high-performance full text search capability.

Article · Sep 18, 2023 · 6 min read

Developer showcase: almost-implemented vector support

These days there is no end of news about large language models, artificial intelligence, and so on. Vector databases are one part of this, and several non-IRIS technologies already implement them.

Why vectors?

  • Similarity search: vectors enable efficient similarity search, such as finding the most similar items or documents in a dataset. Traditional relational databases are designed for exact-match search and are poorly suited to tasks like image or text similarity search.
  • Flexibility: vector representations are versatile and can be derived from many data types, such as text (via embeddings like Word2Vec or BERT) and images (via deep learning models).
  • Cross-modal search: vectors allow searching across different data modalities. For example, given the vector representation of an image, one can search a multi-modal database for similar images or related text.

And there are many other reasons.

So, for this Python contest, I decided to try implementing this kind of support. Unfortunately, I was not able to finish it in time; below I explain why.

Several important things have to be done to make it complete:

  • Accepting and storing vectorized data via SQL. A simple example (the 3 here is the number of dimensions; it is fixed per field, and every vector stored in that field must have exactly that many dimensions):

    create table items(embedding vector(3));
    insert into items (embedding) values ('[1,2,3]');
    insert into items (embedding) values ('[4,5,6]');

  • Similarity functions. There are different similarity algorithms, suitable for simple searches over small amounts of data without using an index:

    -- Euclidean distance
    select embedding, vector.l2_distance(embedding, '[9,8,7]') distance from items order by distance;
    -- Cosine similarity
    select embedding, vector.cosine_distance(embedding, '[9,8,7]') distance from items order by distance;
    -- Inner product
    select embedding, -vector.inner_product(embedding, '[9,8,7]') distance from items order by distance;

  • Custom indexes, which help search large amounts of data faster. An index can use different algorithms and different distance functions than the ones above, along with some other options:
    • HNSW
    • Inverted file index (IVF)
  • Searching, which will use the created index, whose algorithm will find the requested information.

Inserting vectors

A vector should be an array of numbers, integer or float, signed or unsigned. In IRIS we can store it as $listbuild, which is a good representation and is already supported; only the conversion from ODBC to logical mode needs to be implemented.

Values can then be inserted in plain-text form using external drivers (e.g., ODBC/JDBC) or from inside IRIS using ObjectScript:

  • Plain SQL:

    insert into items (embedding) values ('[1,2,3]');

  • From ObjectScript:

    set rs = ##class(%SQL.Statement).%ExecDirect(, "insert into test.items (embedding) values ('[1,2,3]')")
    set rs = ##class(%SQL.Statement).%ExecDirect(, "insert into test.items (embedding) values (?)", $listbuild(2, 3, 4))

  • Or embedded SQL:

    &sql(insert into test.items (embedding) values ('[1,2,3]'))
    set val = $listbuild(2, 3, 4)
    &sql(insert into test.items (embedding) values (:val))

It is always stored as $lb() and returned in text format over ODBC.

 
Unexpected behavior

Calculations

We need vectors to support calculating the distance between two vectors.
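
For reference, the distance measures involved are easy to express over plain Python lists (a minimal sketch, independent of IRIS and $lb):

import math

def l2_distance(v1, v2):
    # Euclidean distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def inner_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def cosine_distance(v1, v2):
    # 1 - cosine similarity
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1 - inner_product(v1, v2) / (norm1 * norm2)

print(l2_distance([4, 5, 6], [9, 8, 7]))      # ~5.9161
print(cosine_distance([4, 5, 6], [9, 8, 7]))  # ~0.034537
print(inner_product([4, 5, 6], [9, 8, 7]))    # 118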

For this contest I needed to use Embedded Python, which raised the question of how to work with $lb from Embedded Python. There is a ToList method in %SYS.Python, but the iris Python package does not have it built in, so it has to be called the ObjectScript way:

ClassMethod l2DistancePy(v1 As dc.vector.type, v2 As dc.vector.type) As %Decimal(SCALE=10) [ Language = python, SqlName = l2_distance_py, SqlProc ]
{
    import iris
    import math
    vector_type = iris.cls('dc.vector.type')
    v1 = iris.cls('%SYS.Python').ToList(vector_type.Normalize(v1))
    v2 = iris.cls('%SYS.Python').ToList(vector_type.Normalize(v2))
    return math.sqrt(sum([(val1 - val2) ** 2 for val1, val2 in zip(v1, v2)]))
}

This does not look right at all. I would like $lb to be interpreted as a list in Python on the fly, or at least to have built-in to_list and from_list functions.

Another problem appeared when I tried to test this functionality in a different way: calling SQL from Embedded Python, where the SQL function is itself written in Embedded Python, crashes. Because of that, I also had to add ObjectScript implementations.

ModuleNotFoundError: No module named 'dc'
SQL Function VECTOR.NORM_PY failed with error:  SQLCODE=-400,%msg=ERROR #5002: ObjectScript error: <OBJECT DISPATCH>%0AmBm3l0tudf^%sqlcq.USER.cls37.1 *python object not found

Functions for calculating distances are currently implemented in both Python and ObjectScript:

  • Euclidean distance

    [SQL]_system@localhost:USER> select embedding, vector.l2_distance_py(embedding, '[9,8,7]') distance from items order by distance;
    +-----------+----------------------+
    | embedding | distance             |
    +-----------+----------------------+
    | [4,5,6]   | 5.91607978309961613  |
    | [1,2,3]   | 10.77032961426900748 |
    +-----------+----------------------+
    2 rows in set
    Time: 0.011s

    [SQL]_system@localhost:USER> select embedding, vector.l2_distance(embedding, '[9,8,7]') distance from items order by distance;
    +-----------+----------------------+
    | embedding | distance             |
    +-----------+----------------------+
    | [4,5,6]   | 5.916079783099616045 |
    | [1,2,3]   | 10.77032961426900807 |
    +-----------+----------------------+
    2 rows in set
    Time: 0.012s

  • Cosine similarity

    [SQL]_system@localhost:USER> select embedding, vector.cosine_distance(embedding, '[9,8,7]') distance from items order by distance;
    +-----------+---------------------+
    | embedding | distance            |
    +-----------+---------------------+
    | [4,5,6]   | .034536677566264152 |
    | [1,2,3]   | .11734101007866331  |
    +-----------+---------------------+
    2 rows in set
    Time: 0.034s

    [SQL]_system@localhost:USER> select embedding, vector.cosine_distance_py(embedding, '[9,8,7]') distance from items order by distance;
    +-----------+-----------------------+
    | embedding | distance              |
    +-----------+-----------------------+
    | [4,5,6]   | .03453667756626421781 |
    | [1,2,3]   | .1173410100786632659  |
    +-----------+-----------------------+
    2 rows in set
    Time: 0.025s

  • Inner product

    [SQL]_system@localhost:USER> select embedding, vector.inner_product_py(embedding, '[9,8,7]') distance from items order by distance;
    +-----------+----------+
    | embedding | distance |
    +-----------+----------+
    | [1,2,3]   | 46       |
    | [4,5,6]   | 118      |
    +-----------+----------+
    2 rows in set
    Time: 0.035s

    [SQL]_system@localhost:USER> select embedding, vector.inner_product(embedding, '[9,8,7]') distance from items order by distance;
    +-----------+----------+
    | embedding | distance |
    +-----------+----------+
    | [1,2,3]   | 46       |
    | [4,5,6]   | 118      |
    +-----------+----------+
    2 rows in set
    Time: 0.032s

Mathematical functions are implemented as well: add, sub, div, and mul. InterSystems supports creating your own aggregate functions, so it is possible to compute the sum or average of all vectors. Unfortunately, InterSystems does not allow reusing the built-in aggregate names; the function has to be invoked with its own name (and schema). And aggregate functions do not support non-numeric results.

A simple vector_add function returns the sum of two vectors.

When used as an aggregate, it displays 0, whereas a vector would be expected there as well.
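
For clarity, the elementwise addition that vector_add performs is simply this (a sketch over plain Python lists):

def vector_add(v1, v2):
    # elementwise sum of two equal-length vectors
    return [a + b for a, b in zip(v1, v2)]

print(vector_add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9]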

Building the index

Unfortunately, I did not manage to finish this part because of several obstacles I ran into during implementation:

  • The lack of a built-in conversion between $lb and a Python list (and back) when vectors in IRIS are stored as $lb; all of the index-building logic was expected to live in Python, so getting data out of $lb and setting it back into globals is essential.
  • Missing support for globals:
    • $Order in IRIS supports a direction, so it can be used in reverse, while the Embedded Python implementation of order does not have it, so you have to read all keys and reverse them, or store the last key somewhere.
  • Doubts caused by the bad experience, mentioned above, with SQL functions written in Python being called from Python.
  • While building the index I expected to store the distances between vectors in the graph, but I ran into a bug when saving floating-point numbers to a global.

During this work I found 11 Embedded Python issues, so most of the time was spent looking for workarounds. With the help of the iris-dollar-list project by @Guillaume Rongier, I managed to solve some of the problems.

Installation

In any case, it is still usable and can be installed with IPM, even if with limited functionality:

zpm "install vector"

Or in development mode with docker-compose:

git clone https://github.com/caretdev/iris-vector.git
cd iris-vector
docker-compose up -d