Online Courses & Materials
The Good Parts of AWS
Part 1
- Searching for the optimal option is almost always expensive.
- Instead of searching for the best option, we recommend a technique we call the default heuristic. The premise of this heuristic is that when the cost of acquiring new information is high and the consequence of deviating from a default choice is low, sticking with the default will likely be the optimal choice.
- A default choice is any option that gives you very high confidence that it will work.
- Your default choice doesn’t have to be the theoretical best choice. It doesn’t have to be the most efficient. Or the latest and greatest. Your default choice simply needs to be a reliable option to get you to your ultimate desired outcome. Your default choice should be very unlikely to fail you; you have to be confident that it’s a very safe bet. In fact, that’s the only requirement.
- You would only deviate from your defaults if you realize you absolutely have to.
1. S3
- If you’re storing data—whatever it is—S3 should be the very first thing to consider using.
- It is highly-durable, very easy to use and, for all practical purposes, it has infinite bandwidth and infinite storage space.
- It is also one of the few AWS services that requires absolutely zero capacity management.
- Fundamentally, you can think of S3 as a highly-durable hash table in the cloud. The key can be any string, and the value can be any blob of data up to 5 TB.
- S3 storage costs $23.55 / TB / month using the default storage class.
- If you’re touching S3 objects at a high frequency (millions of times a day), request pricing becomes an important aspect of S3’s viability for your use case.
- One limitation of S3 is that you cannot append to objects. If you have something that's changing rapidly (such as a log file), you have to buffer updates on your side for a while, and then periodically flush chunks to S3 as new objects.
- This buffering can reduce your overall data durability, because the buffered data will typically sit in a single place without any replication. A solution for this issue is to buffer data in a durable queue, such as SQS or Kinesis streams, as we’ll see later.
- Not good for hosting a static website.
- Bucket names are globally unique across all AWS customers and across all AWS regions. A common mitigation is to always add your AWS account ID to the bucket name, which makes conflicts much less likely.
- S3 has an API to check if you own the bucket, and this should always be done before interacting with an existing S3 bucket.
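As a minimal sketch (not from the book), the two points above about bucket ownership and non-appendable objects look roughly like this with boto3; the account ID, bucket name, and key are hypothetical placeholders:

```python
# Sketch: write a buffered chunk to S3 and verify bucket ownership first.
# Account ID, bucket name, and key are hypothetical.
import boto3

s3 = boto3.client("s3")

ACCOUNT_ID = "123456789012"               # hypothetical account id
BUCKET = f"my-app-logs-{ACCOUNT_ID}"      # account id in the name reduces collisions

# Fail fast if the bucket exists but belongs to someone else.
s3.head_bucket(Bucket=BUCKET, ExpectedBucketOwner=ACCOUNT_ID)

# S3 objects cannot be appended to, so buffered chunks are flushed as new keys.
s3.put_object(
    Bucket=BUCKET,
    Key="logs/2020/01/01/chunk-0001.json",
    Body=b'{"event": "example"}',
    ExpectedBucketOwner=ACCOUNT_ID,
)
```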
2. EC2
- EC2 allows you to get a complete computer in the cloud in a matter of seconds. The nice thing about EC2 is that the computer you get will be very similar to the computer you use to develop your software.
- If you can run your software on your computer, you can almost certainly run it on EC2 without any changes. This is one of EC2’s main advantages compared to other types of compute platforms (such as Lambda): you don’t have to adapt your application to your host.
- As of the time of writing, EC2 offers 256 different instance types, but they can be narrowed down to a few categories defined by what they're optimized for: CPU, memory, network, storage, etc., with different instance sizes for each category.
- One of the most compelling features of EC2 is that you only pay for the number of seconds your instance is running.
- However, EC2 also offers you the option to commit to a long period in exchange for a price reduction.
- There are several cost reduction plans for EC2.
3. EC2 Auto Scaling
- Amazon will tell you that Auto Scaling allows you to automatically add or remove EC2 instances based on the fluctuating demands of your application.
- As a thought experiment, consider if your EC2 bill were to go down by 30% — would that be a big deal for your business? If not, the effort and complexity of getting Auto Scaling working properly are probably not going to be worth it.
- The other thing to consider is: does your EC2 demand vary enough for Auto Scaling to even matter? If the fluctuations are not significant, or they are too abrupt, or they are not very smooth, Auto Scaling will almost certainly not work well for you.
- Having said all that, you should still almost always use Auto Scaling if you’re using EC2! Even if you only have one instance.
- The other nice thing that comes with Auto Scaling is the ability to simply add or remove instances just by updating the desired capacity setting.
4. Lambda
- If EC2 is a complete computer in the cloud, Lambda is a code runner in the cloud.
- With EC2 you get an operating system, a file system, access to the server’s hardware, etc.
- But with Lambda, you just upload some code and Amazon runs it for you. The beauty of Lambda is that it’s the simplest way to run code in the cloud. It abstracts away everything except for a function interface, which you get to fill in with the code you want to run.
- We think Lambda is great — definitely one of the good parts of AWS — as long as you treat it as the simple code runner that it is.
- A problem we often see is that people sometimes mistake Lambda for a general-purpose application host. Unlike EC2, it is very hard to run a sophisticated piece of software on Lambda without making some very drastic changes to your application and accepting some significant new limitations from the platform.
- Lambda is most suitable for small snippets of code that rarely change. We like to think of Lambda functions as part of the infrastructure rather than part of the application. In fact, one of our favorite uses for Lambda is to treat it as a plugin system for other AWS services.
- For example, S3 doesn’t come with an API to resize an image after uploading it to a bucket, but with Lambda, you can add that capability to S3.
- Application load balancers come with an API to respond with a fixed response for a given route, but they can’t respond with an image. Lambda lets you make your load balancer do that.
- CloudFront can’t rewrite a request URL based on request cookies (which is useful for A/B testing), but with Lambda, you can make CloudFront do that with just a little bit of code.
- CloudWatch doesn’t support regex-based alerting on application logs, but you can add that feature with a few lines of Lambda code.
- Kinesis doesn’t come with an API to filter records and write them to DynamoDB, but this is very easy to do with Lambda.
- CloudFormation’s native modeling language has many limitations and, for example, it can’t create and validate a new TLS certificate from the AWS Certificate Manager. Using Lambda, you can extend the CloudFormation language to add (almost) any capability you want.
- Lambda is a great way to extend existing AWS features (a minimal handler sketch follows this list).
- Treating Lambda as a general-purpose host for your applications is risky. It might look compelling at first — no servers to manage, no operating system to worry about, and no costs when unused — but Lambda's limitations are insidious hidden risks that typically reveal themselves once your application evolves into something bigger.
- Some limitations will likely improve or go away over time.
- For example, a very annoying issue is the cold start when a function is invoked after a period of inactivity or when Lambda decides to start running your function on new backend workers.
- Another problem is the limit of 250 MB for your code bundle, including all your dependencies.
- And the network bandwidth from Lambda functions seems to be very limited and unpredictable.
- But then there are other limitations that are inherent to the way Lambda works and which are less likely to go away.
- For example, you have to assume that every Lambda invocation is stateless. If you need to access some state, you have to use something like S3 or DynamoDB.
- While this works fine for a demo, it can quickly become prohibitively expensive in the real world. For example, handling a WebSocket connection on Lambda will likely require a read and write to DynamoDB for every exchanged packet, which can quickly result in a spectacularly large DynamoDB bill, even with modest activity.
- Our rule of thumb is simple: If you have a small piece of code that will rarely need to be changed and that needs to run in response to something that happens in your AWS account, then Lambda is a very good default choice.
- Lambda is certainly not a substitute for EC2.
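Following up on the S3 image-resize example in the list above, here is a minimal sketch of a Python Lambda handler triggered by S3 object-created events. The destination bucket and the resize_image() helper are hypothetical placeholders, not something from the book:

```python
# Minimal sketch of a Lambda handler reacting to S3 "object created" events.
# resize_image() and the destination bucket are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def resize_image(data: bytes) -> bytes:
    # Placeholder: in a real function you would use e.g. Pillow here.
    return data

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        thumbnail = resize_image(original)

        s3.put_object(
            Bucket=f"{bucket}-thumbnails",   # hypothetical destination bucket
            Key=key,
            Body=thumbnail,
        )
```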
5. CloudFormation
- When using AWS, you almost always want to use some CloudFormation (or a similar tool). It lets you create and update the things you have in AWS without having to click around on the console or write fragile scripts.
- It takes a while to get the hang of it, but the time savings pay off the initial investment almost immediately. Even for development, the ability to tear down everything cleanly and recreate your AWS setup in one click is extremely valuable.
- With CloudFormation, you define your AWS resources as a YAML script (or JSON, but we find YAML to be much easier to read and modify).
- Then you point CloudFormation to your AWS account, and it creates all the resources you defined.
- Our rule of thumb is to let CloudFormation deal with all the AWS things that are either static or change very rarely; things such as VPC configurations, security groups, load balancers, deployment pipelines, and IAM roles.
- Other things, such as DynamoDB tables, Kinesis streams, Auto Scaling settings, and sometimes S3 buckets, are better managed elsewhere.
6. SQS
- SQS is a highly-durable queue in the cloud. You put messages on one end, and a consumer takes them out from the other side. The messages are consumed in almost first-in-first-out order, but the ordering is not strict.
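A small sketch of this produce/consume flow with boto3 (the queue URL and message body are placeholders):

```python
# Sketch: put a message on one end of an SQS queue and consume it from the other.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# Producer: put a message on one end.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"task": "resize", "key": "img.png"}')

# Consumer: take messages out from the other side, then delete them once processed.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    print("processing:", message["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```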
7. Kinesis
- You can think of a Kinesis stream as a highly-durable linked list in the cloud. The use cases for Kinesis are often similar to those of SQS — you would typically use either Kinesis or SQS when you want to enqueue records for asynchronous processing. The main difference between the two services is that SQS can only have one consumer, while Kinesis can have many.
AWS Certified Cloud Practitioner 2020
Objectives
- Cloud Concepts
- Security
- Technology
- Billing and Pricing
Main Services for a Cloud Practitioner
- Compute —> EC2, Lambda
- Storage —> Simple Storage Service (S3), Glacier
- Databases —> Relational Database Service (RDS), DynamoDB (Non Relational Databases)
- Migration & Transfer
- Network & Content Delivery —> VPC, Route53
- Security, Identity & Compliance
- AWS Cost Management
Why does choosing the right AWS Region matter?
- Data Sovereignty Laws
- Latency to end users
- AWS Services
Create A Billing Alarm- LAB —> CloudWatch
Set up an SNS (Simple Notification Service) topic for monitoring usage (for example, billing).
CloudWatch —> Alarms —> Billing —> Create alarm —> set the limit, for example, greater than $10 —> create a new topic, e.g., Amirhossein_Billing_Alarm with my email —> confirm via email —> add a name and description —> create alarm.
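The same alarm can also be created programmatically. Below is a hedged boto3 sketch; the SNS topic ARN, period, and threshold are placeholders, and billing metrics are only published in us-east-1:

```python
# Sketch: create the billing alarm from the lab with boto3 instead of the console.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="Amirhossein_Billing_Alarm",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                 # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=10.0,               # alarm when estimated charges exceed $10
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:Amirhossein_Billing_Alarm"],  # hypothetical topic ARN
)
```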
Let's Start To Cloud! Identity Access Management (IAM)- LAB
- IAM stands for Identity Access Management. It is global; you do not specify a region when dealing with IAM. When you create a user or group, it is created GLOBALLY.
- You can access the AWS platform in 3 ways: 1- Via the Console. 2- Programmatically (using the Command Line). 3- Using the Software Development Kit (SDK).
- Your root account is the email address you used to set up your AWS account. The root account always has full administrator access. You should not give these account credentials away to anyone. Instead create a user for each individual within your organization. You should always secure this root account using multi-factor authentication.
- A group is simply a place to store your users. Your users will inherit all permissions that the group has. Examples of groups might be developers, system administrators, human resources, finance, etc.
- To set the permissions of a group, you apply a policy to that group. Policies are JSON documents consisting of key-value pairs.
IAM Best Practices
- Root Account: Only use the root account to create your AWS account. Do not use it to log in.
- Users: One user should equal one real human being. Don't create phantom users.
- User/Groups/Policies: Always place users in groups, and then apply policies to the groups. This makes management easier.
- Password Policies: Have a strong password rotation policy.
- MFA: Always enable MFA wherever possible.
- Roles: Use roles to access various other AWS services.
- Access Keys: Use access keys for programmatic access to AWS.
- IAM Credential Report: Use IAM credential reports to audit the permissions of your users/accounts.
IAM Credential Reports
You can generate and download a credential report that lists all users in your account: IAM —> Credential report —> Download
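For reference, the same report can be fetched programmatically; a small boto3 sketch:

```python
# Sketch: generate and download the IAM credential report with boto3 instead of the console.
import time
import boto3

iam = boto3.client("iam")

# Report generation is asynchronous; poll until it is ready.
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)

report = iam.get_credential_report()
print(report["Content"].decode("utf-8"))   # CSV with one row per user
```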
S3 101
- S3 provides developers and IT teams with secure, durable, highly-scalable object storage. Amazon S3 is easy to use, with a simple web services interface to store and retrieve any amount of data from anywhere on the web.
- S3 is a safe place to store your files.
- It is Object-based storage —> i.e., allows you to upload files.
- Objects consist of the following: 1- Key (this is simply the name of the object) 2- Value (this is simply the data and is made up of a sequence of bytes) 3- Version ID (important for versioning) 4- Metadata (data about the data you are storing)
5- Subresources —> Access Control List, Torrent
- The data is spread across multiple devices and facilities.
- Files can be from 0 Bytes to 5TB.
- There is unlimited storage.
- Files are stored in Buckets.
- S3 is a universal namespace. That is, names must be unique globally.
- When you upload a file to S3, you will receive an HTTP 200 code if the upload was successful.
How does data consistency work for S3?
- Read after Write consistency for PUTS of new Objects. —> If you write a new file and read it immediately afterwards, you will be able to view that data.
- Eventual Consistency for overwrite PUTS and DELETES (can take some time to propagate) —> If you update an existing file or delete a file and read it immediately, you may get the older version, or you may not. Basically changes to objects can take a bit of time to propagate.
S3 has the following features:
- Tiered Storage Available
- Lifecycle Management
- Versioning
- Encryption
- Secure your data using Access Control Lists (work on individual files) and Bucket Policies (across entire bucket)
You are charged for S3 in the following ways:
- Storage
- Requests
- Storage Management Pricing
- Data Transfer Pricing
- Transfer Acceleration —> Amazon S3 Transfer Acceleration enables fast, easy, and secure transfer of files over long distances between your end users and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront's globally distributed edge locations. As the data arrives at an edge location, it is routed to Amazon S3 over an optimized network path.

- Cross Region Replication Pricing

Let's Create an S3 Bucket!- LAB
- Bucket names share a common name space. You cannot have the same bucket name as someone else.
- When you view your buckets you view them globally but you can have buckets in individual regions.
Restricting Bucket Access:
- Bucket Policies- Applies across the whole bucket.
- Object Policies - Applies to individual files.
- IAM Policies to Users & Groups - Applies to Users & Groups.
Let's Create an S3 Website! - LAB
- You can use bucket policies to make entire S3 buckets public.
- You can use S3 to host STATIC websites (such as .html). Websites that require database connections, such as WordPress, cannot be hosted on S3.
- S3 scales automatically to meet your demand. Many enterprises will put static websites on S3 when they think there is going to be a large number of requests (such as for a movie preview for example)
S3 versioning
- Stores all versions of an object —> Including all writes and even if you delete an object.
- Great backup tool
- Versioning cannot be disabled —> Once enabled, versioning cannot be disabled (only suspended)
- Integrates with lifecycle rules
- Versioning MFA capability —> Uses multi-factor authentication; can be used to provide an additional layer of security.
Packt- Hands-On Machine Learning Using Amazon SageMaker
Your First Machine Learning Model on SageMaker
- The Course Overview
- AWS Setup
- What Problem You Will Solve
- Train The Model on SageMaker
- Deploy The Model as a REST service on SageMaker
1- Setting up the AWS on the local machine:
- Sign in to the AWS Management Console as an IAM user and open the IAM console at https://console.aws.amazon.com/iam/
- Select Roles from the list on the left-hand side, and click on Create role.
- Then, select SageMaker.
- Click Next: Review on the following page.
- Type a name for the SageMaker role, e.g., PacktSageMaker, and click on Create role.
- Click on the created role and, then, click on Attach policy and search for AmazonEC2ContainerRegistryFullAccess. Attach the corresponding policy.
- Do the same to attach the AmazonS3FullAccess and IAMReadOnlyAccess policies, and end up with the following.
- Now, go to Users page by clicking on Users on the left-hand side.
- Click on Add user and type packt-sagemaker as username.
- Copy the ARN of that user.
- Then, go back to the page of the role you created and click on the Trust relationships tab.
- Click on Edit trust relationship and add the following
- You're almost there! Make sure that you have added the IAM user in your ~/.aws/credentials file. For example:
- And, finally, add the following in the ~/.aws/config file:
- Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
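To confirm that the profile and environment variables are picked up correctly, a quick sanity check with boto3 (assuming the profile is named packt-sagemaker) could look like this:

```python
# Quick sanity check that the packt-sagemaker profile and environment variables work.
import boto3

session = boto3.Session(profile_name="packt-sagemaker")  # assumes this profile name
sts = session.client("sts")
print(sts.get_caller_identity())   # prints the account, user ARN and user id
```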
2- Structure of the Project:
train.py
predictor.py
serve.py
wsgi.py
build_and_push.sh
Dockerfile
3- Train and deploy the model on SageMaker
Word2Vec
Papers
Efficient Estimation of Word Representations in Vector Space- T Mikolov 2013
- For all the following models, the training complexity is proportional to $O = E \times T \times Q$,
where $E$ is the number of training epochs, $T$ is the number of words in the training set, and $Q$ is defined further for each model architecture. A common choice is $E = 3$–$50$ and $T$ up to one billion. All models are trained using stochastic gradient descent and backpropagation.
2.1 Feedforward Neural Net Language Model (NNLM)
- The probabilistic feedforward neural network language model has been proposed in [Y. Bengio, A neural probabilistic language model].
- It consists of input, projection, hidden and output layers.
- At the input layer, the $N$ previous words are encoded using 1-of-$V$ coding, where $V$ is the size of the vocabulary.
- The input layer is then projected to a projection layer $P$ with dimensionality $N \times D$, using a shared projection matrix.
- The NNLM architecture becomes complex for the computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of $N = 10$, the size of the projection layer ($N \times D$) might be 500 to 2000, while the hidden layer size $H$ is typically 500 to 1000 units.
- Moreover, the hidden layer is used to compute a probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality $V$.
- The computational complexity per training example is $Q = N \times D + N \times D \times H + H \times V$, where the dominating term is $H \times V$.
- However, several practical solutions were proposed for avoiding it; either using hierarchical versions of the softmax [referenced] or avoiding normalized models completely by using models that are not normalized during training [referenced]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around $\log_2(V)$. Thus, most of the complexity is then caused by the term $N \times D \times H$.
- In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated.
- Without this trick, the computational complexity of the output layer is proportional to the size of our vocabulary, $V$.
- We can reduce it substantially, to about $\log_2(V)$ evaluations, by using the binary tree structure.
- In this binary tree, the leaves represent probabilities of words; more specifically, the leaf with index $i$ corresponds to the probability of word $w_i$ and has position $i$ in the output softmax vector.
- (to be continued)
2.2 Recurrent Neural Net Language Model (RNNLM)
- The recurrent neural network based language model has been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model $N$), and because, theoretically, RNNs can efficiently represent more complex patterns than shallow neural networks [referenced].
- The complexity per training example of the RNN model is $Q = H \times H + H \times V$, where the word representations have the same dimensionality $D$ as the hidden layer $H$.
- Again the hierarchical softmax is used.
(to be continued)
Distributed Representations of Words and Phrases and their Compositionality- T Mikolov 2013
- By subsampling of the frequent words, we obtain significant speed up and also learn more regular word representations.
- We also describe a simple alternative to the hierarchical softmax called negative sampling.
- The Skip-gram model architecture. The training objective is to learn word vector representations that are good at predicting the nearby words.

- Unlike most of the previously used neural network architectures for learning word vectors, training of the Skip-gram model (figure above) does not involve dense matrix multiplications. This makes the training extremely efficient: an optimized single-machine implementation can train on more than 100 billion words in one day.
- In this paper, we present several extensions of the original Skip-gram model. We show that subsampling of frequent words during training results in a significant speedup (around 2x - 10x), and improves the accuracy of the representations of less frequent words.
- In addition, we present a simplified variant of Noise Contrastive Estimation (NCE) [Michael U Gutmann- Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.] for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work.
The Skip-gram Model
- The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.
- More formally, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the Skip-gram model is to maximize the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$
Assuming the conditional independence of the context words given the center word, the probability of the whole context factorizes into the product of the individual $p(w_{t+j} \mid w_t)$ terms above, where $c$ is the size of the training context (which can be a function of the center word $w_t$).
- Larger $c$ results in more training examples and thus can lead to higher accuracy, at the expense of the training time.
- The basic Skip-gram formulation defines $p(w_O \mid w_I)$ using the softmax function:
$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}$$
- where $v_w$ and $v'_w$ are the "input" and "output" vector representations of $w$, and $W$ is the number of words in the vocabulary.
- This formulation is impractical because the cost of computing $\nabla \log p(w_O \mid w_I)$ is proportional to $W$, which is often large ($10^5$–$10^7$ terms).
2.1 Hierarchical Softmax
- A computationally efficient approximation of the full softmax is the hierarchical softmax.
- The main advantage is that instead of evaluating $W$ output nodes in the neural network to obtain the probability distribution, only about $\log_2(W)$ nodes need to be evaluated.
- The hierarchical softmax uses a binary tree representation of the output layer with the words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.
- More precisely, each word $w$ can be reached by an appropriate path from the root of the tree. Let $n(w, j)$ be the $j$-th node on the path from the root to $w$, and let $L(w)$ be the length of this path, so that $n(w, 1) = \text{root}$ and $n(w, L(w)) = w$. In addition, for any inner node $n$, let $\operatorname{ch}(n)$ be an arbitrary fixed child of $n$, and let $[\![x]\!]$ be $1$ if $x$ is true and $-1$ otherwise. Then the hierarchical softmax defines $p(w_O \mid w_I)$ as follows:
$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left([\![n(w, j+1) = \operatorname{ch}(n(w, j))]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I}\right)$$
where $\sigma(x) = 1 / (1 + e^{-x})$.
- In our work we use a binary Huffman tree, as it assigns short codes to the frequent words which results in fast training.
2.2 Negative Sampling
- An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE).
- NCE posits that a good model should be able to differentiate data from noise by means of logistic regression.
- We define Negative sampling (NEG) by the objective:
$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$
Intuitively, the positive sample's score has to increase while the negative samples' scores must decrease.
- This objective is used to replace every $\log p(w_O \mid w_I)$ term in the Skip-gram objective. Thus the task is to distinguish the target word $w_O$ from draws from the noise distribution $P_n(w)$ using logistic regression, where there are $k$ negative samples for each data sample.
- Our experiments indicate that values of $k$ in the range 5–20 are useful for small training datasets, while for large datasets $k$ can be as small as 2–5.
- The main difference between Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples.
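As a small illustration of the NEG objective above, here is a sketch of the per-example loss in PyTorch; the tensor names are mine, not from the paper:

```python
# Sketch of the negative-sampling (NEG) objective for one (center, context) pair.
# center_vec: v_{w_I}, context_vec: v'_{w_O}, negative_vecs: k rows of v'_{w_i}.
import torch
import torch.nn.functional as F

def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    # Positive term: log sigma(v'_{w_O}^T v_{w_I})
    pos = F.logsigmoid(torch.dot(context_vec, center_vec))
    # Negative term: sum_i log sigma(-v'_{w_i}^T v_{w_I}), with w_i drawn from P_n(w)
    neg = F.logsigmoid(-negative_vecs @ center_vec).sum()
    # NEG maximizes pos + neg, so the loss to minimize is its negation.
    return -(pos + neg)

# Example with random vectors, embedding size 100 and k = 5 negatives.
loss = neg_sampling_loss(torch.randn(100), torch.randn(100), torch.randn(5, 100))
```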
2.3 Subsampling of Frequent Words
- In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g.,“in”, “the”, and “a”). Such words usually provide less information value than the rare words.
- The vector representations of frequent words do not change significantly after training on several million examples.
- To counter the imbalance between the rare and frequent words, we used a simple subsampling approach: each word $w_i$ in the training set is discarded with probability computed by the formula
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold, typically around $10^{-5}$.
- We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than $t$ while preserving the ranking of the frequencies.
- Although this subsampling formula was chosen heuristically, we found it to work well in practice. It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words.
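A tiny Python sketch of this subsampling rule (the threshold follows the paper's typical choice; the example frequencies are made up):

```python
# Sketch of the subsampling rule: discard word w_i with probability 1 - sqrt(t / f(w_i)).
import random

def keep_word(word_freq: float, t: float = 1e-5) -> bool:
    """word_freq is the relative frequency f(w_i) of the word in the corpus."""
    discard_prob = max(0.0, 1.0 - (t / word_freq) ** 0.5)
    return random.random() > discard_prob

# A very frequent word (f = 0.05) is kept rarely; a rare word (f = 1e-6) is always kept.
print(keep_word(0.05), keep_word(1e-6))
```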
3 Empirical Results
- The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate.
Video Lecture

- Distributional similarity based representations: you can get a lot of value by representing a word utilizing its neighbors (its context). "You shall know a word by the company it keeps." (J.R. Firth, 1957)
- The idea of the skip-gram model: at each estimation step, you take one word as the center word and then try to predict the words in its context out to some window size. The model defines a probability distribution, namely the probability of a word appearing in the context given the center word, and we choose the vector representations of words so as to maximize that probability. There is just one probability distribution for a context word (referred to as the output) occurring close to the center word.

- All the parameters of this model are vector representations of words. There are actually two vector representations for each word: one used when the word is the center word and one used when it is a context word.

- Taking the derivative of $\log p(o \mid c)$ with respect to the center vector $v_c$ (applying the chain rule to the softmax) gives the final form
$$\frac{\partial}{\partial v_c} \log p(o \mid c) = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x$$
The first term is what we observed: the vector of the actual output context word that appeared. The second term has the form of an expectation: the probability of every possible word appearing in the context times that word's context vector. In some sense it is the expectation vector, the average over all possible context vectors weighted by their likelihood of occurrence.
- We are also going to calculate the partial derivatives with respect to the context vectors.
Implementation
Problem Statement
Given the sessions of historically viewed items of each user, compute a dense representation for each item.
Assumptions:
- Let $U$ be the set of users.
- Let $I$ be the set of items.
- Each item has a set of features, composed of numerical features and categorical features.
- For each user $u \in U$, we have the sequence of his/her historically viewed items $S_u = (i_1, i_2, \dots, i_{n_u})$, where $S_u$ is the sequence of historically viewed items of user $u$ and $n_u$ is the number of his/her viewed items.
Goal:
- Compute a dense representation of each item such that each item's dense vector carries enough information to be good at predicting the nearby items in users' sessions.
Regarding the categorical features, we can either embed them in an end-to-end manner with the item's representation or embed them separately.
CarNext Products Dense Representation Learning Proposal
The structure of the data that Mehdi shared with me during the DataChef's hiring assignment is as follows:
Assuming that these features correspond to a real-world scenario (CarNext), we are going to learn dense representations for product_ids in the following way:
- For each user, extract the sequence of product_ids
- Write a dataloader that has the following functionalities, given two arguments (context_window_size, num_negative_samples): 1- extract a context window including 'context_window_size' product_ids from all the sequences; 2- for any context window, extract 'num_negative_samples' random negative samples
- Write a model to embed product_ids into a low-dimensional dense space
- Implement and serve the model in SageMaker.
The model's architecture is as follows:

NOTE: instead of using only the 'product_id' feature for each product, we can use any number of other additional features, such as image data, numeric data, or other categorical data.
After the model's training has finished, we can use the 'product encoder' to embed products and use the output embedding in any downstream task as a representation of the given product.
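Below is a hedged sketch of the proposed model: skip-gram with negative sampling over integer-encoded product_ids in PyTorch. Class names, dimensions, and the dummy batch are illustrative assumptions, not the final implementation:

```python
# Sketch of the proposed product-embedding model: skip-gram with negative sampling
# over product_id sequences. Names and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductSkipGram(nn.Module):
    def __init__(self, num_products: int, embedding_dim: int = 64):
        super().__init__()
        self.center_emb = nn.Embedding(num_products, embedding_dim)   # the "product encoder"
        self.context_emb = nn.Embedding(num_products, embedding_dim)

    def forward(self, center_ids, context_ids, negative_ids):
        center = self.center_emb(center_ids)                # (B, d)
        pos = self.context_emb(context_ids)                 # (B, d)
        neg = self.context_emb(negative_ids)                # (B, k, d)

        pos_score = F.logsigmoid((center * pos).sum(-1))                                       # (B,)
        neg_score = F.logsigmoid(-torch.bmm(neg, center.unsqueeze(-1)).squeeze(-1)).sum(-1)    # (B,)
        return -(pos_score + neg_score).mean()

# One training step on a dummy batch (batch size 8, 5 negatives per positive).
model = ProductSkipGram(num_products=1000)
loss = model(torch.randint(0, 1000, (8,)),
             torch.randint(0, 1000, (8,)),
             torch.randint(0, 1000, (8, 5)))
loss.backward()
```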
Graph Convolutional Matrix Completion
Paper
Introduction
- We consider matrix completion for recommender systems from the point of view of link prediction on graphs. Interaction data such as movie ratings can be represented by a bipartite user-item graph with labeled edges denoting observed ratings.
- Content information can naturally be included in this framework in the form of node features. Predicting ratings then reduces to predicting labeled links in the bipartite user-item graph.
- We propose graph convolutional matrix completion (GC-MC): a graph-based auto-encoder framework for matrix completion, which builds on recent progress in deep learning on graphs.
- The auto-encoder produces latent features of user and item nodes through a form of message passing on the bipartite interaction graph. These latent user and item representations are used to reconstruct the rating links through a bi-linear decoder.
- The benefit of formulating matrix completion as a link prediction task on a bipartite graph becomes especially apparent when recommender graphs are accompanied by structured external information such as social networks. Combining such external information with interaction data can alleviate performance bottlenecks related to the cold start problem.
Matrix completion as link prediction in bipartite graphs
- Consider a rating matrix $M$ of shape $N_u \times N_v$, where $N_u$ is the number of users and $N_v$ is the number of items. Entries $M_{ij}$ of this matrix encode either an observed rating (user $i$ rated item $j$) from a set of discrete possible rating values, or the fact that the rating is unobserved (encoded by the value $0$). The task of matrix completion or recommendation can be seen as predicting the value of the unobserved entries in $M$.

- In an equivalent picture, matrix completion or recommendation can be cast as a link prediction problem on a bipartite user-item interaction graph.
- More precisely, the interaction data can be represented by an undirected graph $G = (\mathcal{W}, \mathcal{E}, \mathcal{R})$ with entities consisting of a collection of user nodes $u_i \in \mathcal{U}$ with $i \in \{1, \dots, N_u\}$, and item nodes $v_j \in \mathcal{V}$ with $j \in \{1, \dots, N_v\}$, such that $\mathcal{U} \cup \mathcal{V} = \mathcal{W}$. The edges $(u_i, r, v_j) \in \mathcal{E}$ carry labels that represent ordinal rating levels, such as $r \in \{1, \dots, R\} = \mathcal{R}$.
2.1 Graph auto-encoders
- Graph auto-encoders are comprised of: 1- a graph encoder model $Z = f(X, A)$, which takes as input an $N \times D$ feature matrix $X$ and a graph adjacency matrix $A$, and produces an $N \times E$ node embedding matrix $Z$; 2- a pairwise decoder model $\check{A} = g(Z)$, which takes pairs of node embeddings $(z_i, z_j)$ and predicts the respective entries $\check{A}_{ij}$ of the adjacency matrix. Note that $N$ denotes the number of nodes, $D$ the number of input features, and $E$ the embedding size.
- For bipartite recommender graphs $G = (\mathcal{W}, \mathcal{E}, \mathcal{R})$, we can reformulate the encoder as $[U, V] = f(X, M_1, \dots, M_R)$, where $M_r \in \{0, 1\}^{N_u \times N_v}$ is the adjacency matrix associated with rating type $r \in \mathcal{R}$, such that $M_r$ contains $1$'s for those elements for which the original rating matrix contains observed ratings with value $r$. $U$ and $V$ are now matrices of user and item embeddings with shape $N_u \times E$ and $N_v \times E$, respectively. A single user (item) embedding takes the form of a real-valued vector $u_i$ ($v_j$) for user $i$ (item $j$).
- Analogously, we can reformulate the decoder as $\check{M} = g(U, V)$, i.e. as a function acting on the user and item embeddings and returning a (reconstructed) rating matrix $\check{M}$ of shape $N_u \times N_v$. We can train this graph auto-encoder by minimizing the reconstruction error between the predicted ratings in $\check{M}$ and the observed ground-truth ratings in $M$. Examples of metrics for the reconstruction error are the root mean square error, or the cross-entropy when treating the rating levels as different classes.
2.2 Graph convolutional encoder
- In what follows, we propose a particular choice of encoder model that makes efficient use of weight sharing across locations in the graph and that assigns separate processing channels for each edge type (or rating type) $r \in \mathcal{R}$.
- In our case, we can assign a specific transformation for each rating level, resulting in edge-type specific messages $\mu_{j \to i, r}$ from items $j$ to users $i$ of the following form:
$$\mu_{j \to i, r} = \frac{1}{c_{ij}} W_r x_j \quad (1)$$
- Here, $c_{ij}$ is a normalization constant, which we choose to be either $|\mathcal{N}_i|$ (left normalization) or $\sqrt{|\mathcal{N}_i||\mathcal{N}_j|}$ (symmetric normalization), with $\mathcal{N}_i$ denoting the set of neighbors of node $i$. $W_r$ is an edge-type specific parameter matrix and $x_j$ is the (initial) feature vector of node $j$. Messages from users to items are processed in an analogous way. After the message passing step, we accumulate incoming messages at every node by summing over all neighbors under a specific edge type $r$, and by subsequently accumulating them into a single vector representation:
$$h_i = \sigma\!\left[\operatorname{accum}\!\left(\sum_{j \in \mathcal{N}_{i,1}} \mu_{j \to i, 1}, \dots, \sum_{j \in \mathcal{N}_{i,R}} \mu_{j \to i, R}\right)\right] \quad (2)$$
where $\operatorname{accum}(\cdot)$ denotes an accumulation operation, such as $\operatorname{stack}(\cdot)$, i.e. a concatenation of vectors (or matrices along their first dimension), or $\operatorname{sum}(\cdot)$, i.e. summation of all messages. $\sigma(\cdot)$ denotes an element-wise activation function such as $\operatorname{ReLU}(\cdot) = \max(0, \cdot)$.
- To arrive at the final embedding of user node $u_i$, we transform the intermediate output $h_i$ as follows:
$$u_i = \sigma(W h_i) \quad (3)$$
- The item embedding $v_j$ is calculated analogously with the same parameter matrix $W$. In the presence of user- and item-specific side information, we use separate parameter matrices for user and item embeddings. We will refer to (2) as a graph convolution layer and to (3) as a dense layer. Note that deeper models can be built by stacking several layers (in arbitrary combinations) with appropriate activation functions. In initial experiments, we found that stacking multiple convolutional layers did not improve performance, and a simple combination of a convolutional layer followed by a dense layer worked best.
- It is worth mentioning that the model demonstrated here is only one particular, yet relatively simple, choice of encoder, and other variations are potentially worth exploring. Instead of a simple linear message transformation, one could explore variations where the transformation is a neural network in itself. Instead of choosing a specific normalization constant for individual messages, such as done here, one could employ some form of attention mechanism, where the individual contribution of each message is learned and determined by the model.
2.3 Bilinear decoder
- For reconstructing links in the bipartite interaction graph, we consider a bilinear decoder and treat each rating level as a separate class. Indicating the reconstructed rating between user $i$ and item $j$ with $\check{M}_{ij}$, the decoder produces a probability distribution over possible rating levels through a bilinear operation followed by the application of a softmax function:
$$p(\check{M}_{ij} = r) = \frac{e^{u_i^{\top} Q_r v_j}}{\sum_{s \in \mathcal{R}} e^{u_i^{\top} Q_s v_j}} \quad (4)$$
with $Q_r$ a trainable parameter matrix of shape $E \times E$, and $E$ the dimensionality of the hidden user (item) representations $u_i$ ($v_j$). The predicted rating is computed as:
$$\check{M}_{ij} = g(u_i, v_j) = \mathbb{E}_{p(\check{M}_{ij} = r)}[r] = \sum_{r \in \mathcal{R}} r\, p(\check{M}_{ij} = r) \quad (5)$$
2.4 Model training
Loss function: During model training, we minimize the following negative log likelihood of the predicted ratings $\check{M}_{ij}$:
$$\mathcal{L} = -\sum_{i,j;\ \Omega_{ij} = 1}\ \sum_{r=1}^{R} I[r = M_{ij}] \log p(\check{M}_{ij} = r) \quad (6)$$
with $I[k = l] = 1$ when $k = l$ and zero otherwise. The matrix $\Omega \in \{0, 1\}^{N_u \times N_v}$ serves as a mask for unobserved ratings, such that ones occur for elements corresponding to observed ratings in $M$, and zeros for unobserved ratings. Hence, we only optimize over observed ratings.
Node dropout: In order for the model to generalize well to unobserved ratings, it is trained in a denoising setup by randomly dropping out all outgoing messages of a particular node with a probability $p_{\text{dropout}}$, which we will refer to as node dropout. Messages are rescaled after dropout. In initial experiments we found that node dropout was more efficient at regularizing than message dropout. In the latter case, individual outgoing messages are dropped out independently, making embeddings more robust against the presence or absence of single edges. In contrast, node dropout also causes embeddings to be more independent of particular user or item influences. We furthermore also apply regular dropout to the hidden layer units of Eq. (3).
2.5 Vectorized implementation
- In practice, we can use efficient sparse matrix multiplications, with complexity linear in the number of edges, i.e. $\mathcal{O}(|\mathcal{E}|)$, to implement the graph auto-encoder model. The graph convolutional encoder (Eq. 2-3), for example in the case of left normalization, can be vectorized by stacking the user and item feature matrices and multiplying them with the per-rating-type adjacency matrices (Eq. 7-8 in the paper).
- The summation in Eq. (8) can be replaced with concatenation, similar to Eq. (2). In this case $D$ denotes the diagonal node degree matrix with nonzero elements $D_{ii} = |\mathcal{N}_i|$. Vectorization for an encoder with symmetric normalization, as well as vectorization of the bilinear decoder, follows in an analogous manner. Note that it is only necessary to evaluate observed elements in $\check{M}$, given by the mask $\Omega$ in Eq. (6).
2.6 Input feature representation and side information
Implementation
Please read the GCMC method to fully understand the concepts, notations, and implementations 🙏🏻
Introduction
This project is an implementation of the Graph Convolutional Matrix Completion (GCMC) model as a recommender system for the CarNext company. The detailed mechanism of the model is described in the Graph Convolutional Matrix Completion section of the current page.
As a high-level description of the project, this project implements a collaborative filtering based recommender system utilizing graph neural networks, to be more precise, spatial graph convolutional networks that can be interpreted as a message-passing neural network.
The dataset of this project is provided by CarNext and it contains the interactions (bidding action) of users (traders) with items (cars).
This project is implemented using Pytorch and Pytorch Geometric deep learning libraries. The model is trained and served in the AWS ecosystem utilizing services such as S3, SageMaker, and ECR.
The folder structure of the project is as follows:
In the following subsections, each directory is going to be explained and the functionality of each code block is going to be elaborated.
/src
This directory contains all the code for data loading, model implementation, the training procedure, the inference procedure, and serving the model as an endpoint.
dataset.py
A generalization of collaborative filtering is to view the task as a matrix completion problem. In a collaborative filtering based recommender system, the system is provided with a rating matrix. The rating matrix is a $|U| \times |I|$ matrix, where $U$ is the set of users and $I$ is the set of items, and its entries are observed ratings that a certain user has assigned to a specific item.
The system has to utilize the collaborative information of users to infer unobserved ratings of the given matrix. Hence, one can view this task as a (rating) matrix completion problem.
The rating matrix can be further represented by a bipartite graph. A bipartite graph is a graph whose vertices can be divided into two disjoint and independent sets $U$ (here the set of users) and $V$ (here the set of items), such that every edge connects a vertex in $U$ to one in $V$. For instance, if user $u_i$ has assigned a rating score $r$ to item $v_j$, then $M_{ij} = r$ and, in the bipartite graph representation, there is an edge with label $r$ connecting the nodes $u_i$ and $v_j$. An example of a bipartite graph is illustrated in Figure 1:

The input data to the GCMC model is a bipartite graph. The dataset.py file implements a class to generate the required data structures.
In the CarNext RecSys project, for each user (trader) the ratings (number of bids) assigned to each item (car) are transformed into three classes: *, **, and ***. For a given user, the first class (*) contains the items that the user rarely interacts with. The second class (**) contains the items that the user interacts with more frequently, and finally, the third class (***) contains the items that the user interacts with most frequently.
For a given user, the items are sorted in descending order by their number of bids, and each class is constructed so that it contains the items that contribute a fixed share of the given user's total number of bids.
Therefore, there are three bipartite graphs for this project, each representing the information of its corresponding class. The adjacency matrices of these graphs are constructed in advance, and they are stored in NumPy arrays S1, S2, and S3.
Additionally, there are word2vec embeddings of items that were learned (in another project) from the sequences of user-item interactions (all_sentences.pkl) and are stored in the NumPy array X. These embeddings are used as the side information of items.
Other data files are enc_df.csv, all_features_concat.pkl, and all_mappings.pkl, which contain the originally encoded dataframe, the item encodings, and a mapping from item encodings to the items' original raw information, respectively.
The code snippet below is dataset.py. It implements the UserItemBipartiteGraphDataset class, which reads and processes the input data and outputs the following:
- w2v_item_embeddings: A torch tensor of word2vec item embeddings (one row per item), which is used as the side information of item nodes.
- edge_index: edge indices of adjacency matrices of the three classes of bipartite graphs.
- edge_type: class type of each edge.
- x_identity: an identity matrix whose size equals the total number of nodes in the graph (#users + #items), which is used as the initial feature matrix of the nodes of the graph.
- users_norm_r_list: A list containing three tensors that are used for left normalization of the messages that are passed from item nodes to user nodes.
- items_norm_r_list: A list containing three tensors that are used for left normalization of the messages that are passed from user nodes to item nodes.
- users_one_hot_degree: A tensor containing one-hot user degrees of each bipartite graph, which is used as the side information of user nodes.
- train_val_test_indices: A dictionary with train, validation and test masks.
models.py
This file contains the implementation of the GCMC model. It implements three modules, namely GraphConvolutionalLayer, DenseLayer, and BilinearDecoder that are utilized to construct the GCMC model.
GraphConvolutionalLayer:
This module inherits from the PyTorch Geometric MessagePassing class and is responsible for performing neural message passing. Message-passing neural networks (MPNNs) generalize the convolution operator to irregular domains and are typically expressed as a neighborhood aggregation or message passing scheme of the form:
$$x_i^{(k)} = \gamma^{(k)}\!\left(x_i^{(k-1)},\ \square_{j \in \mathcal{N}(i)}\ \phi^{(k)}\!\left(x_i^{(k-1)}, x_j^{(k-1)}, e_{j,i}\right)\right)$$
where $x_i^{(k-1)}$ denotes the features of node $i$ in layer $(k-1)$ and $e_{j,i}$ denotes (optional) edge features from node $j$ to node $i$. $\square$ denotes a differentiable, permutation-invariant function, e.g., sum, mean or max, and $\gamma$ and $\phi$ denote differentiable functions such as MLPs (Multi-Layer Perceptrons).
PyTorch Geometric provides the MessagePassing base class, which helps in creating such message passing graph neural networks by automatically taking care of message propagation. The user only has to define the functions $\phi$, i.e. message(), and $\gamma$, i.e. update(), as well as the aggregation scheme to use, i.e. aggr="add", aggr="mean" or aggr="max".
The GraphConvolutionalLayer module has a parameter matrix self.T_s, which is used to calculate the weight matrices of the graph convolution based on the concept of ordinal weight sharing introduced in the original paper, Eq. (11):
$$W_r = \sum_{s=1}^{r} T_s \quad (11)$$
As an example, the messages that are sent from items to users (which are used to calculate the user embeddings) are implemented following Eq. (1):
$$\mu_{j \to i, r} = \frac{1}{c_{ij}} W_r x_j \quad (1)$$
with its corresponding message() function.
For more information on the mechanics of PyTorch Geometric, please refer to its docs.
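To make the mechanics concrete, here is a simplified sketch of an edge-type-specific convolution written with PyTorch Geometric's MessagePassing. It is not the project's actual GraphConvolutionalLayer: the ordinal weight sharing via self.T_s and the per-rating-type accumulation are omitted, and all names and sizes are illustrative:

```python
# Simplified sketch of an edge-type-specific graph convolution with PyG's
# MessagePassing (left normalization, sum aggregation). Illustrative only.
import torch
import torch.nn as nn
from torch_geometric.nn import MessagePassing

class SimpleGCMCConv(MessagePassing):
    def __init__(self, in_dim: int, out_dim: int, num_relations: int):
        super().__init__(aggr="add")   # sum over incoming messages
        self.weights = nn.Parameter(torch.randn(num_relations, in_dim, out_dim) * 0.01)

    def forward(self, x, edge_index, edge_type, norm):
        # x: (N, in_dim), edge_index: (2, E), edge_type: (E,), norm: (E,) = 1 / c_ij
        return self.propagate(edge_index, x=x, edge_type=edge_type, norm=norm)

    def message(self, x_j, edge_type, norm):
        # mu_{j -> i, r} = (1 / c_ij) * W_r x_j   (Eq. 1)
        W_r = self.weights[edge_type]                       # (E, in_dim, out_dim)
        msg = torch.bmm(x_j.unsqueeze(1), W_r).squeeze(1)   # (E, out_dim)
        return norm.view(-1, 1) * msg

# Tiny usage example: 6 nodes, 4 edges, 3 rating types.
x = torch.randn(6, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [4, 5, 4, 5]])
edge_type = torch.tensor([0, 1, 2, 0])
norm = torch.ones(4)
out = SimpleGCMCConv(8, 16, num_relations=3)(x, edge_index, edge_type, norm)  # (6, 16)
```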
DenseLayer:
After the message-passing step, the resulting embeddings are mixed with the side-information features of users and items through the DenseLayer to produce the final embeddings of users and items. The computations of this step are based on Eq. (10) of the paper:
$$u_i = \sigma\!\left(W h_i + W_2^f f_i\right), \qquad f_i = \sigma\!\left(W_1^f x_i^f + b\right)$$
where $W_1^f$ and $W_2^f$ are trainable weight matrices and $b$ is a bias vector. The weight matrices and bias vector are different for users and items.
BilinearDecoder
For reconstructing links in the bipartite interaction graph, a bilinear decoder is used, and each rating level is treated as a separate class. Therefore the final embeddings of users and items are fed to the BilinearDecoder module to produce the predicted scores for each rating class.
Indicating the reconstructed rating between user $i$ and item $j$ with $\check{M}_{ij}$, the decoder produces a probability distribution over possible rating levels through a bilinear operation followed by the application of a softmax function:
$$p(\check{M}_{ij} = r) = \frac{e^{u_i^{\top} Q_r v_j}}{\sum_{s \in \mathcal{R}} e^{u_i^{\top} Q_s v_j}}$$
with $Q_r$ a trainable parameter matrix of shape $E \times E$, and $E$ the dimensionality of the hidden user (item) representations.
As an effective means of regularization of the pairwise bilinear decoder, weight sharing is applied in the form of a linear combination of a set of basis weight matrices $P_s$:
$$Q_r = \sum_{s=1}^{n_b} a_{rs} P_s$$
with $s \in \{1, \dots, n_b\}$ and $n_b$ the number of basis weight matrices. Here, the $a_{rs}$ are the learnable coefficients that determine the linear combination for each decoder weight matrix $Q_r$.
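A hedged sketch of such a decoder in PyTorch (illustrative only; the project's BilinearDecoder may differ in its details):

```python
# Sketch of a bilinear decoder with basis weight sharing (Q_r = sum_s a_rs P_s).
import torch
import torch.nn as nn

class SimpleBilinearDecoder(nn.Module):
    def __init__(self, emb_dim: int, num_classes: int, num_basis: int = 2):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_basis, emb_dim, emb_dim) * 0.01)   # P_s
        self.coefs = nn.Parameter(torch.randn(num_classes, num_basis))               # a_rs

    def forward(self, user_emb, item_emb):
        # user_emb: (B, E), item_emb: (B, E) -> logits over rating classes: (B, R)
        Q = torch.einsum("rs,sij->rij", self.coefs, self.basis)        # (R, E, E)
        scores = torch.einsum("be,rei,bi->br", user_emb, Q, item_emb)  # u_i^T Q_r v_j per class
        return scores   # feed to softmax / CrossEntropyLoss over rating classes

decoder = SimpleBilinearDecoder(emb_dim=16, num_classes=3)
logits = decoder(torch.randn(4, 16), torch.randn(4, 16))   # (4, 3)
```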
The defined modules are utilized to construct the entire GCMC model.
GCMC
train
The train file contains the training procedure of the model. Since SageMaker copies the information needed to perform the training into /opt/ml/ of the Docker image, the constant _PREFIX = '/opt/ml/' is used for defining the file paths.
The train() function starts by reading the training hyper-parameters. It then initializes the dataset and the GCMC model. After that, the optimizer and its scheduler are defined. Early stopping with patience=100 is used in the training loop.
The objective loss function is set to torch.nn.CrossEntropyLoss() and is computed using the calculate_losses() function imported from utils.py. The CrossEntropyLoss() criterion, the adjacency matrices (M_hat_r_list) estimated by the model, the ground-truth edge_index and edge_type, and the corresponding mask are passed to calculate_losses() to compute the loss defined in Eq. (6) of the original paper.

After the training loss has been computed, the error is backpropagated through the network and the optimizer updates the parameters of the model. The validation loss is computed for the early-stopping mechanism, and the training and validation accuracies are computed to report the training performance status.
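A condensed, illustrative sketch of this loop is shown below; the tiny linear model, random tensors, and plain criterion are stand-ins for the real GCMC model, dataset, and calculate_losses():

```python
# Condensed sketch of the training loop: optimizer, scheduler, and early stopping
# with patience=100. The model and data here are stand-ins for the real objects.
import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 3)                                   # stand-in for the GCMC model
criterion = nn.CrossEntropyLoss()                          # stand-in for calculate_losses()
x_train, y_train = torch.randn(128, 16), torch.randint(0, 3, (128,))
x_val, y_val = torch.randn(32, 16), torch.randint(0, 3, (32,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

best_val, patience, counter = float("inf"), 100, 0
best_state = copy.deepcopy(model.state_dict())

for epoch in range(5000):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)              # Eq. (6) loss in the real code
    loss.backward()
    optimizer.step()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()   # used only for early stopping

    if val_loss < best_val:                                # early-stopping bookkeeping
        best_val, counter = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        counter += 1
        if counter >= patience:
            break

model.load_state_dict(best_state)                          # restore the best checkpoint
```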
inference
After the model has been trained, its artifact (checkpoint) is saved in the specified S3 location. For deploying and serving the model, Flask is used as the application server. Following the SageMaker conventions, the Flask app exposes two endpoints: /ping and /invocations. The /ping endpoint receives an HTTP GET request and determines whether the container is working and healthy, i.e., whether the model is loaded correctly. /invocations is the endpoint that actually produces the recommendations. It receives an HTTP POST request with its arguments passed as a JSON body. These arguments are item_code, which specifies the encoded item's features, and k, the number of recommended users for each rating class.
The class ScoringService() is used to recommend top users for each rating class and the input item. It loads the model and the dataset and computes the predicted recommendation using the recommend_top_k_users() function imported from recommendation_utils.py.
After the recommendations have been computed, the results are returned to the client as JSON data.
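A minimal sketch of the Flask app following these conventions; the ScoringService shown here is a simplified stand-in for the project's class:

```python
# Sketch of the Flask app with the two SageMaker-convention endpoints.
import json
import flask

class ScoringService:
    model = object()                                   # stand-in for the loaded checkpoint

    @classmethod
    def get_model(cls):
        return cls.model

    @classmethod
    def recommend_top_k_users(cls, item_code, k):
        # Stand-in: the real code uses recommend_top_k_users() from recommendation_utils.py.
        return {"class_1": list(range(k)), "class_2": [], "class_3": []}

app = flask.Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    healthy = ScoringService.get_model() is not None   # healthy if the model is loaded
    return flask.Response(status=200 if healthy else 404, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = flask.request.get_json()
    recs = ScoringService.recommend_top_k_users(item_code=payload["item_code"], k=payload["k"])
    return flask.Response(json.dumps(recs), status=200, mimetype="application/json")
```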
Dockerfile
The Dockerfile builds an image that can do both training and inference in SageMaker. The image uses the nginx, gunicorn, Flask stack to serve inferences in a stable way. It starts from a python:3.6 base image, installs the dependencies (such as nginx) and requirements (such as gunicorn, flask, numpy, torch, and torch_geometric), copies the required files and data into their corresponding directories, and sets executor.sh as the entrypoint of the image.
executor.sh
This script is the entrypoint of the project's Docker image. It checks the argument passed to docker run. If the "train" argument is passed, it runs the train file; otherwise (when the serve argument is passed) it runs the serve file.
build.sh
This script builds the docker image specified in the Dockerfile.
push.sh
This script pushes the Docker image to ECR to be ready for use by SageMaker. The argument for this script is the Sagemaker profile. This will be used along with the Docker image name on the local machine and combined with the account and region to form the repository name for ECR.
local_train.sh
This script simulates the training procedure of the Sagemaker on the local machine. It binds the contents of the local_test directory to the /opt/ml of the Docker image and runs it with the train argument.
local_deploy.sh
In the same way as local_train.sh, this script simulates the serving procedure of SageMaker on the local machine. It binds the contents of the local_test directory to /opt/ml of the Docker image and runs it with the serve argument.
/local_test
The contents of this directory are formed in the same way that Sagemaker deals with artifacts and the required input information. In this way, Sagemaker training and serving can be simulated in the local machine.
Train and Serve the Model on AWS SageMaker
In order to train and serve the model on Sagemaker, a Jupyter notebook is used. After configuring a profile with the required policies, this notebook creates a boto3 and Sagemaker session.

After that, the S3 location where the training data is stored is defined, along with the image URI in ECR. The SageMaker Estimator module is used for the training and serving procedures. For training, the hyper-parameters are set and the fit method is called. This executes the train file in the Docker container.


After the training is finished, the trained model's parameters are saved in the corresponding S3 location. By calling the deploy method of the estimator object, the serve file of the Docker image is executed and the model is served through a SageMaker endpoint.

The endpoint can be checked by sending an HTTP POST request with the item_code and k arguments to its URL.
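A hedged sketch of this notebook flow; the image URI, role ARN, S3 path, profile name, and instance types are placeholders:

```python
# Sketch: train with a custom ECR image and deploy a SageMaker endpoint.
import boto3
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session(boto3.Session(profile_name="packt-sagemaker"))  # assumed profile

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/gcmc:latest",   # hypothetical image
    role="arn:aws:iam::123456789012:role/PacktSageMaker",                   # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"epochs": 1000, "lr": 0.01},
    sagemaker_session=session,
)

# Runs the Docker image with the "train" argument (executor.sh -> train).
estimator.fit({"training": "s3://my-bucket/gcmc/input-data/"})              # hypothetical S3 path

# Runs the image with the "serve" argument and creates a SageMaker endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```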

Next Steps
Refactoring the code
Organize the functions in the dataset class
Handle the use-case where the data is needed in the inference server as well
Add comments and docstrings
Handling exceptions
Add Python code to perform SageMaker training and serving from the local machine.
Handling the zero class problem
Adding weights to loss function ✅
- The class probabilities of the dataset are p(*)=0.7742, p(**)=0.1645, p(***)=0.0612.
- The weight of each class is computed as (approximately) the inverse of its probability, w(r) ≈ 1/p(r), giving weights=[1.26, 6.07, 16.33].
- The weights are passed to the CrossEntropyLoss function. Here are the results:
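For reference, a minimal sketch of passing these weights to the loss:

```python
# Sketch: pass the class weights listed above to the cross-entropy loss.
import torch
import torch.nn as nn

class_weights = torch.tensor([1.26, 6.07, 16.33])   # weights from the note above
criterion = nn.CrossEntropyLoss(weight=class_weights)
```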

Implement the mini-batching technique of the original paper
Perform negative sampling to add the zero class
Add a constant parameter in the denominator of the softmax function
Productionizing the code
Handle response time issue
Handle cold-start problem
Creating a blueprint for ML projects
Create a template ML project that handles all of the issues encountered above, end to end