
Amazon MLS-C01 AWS Certified Machine Learning – Specialty Online Training

Question #1

A Machine Learning Specialist is working with multiple data sources containing billions of records that need to be joined.

What feature engineering and model development approach should the Specialist take with a dataset this large?

  • A . Use an Amazon SageMaker notebook for both feature engineering and model development
  • B . Use an Amazon SageMaker notebook for feature engineering and Amazon ML for model development
  • C . Use Amazon EMR for feature engineering and Amazon SageMaker SDK for model development
  • D . Use Amazon ML for both feature engineering and model development.

Correct Answer: C

Explanation:

Amazon EMR is a service that can process large amounts of data efficiently and cost-effectively. It can run distributed frameworks such as Apache Spark, which can perform feature engineering on big data. Amazon SageMaker SDK is a Python library that can interact with Amazon SageMaker service to train and deploy machine learning models. It can also use Amazon EMR as a data source for training data.
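
As an illustration, a minimal SageMaker Python SDK sketch of training on features that a Spark job on Amazon EMR has already written to Amazon S3 (the role ARN, bucket names, and algorithm choice are assumptions):

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # assumed role ARN

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/model-artifacts/",  # assumed bucket
)
# Train directly on the feature files the EMR/Spark job wrote to S3
estimator.fit({"train": TrainingInput("s3://example-bucket/features/train/", content_type="text/csv")})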

Reference: Amazon EMR

Amazon SageMaker SDK

Question #2

A Machine Learning Specialist has completed a proof of concept for a company using a small data sample, and now the Specialist is ready to implement an end-to-end solution in AWS using Amazon SageMaker. The historical training data is stored in Amazon RDS.

Which approach should the Specialist use for training a model using that data?

  • A . Write a direct connection to the SQL database within the notebook and pull data in
  • B . Push the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and provide the S3 location within the notebook.
  • C . Move the data to Amazon DynamoDB and set up a connection to DynamoDB within the notebook to pull data in
  • D . Move the data to Amazon ElastiCache using AWS DMS and set up a connection within the
    notebook to pull data in for fast access.

Correct Answer: B

Explanation:

Pushing the data from Microsoft SQL Server to Amazon S3 using an AWS Data Pipeline and providing the S3 location within the notebook is the best approach for training a model using the data stored in Amazon RDS. This is because Amazon SageMaker can directly access data from Amazon S3 and train models on it. AWS Data Pipeline is a service that can automate the movement and transformation of data between different AWS services. It can also use Amazon RDS as a data source and Amazon S3 as a data destination. This way, the data can be transferred efficiently and securely without writing any code within the notebook.

Reference: Amazon SageMaker

AWS Data Pipeline

Question #3

Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?

  • A . Recall
  • B . Misclassification rate
  • C . Mean absolute percentage error (MAPE)
  • D . Area Under the ROC Curve (AUC)

Correct Answer: D

Explanation:

Area Under the ROC Curve (AUC) is a metric that measures the performance of a binary classifier across all possible thresholds. It is also known as the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example by the classifier. AUC is a good metric to compare different classification models because it is independent of the class distribution and the decision threshold. It also captures both the sensitivity (true positive rate) and the specificity (true negative rate) of the model.
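
For illustration, a minimal scikit-learn sketch of computing AUC from predicted scores (the labels and scores below are made up):

from sklearn.metrics import roc_auc_score

# y_true: actual binary labels; y_scores: predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
print(roc_auc_score(y_true, y_scores))  # closer to 1.0 means positives are ranked above negatives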

Reference: AWS Machine Learning Specialty Exam Guide

AWS Machine Learning Specialty Sample Questions

Question #4

A Machine Learning Specialist is using Amazon SageMaker to host a model for a highly available customer-facing application.

The Specialist has trained a new version of the model, validated it with historical data, and now wants to deploy it to production. To limit any risk of a negative customer experience, the Specialist wants to be able to monitor the model and roll it back, if needed.

What is the SIMPLEST approach with the LEAST risk to deploy the model and roll it back, if needed?

  • A . Create a SageMaker endpoint and configuration for the new model version. Redirect production traffic to the new endpoint by updating the client configuration. Revert traffic to the last version if the model does not perform as expected.
  • B . Create a SageMaker endpoint and configuration for the new model version. Redirect production traffic to the new endpoint by using a load balancer. Revert traffic to the last version if the model does not perform as expected.
  • C . Update the existing SageMaker endpoint to use a new configuration that is weighted to send 5% of the traffic to the new variant. Revert traffic to the last version by resetting the weights if the model does not perform as expected.
  • D . Update the existing SageMaker endpoint to use a new configuration that is weighted to send 100% of the traffic to the new variant. Revert traffic to the last version by resetting the weights if the model does not perform as expected.

Correct Answer: C

Explanation:

Updating the existing SageMaker endpoint to use a new configuration that is weighted to send 5% of the traffic to the new variant is the simplest approach with the least risk to deploy the model and roll it back, if needed. This is because SageMaker supports A/B testing, which allows the Specialist to compare the performance of different model variants by sending a portion of the traffic to each variant. The Specialist can monitor the metrics of each variant and adjust the weights accordingly. If the new variant does not perform as expected, the Specialist can revert traffic to the last version by resetting the weights to 100% for the old variant and 0% for the new variant. This way, the Specialist can deploy the model without affecting the customer experience and roll it back easily if needed.
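
A hedged boto3 sketch of shifting a small share of traffic to the new variant on the existing endpoint (the endpoint and variant names are illustrative assumptions):

import boto3

sm = boto3.client("sagemaker")

# Send 5% of traffic to the new variant; roll back by resetting the weights to 100/0
sm.update_endpoint_weights_and_capacities(
    EndpointName="prod-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "ModelV1", "DesiredWeight": 95.0},
        {"VariantName": "ModelV2", "DesiredWeight": 5.0},
    ],
)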

Reference: Amazon SageMaker

Deploying models to Amazon SageMaker hosting services

Question #5

A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter.

Which machine learning approach should be used to solve this problem?

  • A . Logistic regression
  • B . Random Cut Forest (RCF)
  • C . Principal component analysis (PCA)
  • D . Linear regression

Correct Answer: D

Explanation:

Linear regression is a machine learning approach that can be used to solve this problem. Linear regression is a supervised learning technique that can model the relationship between one or more input variables (features) and an output variable (target). In this case, the input variables could be the historical sales data of the part, such as the quarter, the demand, the price, the inventory, etc. The output variable could be the number of units to be produced for the part. Linear regression can learn the coefficients (weights) of the input variables that best fit the output variable, and then use them to make predictions for new data. Linear regression is suitable for problems that involve continuous and numeric output variables, such as predicting house prices, stock prices, or sales volumes.
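
A minimal scikit-learn sketch of this idea; the feature values and units below are made up for illustration:

from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical per-quarter features (e.g., demand, price, inventory) and units produced
X = np.array([[120, 9.5, 300], [150, 9.0, 280], [90, 10.5, 350], [200, 8.5, 260]])
y = np.array([1000, 1250, 800, 1600])

model = LinearRegression().fit(X, y)
print(model.predict([[130, 9.2, 290]]))  # predicted units for the next quarter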

Reference: AWS Machine Learning Specialty Exam Guide

Linear Regression

Question #6

A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.

Which solution requires the LEAST effort to be able to query this data?

  • A . Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
  • B . Use AWS Glue to catalogue the data and Amazon Athena to run queries
  • C . Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries
  • D . Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries

Correct Answer: B

Explanation:

AWS Glue is a serverless data integration service that can catalogue, clean, enrich, and move data between various data stores. Amazon Athena is an interactive query service that can run SQL queries on data stored in Amazon S3. By using AWS Glue to catalogue the data and Amazon Athena to run queries, the Machine Learning Specialist can leverage the existing data in Amazon S3 without any additional data transformation or loading. This solution requires the least effort compared to the other options, which involve more complex and costly data processing and storage services.
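
For illustration, a hedged boto3 sketch of querying a Glue-catalogued table with Athena (the database, table, and results bucket names are assumptions):

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM sales_events LIMIT 10",
    QueryExecutionContext={"Database": "manufacturing_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # results land in the S3 output location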

Reference: AWS Glue, Amazon Athena

Question #7

A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs.

What does the Specialist need to do?

  • A . Bundle the NVIDIA drivers with the Docker image
  • B . Build the Docker container to be NVIDIA-Docker compatible
  • C . Organize the Docker container’s file structure to execute on GPU instances.
  • D . Set the GPU flag in the Amazon SageMaker Create TrainingJob request body

Correct Answer: B

Explanation:

To leverage the NVIDIA GPUs on Amazon EC2 P3 instances, the Machine Learning Specialist needs to build the Docker container to be NVIDIA-Docker compatible. NVIDIA-Docker is a tool that enables GPU-accelerated containers to run on Docker. It automatically configures the container to access the NVIDIA drivers and libraries on the host system. The Specialist does not need to bundle the NVIDIA drivers with the Docker image, as they are already installed on the EC2 P3 instances. The Specialist does not need to organize the Docker container’s file structure to execute on GPU instances, as this is not relevant for GPU compatibility. The Specialist does not need to set the GPU flag in the Amazon SageMaker Create TrainingJob request body, as this is only required for using Elastic Inference accelerators, not EC2 P3 instances.

Reference: NVIDIA-Docker, Using GPU-Accelerated Containers, Using Elastic Inference in Amazon SageMaker

Question #8

A large JSON dataset for a project has been uploaded to a private Amazon S3 bucket. The Machine Learning Specialist wants to securely access and explore the data from an Amazon SageMaker notebook instance. A new VPC was created and assigned to the Specialist.

How can the privacy and integrity of the data stored in Amazon S3 be maintained while granting access to the Specialist for analysis?

  • A . Launch the SageMaker notebook instance within the VPC with SageMaker-provided internet access enabled. Use an S3 ACL to open read privileges to the everyone group.
  • B . Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data. Copy the JSON dataset from Amazon S3 into the ML storage volume on the SageMaker notebook instance and work against the local dataset.
  • C . Launch the SageMaker notebook instance within the VPC and create an S3 VPC endpoint for the notebook to access the data. Define a custom S3 bucket policy to only allow requests from your VPC to access the S3 bucket.
  • D . Launch the SageMaker notebook instance within the VPC with SageMaker-provided internet access enabled. Generate an S3 pre-signed URL for access to data in the bucket.

Correct Answer: C

Explanation:

The best way to maintain the privacy and integrity of the data stored in Amazon S3 is to use a combination of VPC endpoints and S3 bucket policies. A VPC endpoint allows the SageMaker notebook instance to access the S3 bucket without going through the public internet. A bucket policy allows the S3 bucket owner to specify which VPCs or VPC endpoints can access the bucket. This way, the data is protected from unauthorized access and tampering. The other options are either insecure (A and D) or inefficient (B).
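
A hedged sketch of such a bucket policy applied with boto3; the bucket name and VPC endpoint ID are illustrative assumptions:

import json
import boto3

s3 = boto3.client("s3")

# Deny any request to the bucket that does not arrive through the specified S3 VPC endpoint
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::example-ml-data", "arn:aws:s3:::example-ml-data/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"}},
        }
    ],
}
s3.put_bucket_policy(Bucket="example-ml-data", Policy=json.dumps(policy))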

Reference: Using Amazon S3 VPC Endpoints, Using Bucket Policies and User Policies

Question #9

Given the following confusion matrix for a movie classification model, what is the true class frequency for Romance and the predicted class frequency for Adventure?

  • A . The true class frequency for Romance is 77.56% and the predicted class frequency for Adventure is 20.85%.
  • B . The true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is 13.12%.
  • C . The true class frequency for Romance is 0.78 and the predicted class frequency for Adventure is (0.47 – 0.32).
  • D . The true class frequency for Romance is 77.56% * 0.78 and the predicted class frequency for Adventure is 20.85% * 0.32.

Correct Answer: B

Explanation:

The true class frequency for Romance is the percentage of movies that are actually Romance out of all the movies. This can be calculated by dividing the sum of the true values for Romance by the total number of movies. The predicted class frequency for Adventure is the percentage of movies that are predicted to be Adventure out of all the movies. This can be calculated by dividing the sum of the predicted values for Adventure by the total number of movies. Based on the confusion matrix, the true class frequency for Romance is 57.92% and the predicted class frequency for Adventure is 13.12%.
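
For illustration, the row-total and column-total calculation described above, applied to a hypothetical 2x2 confusion matrix with made-up counts:

import numpy as np

# Rows = true class, columns = predicted class (hypothetical counts)
cm = np.array([[50, 10],
               [ 5, 35]])

total = cm.sum()
true_class_freq = cm.sum(axis=1) / total       # row totals: how often each class truly occurs
predicted_class_freq = cm.sum(axis=0) / total  # column totals: how often each class is predicted
print(true_class_freq, predicted_class_freq)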

Reference: Confusion Matrix, Classification Metrics

Question #10

A Machine Learning Specialist is building a supervised model that will evaluate customers’ satisfaction with their mobile phone service based on recent usage. The model’s output should infer whether or not a customer is likely to switch to a competitor in the next 30 days.

Which of the following modeling techniques should the Specialist use?

  • A . Time-series prediction
  • B . Anomaly detection
  • C . Binary classification
  • D . Regression

Correct Answer: C

Explanation:

The modeling technique that the Machine Learning Specialist should use is binary classification. Binary classification is a type of supervised learning that predicts whether an input belongs to one of two possible classes. In this case, the input is the customer’s recent usage data and the output is whether or not the customer is likely to switch to a competitor in the next 30 days. This is a binary outcome, either yes or no, so binary classification is suitable for this problem. The other options are not appropriate for this problem. Time-series prediction is a type of supervised learning that forecasts future values based on past and present data. Anomaly detection is a type of unsupervised learning that identifies outliers or abnormal patterns in the data. Regression is a type of supervised learning that estimates a continuous numerical value based on the input features.

Reference: Binary Classification, Time Series Prediction, Anomaly Detection, Regression

Question #11

A web-based company wants to improve its conversion rate on its landing page. Using a large historical dataset of customer visits, the company has repeatedly trained a multi-class deep learning network algorithm on Amazon SageMaker. However, there is an overfitting problem: training data shows 90% accuracy in predictions, while test data shows only 70% accuracy.

The company needs to boost the generalization of its model before deploying it into production to maximize conversions of visits to purchases.

Which action is recommended to provide the HIGHEST accuracy model for the company’s test and validation data?

  • A . Increase the randomization of training data in the mini-batches used in training.
  • B . Allocate a higher proportion of the overall data to the training dataset
  • C . Apply L1 or L2 regularization and dropouts to the training.
  • D . Reduce the number of layers and units (or neurons) from the deep learning network.

Correct Answer: C

Explanation:

Regularization and dropouts are techniques that can help reduce overfitting in deep learning models. Overfitting occurs when the model learns too much from the training data and fails to generalize well to new data. Regularization adds a penalty term to the loss function that penalizes the model for having large or complex weights. This prevents the model from memorizing the noise or irrelevant features in the training data. L1 and L2 are two types of regularization that differ in how they calculate the penalty term. L1 regularization uses the absolute value of the weights, while L2 regularization uses the square of the weights. Dropouts are another technique that randomly drops out some units or neurons from the network during training. This creates a thinner network that is less prone to overfitting. Dropouts also act as a form of ensemble learning, where multiple sub-models are combined to produce a better prediction. By applying regularization and dropouts to the training, the web-based company can improve the generalization and accuracy of its deep learning model on the test and validation data.
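
A minimal Keras sketch of combining L2 weight penalties with dropout; the layer sizes, input width, and rates are illustrative assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(50,),
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),   # randomly drop half of the units during training
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])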

Reference: Regularization: A video that explains the concept and benefits of regularization in deep learning.

Dropout: A video that demonstrates how dropout works and why it helps reduce overfitting.

Question #12

A Machine Learning Specialist was given a dataset consisting of unlabeled data. The Specialist must create a model that can help the team classify the data into different buckets.

What model should be used to complete this work?

  • A . K-means clustering
  • B . Random Cut Forest (RCF)
  • C . XGBoost
  • D . BlazingText

Correct Answer: A

Explanation:

K-means clustering is a machine learning technique that can be used to classify unlabeled data into different groups based on their similarity. It is an unsupervised learning method, which means it does not require any prior knowledge or labels for the data. K-means clustering works by randomly assigning data points to a number of clusters, then iteratively updating the cluster centers and reassigning the data points until the clusters are stable. The result is a partition of the data into distinct and homogeneous groups. K-means clustering can be useful for exploratory data analysis, data compression, anomaly detection, and feature extraction.
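
For illustration, a minimal scikit-learn sketch of clustering unlabeled records into buckets (the data and number of clusters are made up):

from sklearn.cluster import KMeans
import numpy as np

X = np.random.rand(200, 5)          # unlabeled data with 5 features (illustrative)
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)
print(kmeans.labels_[:10])          # cluster ("bucket") assignment for each record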

Reference: K-Means Clustering: A tutorial on how to use K-means clustering with Amazon SageMaker.

Unsupervised Learning: A video that explains the concept and applications of unsupervised learning.

Question #13

A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product, such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories, such as books, games, electronics, and movies.

Which model should be used for categorizing new products using the provided dataset for training?

  • A . An XGBoost model where the objective parameter is set to multi:softmax
  • B . A deep convolutional neural network (CNN) with a softmax activation function for the last layer
  • C . A regression forest where the number of trees is set equal to the number of product categories
  • D . A DeepAR forecasting model based on a recurrent neural network (RNN)

Correct Answer: A

Explanation:

XGBoost is a machine learning framework that can be used for classification, regression, ranking, and other tasks. It is based on the gradient boosting algorithm, which builds an ensemble of weak learners (usually decision trees) to produce a strong learner. XGBoost has several advantages over other algorithms, such as scalability, parallelization, regularization, and sparsity handling. For categorizing new products using the provided dataset, an XGBoost model would be a suitable choice, because it can handle multiple features and multiple classes efficiently and accurately. To train an XGBoost model for multi-class classification, the objective parameter should be set to multi: softmax, which means that the model will output a probability distribution over the classes and predict the class with the highest probability. Alternatively, the objective parameter can be set to multi: softprob, which means that the model will output the raw probability of each class instead of the predicted class label. This can be useful for evaluating the model performance or for post-processing the predictions.
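
A minimal xgboost sketch of multi-class training with this objective; the random data, 15 features, and hyperparameters are illustrative assumptions:

import xgboost as xgb
import numpy as np

X = np.random.rand(1200, 15)                 # 15 product features (illustrative)
y = np.random.randint(0, 6, size=1200)       # 6 product categories

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "multi:softmax", "num_class": 6, "max_depth": 6, "eta": 0.3}
booster = xgb.train(params, dtrain, num_boost_round=50)
print(booster.predict(dtrain)[:10])          # predicted category indices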

Reference: XGBoost: A tutorial on how to use XGBoost with Amazon SageMaker.

XGBoost Parameters: A reference guide for the parameters of XGBoost.

Question #14

A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model.

What should the Specialist do to prepare the data for model training?

  • A . Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution
  • B . Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude
  • C . Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude
  • D . Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.

Correct Answer: C

Explanation:

Normalization is a data preprocessing technique that can be used to scale the input features to a common range, such as [-1, 1] or [0, 1]. Normalization can help reduce the effect of outliers, improve the convergence of gradient-based algorithms, and prevent variables with a larger magnitude from dominating the model. One common method of normalization is standardization, which transforms each feature to have a mean of 0 and a variance of 1. This can be done by subtracting the mean and dividing by the standard deviation of each feature. Standardization can be useful for models that assume the input features are normally distributed, such as linear regression, logistic regression, and support vector machines.
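
For illustration, a short scikit-learn sketch of standardization on made-up values with very different magnitudes:

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[50000.0, 3.2], [62000.0, 2.9], [48000.0, 4.1]])  # features on very different scales
X_scaled = StandardScaler().fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))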

Reference: Data normalization and standardization: A video that explains the concept and benefits of data normalization and standardization.

Standardize or Normalize?: A blog post that compares different methods of scaling the input features.

Question #15

A Machine Learning Specialist prepared the following graph displaying the results of k-means for k = [1:10]

Considering the graph, what is a reasonable selection for the optimal choice of k?

  • A . 1
  • B . 4
  • C . 7
  • D . 10

Correct Answer: B

Explanation:

The elbow method is a technique that we use to determine the number of centroids (k) to use in a k-means clustering algorithm. In this method, we plot the within-cluster sum of squares (WCSS) against the number of clusters (k) and look for the point where the curve bends sharply. This point is called the elbow point and it indicates that adding more clusters does not improve the model significantly. The graph in the question shows that the elbow point is at k = 4, which means that 4 is a reasonable choice for the optimal number of clusters.
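
A minimal scikit-learn sketch of computing the WCSS curve used by the elbow method (the data is made up; the graph in the question is not reproduced here):

from sklearn.cluster import KMeans
import numpy as np

X = np.random.rand(300, 4)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    wcss.append(km.inertia_)        # within-cluster sum of squares for each k
# Plotting wcss against k and looking for the sharp bend gives the elbow point.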

Reference: Elbow Method for optimal value of k in KMeans: A tutorial on how to use the elbow method with Amazon SageMaker.

K-Means Clustering: A video that explains the concept and benefits of k-means clustering.

Question #16

A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents.

How should a Machine Learning Specialist address this issue for future documents?

  • A . Convert current documents to SSML with pronunciation tags
  • B . Create an appropriate pronunciation lexicon.
  • C . Output speech marks to guide in pronunciation
  • D . Use Amazon Lex to preprocess the text files for pronunciation

Correct Answer: B

Explanation:

A pronunciation lexicon is a file that defines how words or phrases should be pronounced by Amazon Polly. A lexicon can help customize the speech output for words that are uncommon, foreign, or have multiple pronunciations. A lexicon must conform to the Pronunciation Lexicon Specification (PLS) standard and can be stored in an AWS region using the Amazon Polly API. To use a lexicon for synthesizing speech, upload it with the PutLexicon API and then reference it by name, using the LexiconNames parameter, in the SynthesizeSpeech request.

For example, the following lexicon defines how to pronounce the acronym W3C:

<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>

To use this lexicon, pass its name in the SynthesizeSpeech request (for example, LexiconNames=["w3cLexicon"]) together with the text or SSML input, such as:

<speak>
  The <say-as interpret-as="characters">W3C</say-as> is an international community
  that develops open standards to ensure the long-term growth of the Web.
</speak>
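
A hedged boto3 sketch of uploading and applying such a lexicon; the lexicon name, file paths, and voice are illustrative assumptions:

import boto3

polly = boto3.client("polly")

# Upload the PLS lexicon, then apply it by name when synthesizing speech
with open("w3c_lexicon.pls") as f:
    polly.put_lexicon(Name="w3cLexicon", Content=f.read())

response = polly.synthesize_speech(
    Text="<speak>The W3C is an international community.</speak>",
    TextType="ssml",
    VoiceId="Joanna",
    LexiconNames=["w3cLexicon"],
    OutputFormat="mp3",
)
with open("announcement.mp3", "wb") as out:
    out.write(response["AudioStream"].read())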

Reference: Customize pronunciation using lexicons in Amazon Polly: A blog post that explains how to use lexicons for creating custom pronunciations.

Managing Lexicons: A documentation page that describes how to store and retrieve lexicons using the Amazon Polly API.

Question #17

A Machine Learning Specialist is using Apache Spark for pre-processing training data. As part of the Spark pipeline, the Specialist wants to use Amazon SageMaker for training a model and hosting it.

Which of the following would the Specialist do to integrate the Spark application with SageMaker? (Select THREE)

  • A . Download the AWS SDK for the Spark environment
  • B . Install the SageMaker Spark library in the Spark environment.
  • C . Use the appropriate estimator from the SageMaker Spark Library to train a model.
  • D . Compress the training data into a ZIP file and upload it to a pre-defined Amazon S3 bucket.
  • E . Use the sageMakerModel.transform method to get inferences from the model hosted in SageMaker
  • F . Convert the DataFrame object to a CSV file, and use the CSV file as input for obtaining inferences from SageMaker.

Correct Answer: B, C, E

Explanation:

The SageMaker Spark library is a library that enables Apache Spark applications to integrate with Amazon SageMaker for training and hosting machine learning models.

The library provides several features, such as:

Estimators: Classes that allow Spark users to train Amazon SageMaker models and host them on Amazon SageMaker endpoints using the Spark MLlib Pipelines API. The library supports various built-in algorithms, such as linear learner, XGBoost, K-means, etc., as well as custom algorithms using Docker containers.

Model classes: Classes that wrap Amazon SageMaker models in a Spark MLlib Model abstraction. This allows Spark users to use Amazon SageMaker endpoints for inference within Spark applications.

Data sources: Classes that allow Spark users to read data from Amazon S3 using the Spark Data Sources API. The library supports various data formats, such as CSV, LibSVM, RecordIO, etc.

To integrate the Spark application with SageMaker, the Machine Learning Specialist should do the following:

Install the SageMaker Spark library in the Spark environment. This can be done by using Maven, pip, or downloading the JAR file from GitHub.

Use the appropriate estimator from the SageMaker Spark Library to train a model.

For example, the chosen estimator's fit method trains the model on Amazon SageMaker directly from a Spark DataFrame and returns a SageMakerModel.

Use the SageMakerModel.transform method to get inferences from the model hosted in SageMaker. For example, to get predictions for a test DataFrame, the Specialist can call transform on the returned model, as in the sketch below.
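
A minimal PySpark sketch of this fit/transform pattern, assuming the sagemaker_pyspark package; the estimator class shown (KMeansSageMakerEstimator), its parameters, the role ARN, and the S3 path are illustrative assumptions and may differ by library version:

from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

training_df = spark.read.format("libsvm").load("s3a://example-bucket/train/")  # assumed path

estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # assumed role
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
)
estimator.setK(10)
estimator.setFeatureDim(784)

model = estimator.fit(training_df)          # trains on SageMaker and deploys an endpoint
predictions = model.transform(training_df)  # calls the hosted endpoint for inference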

Reference: [SageMaker Spark]: A documentation page that introduces the SageMaker Spark library and its features.

[SageMaker Spark GitHub Repository]: A GitHub repository that contains the source code, examples, and installation instructions for the SageMaker Spark library.


Question #18

A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis.

What is the MOST efficient way to accomplish these tasks?

  • A . Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
  • B . Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake.
  • C . Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3.
  • D . Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.

Correct Answer: A

Explanation:

Amazon Kinesis Data Firehose is a fully managed service that can capture, transform, and load streaming data into AWS data stores, such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. It can also invoke AWS Lambda functions to perform custom transformations on the data. Amazon Kinesis Data Analytics is a service that can analyze streaming data in real time using SQL or Apache Flink applications. It can also use machine learning algorithms, such as Random Cut Forest (RCF), to perform anomaly detection on streaming data. RCF is an unsupervised learning algorithm that assigns an anomaly score to each data point based on how different it is from the rest of the data. By using Kinesis Data Firehose and Kinesis Data Analytics, the cybersecurity company can ingest the data in real time, score the malicious events as anomalies, and stream the results to Amazon S3, which can serve as a data lake for later processing and analysis.

This is the most efficient way to accomplish these tasks, as it does not require any additional infrastructure, coding, or training.

Reference: Amazon Kinesis Data Firehose – Amazon Web Services

Amazon Kinesis Data Analytics – Amazon Web Services

Anomaly Detection with Amazon Kinesis Data Analytics – Amazon Web Services

AWS Certified Machine Learning – Specialty Sample Questions

Question #19

A Machine Learning Specialist works for a credit card processing company and needs to predict which transactions may be fraudulent in near-real time. Specifically, the Specialist must train a model that returns the probability that a given transaction may be fraudulent.

How should the Specialist frame this business problem?

  • A . Streaming classification
  • B . Binary classification
  • C . Multi-category classification
  • D . Regression classification

Correct Answer: B

Explanation:

Binary classification is a type of supervised learning problem where the goal is to predict a categorical label that has only two possible values, such as Yes or No, True or False, Positive or Negative. In this case, the label is whether a transaction is fraudulent or not, which is a binary outcome. Binary classification can be used to estimate the probability of an observation belonging to a certain class, such as the probability of a transaction being fraudulent. This can help the business to make decisions based on the risk level of each transaction.
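
For illustration, a minimal scikit-learn sketch of returning a fraud probability from a binary classifier (the random features and labels are stand-ins for real transaction data):

from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.rand(1000, 10)                 # illustrative transaction features
y = np.random.randint(0, 2, size=1000)       # 1 = fraudulent, 0 = legitimate

clf = LogisticRegression(max_iter=1000).fit(X, y)
fraud_probability = clf.predict_proba(X[:5])[:, 1]  # probability of the positive (fraud) class
print(fraud_probability)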

Reference: Binary Classification – Amazon Machine Learning

AWS Certified Machine Learning – Specialty Sample Questions

Question #20

Amazon Connect has recently been rolled out across a company as a contact call center. The solution has been configured to store voice call recordings on Amazon S3.

The content of the voice calls is being analyzed for the incidents being discussed by the call operators. Amazon Transcribe is being used to convert the audio to text, and the output is stored on Amazon S3.

Which approach will provide the information required for further analysis?

  • A . Use Amazon Comprehend with the transcribed files to build the key topics
  • B . Use Amazon Translate with the transcribed files to train and build a model for the key topics
  • C . Use the AWS Deep Learning AMI with Gluon Semantic Segmentation on the transcribed files to train and build a model for the key topics
  • D . Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the transcribed files to generate a word embeddings dictionary for the key topics

Correct Answer: A

Explanation:

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. It can analyze text documents and identify the key topics, entities, sentiments, languages, and more. In this case, Amazon Comprehend can be used with the transcribed files from Amazon Transcribe to extract the main topics that are being discussed by the call operators. This can help to understand the common issues and concerns of the customers, and provide insights for further analysis and improvement.
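
A hedged boto3 sketch of analyzing a transcribed snippet with Amazon Comprehend; the sample text is an illustrative assumption:

import boto3

comprehend = boto3.client("comprehend")

transcript = "The customer reported an outage after the latest firmware update."  # illustrative text

key_phrases = comprehend.detect_key_phrases(Text=transcript, LanguageCode="en")
sentiment = comprehend.detect_sentiment(Text=transcript, LanguageCode="en")
print([p["Text"] for p in key_phrases["KeyPhrases"]], sentiment["Sentiment"])

# For corpus-level key topics across many transcripts stored in S3, a topics detection job
# can be started with comprehend.start_topics_detection_job (S3 locations and role assumed).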

Reference: Amazon Comprehend – Amazon Web Services

AWS Certified Machine Learning – Specialty Sample Questions

Question #21

A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis, the Specialist observes that many features are highly correlated with each other. This may make the model unstable.

What should be done to reduce the impact of having such a large number of features?

  • A . Perform one-hot encoding on highly correlated features
  • B . Use matrix multiplication on highly correlated features.
  • C . Create a new feature space using principal component analysis (PCA)
  • D . Apply the Pearson correlation coefficient

Correct Answer: C

Explanation:

Principal component analysis (PCA) is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on. By using PCA, the impact of having a large number of features that are highly correlated with each other can be reduced, as the new feature space will have fewer dimensions and less redundancy. This can make the linear models more stable and less prone to overfitting.
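
For illustration, a minimal scikit-learn sketch of projecting many correlated features onto a smaller set of uncorrelated components (the data and component count are made up):

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(500, 40)                # many (possibly correlated) features, illustrative
pca = PCA(n_components=10)                 # keep 10 uncorrelated components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum()) # variance retained by the new feature space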

Reference:

Principal Component Analysis (PCA) Algorithm – Amazon SageMaker

Perform a large-scale principal component analysis faster using Amazon SageMaker | AWS Machine Learning Blog

Machine Learning – Principal Component Analysis | i2tutorials

Question #22

A Machine Learning Specialist wants to determine the appropriate SageMaker Variant Invocations Per Instance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5.

Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMaker Variant Invocations Per Instance setting?

  • A . 10
  • B . 30
  • C . 600
  • D . 2,400

Correct Answer: C

Explanation:

The SageMaker Variant Invocations Per Instance setting is the target value for the average number of invocations per instance per minute for the model variant. It is used by the automatic scaling policy to add or remove instances to keep the metric close to the specified value. To determine this value, the following equation can be used in combination with load testing:

SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60

Where MAX_RPS is the maximum requests per second that the model variant can handle without service degradation, SAFETY_FACTOR is a factor that ensures that the clients do not exceed the maximum RPS, and 60 is the conversion factor from seconds to minutes. In this case, the given parameters are:

MAX_RPS = 20
SAFETY_FACTOR = 0.5

Plugging these values into the equation, we get:

SageMakerVariantInvocationsPerInstance = (20 * 0.5) * 60 = 600

Therefore, the Specialist should set the SageMaker Variant Invocations Per Instance setting to 600.
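
A hedged boto3 sketch of registering this target value with Application Auto Scaling; the endpoint and variant names, capacities, and policy name are illustrative assumptions:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/forecast-endpoint/variant/AllTraffic"  # assumed endpoint/variant names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 600.0,  # (20 RPS * 0.5 safety factor) * 60 seconds
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)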

Reference: Load testing your auto scaling configuration – Amazon SageMaker

Configure model auto scaling with the console – Amazon SageMaker

Question #23

A Machine Learning Specialist deployed a model that provides product recommendations on a company’s website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago.

Which method should the Specialist try to improve model performance?

  • A . The model needs to be completely re-engineered because it is unable to handle product inventory changes
  • B . The model’s hyperparameters should be periodically updated to prevent drift
  • C . The model should be periodically retrained from scratch using the original data while adding a regularization term to handle product inventory changes
  • D . The model should be periodically retrained using the original training data plus new data as product inventory changes

Correct Answer: D

Explanation:

The problem that the Machine Learning Specialist is facing is likely due to concept drift, which is a phenomenon where the statistical properties of the target variable change over time, making the model less accurate and relevant. Concept drift can occur due to various reasons, such as changes in customer preferences, market trends, product inventory, seasonality, etc. In this case, the product recommendations model may have become outdated as the product inventory changed over time, making the recommendations less appealing to the customers. To address this issue, the model should be periodically retrained using the original training data plus new data as product inventory changes. This way, the model can learn from the latest data and adapt to the changing customer behavior and preferences. Retraining the model from scratch using the original data while adding a regularization term may not be sufficient, as it does not account for the new data. Updating the model’s hyperparameters may not help either, as it does not address the underlying data distribution change. Re-engineering the model completely may not be necessary, as the model may still be valid and useful with periodic retraining.

Reference: Concept Drift – Amazon SageMaker

Detecting and Handling Concept Drift – Amazon SageMaker

Machine Learning Concepts – Amazon Machine Learning

Question #24

A manufacturer of car engines collects data from cars as they are being driven. The data collected includes timestamp, engine temperature, rotations per minute (RPM), and other sensor readings. The company wants to predict when an engine is going to have a problem so it can notify drivers in advance to get engine maintenance. The engine data is loaded into a data lake for training.

Which is the MOST suitable predictive model that can be deployed into production?

  • A . Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a recurrent neural network (RNN) to train the model to recognize when an engine might need maintenance for a certain fault.
  • B . This data requires an unsupervised learning algorithm. Use Amazon SageMaker k-means to cluster the data.
  • C . Add labels over time to indicate which engine faults occur at what time in the future to turn this into a supervised learning problem. Use a convolutional neural network (CNN) to train the model to recognize when an engine might need maintenance for a certain fault.
  • D . This data is already formulated as a time series. Use Amazon SageMaker seq2seq to model the time series.

Correct Answer: A

Explanation:

A recurrent neural network (RNN) is a type of neural network that can process sequential data, such as time series, by maintaining a hidden state that captures the temporal dependencies between the inputs. RNNs are well suited for predicting future events based on past observations, such as forecasting engine failures based on sensor readings. To train an RNN model, the data needs to be labeled with the target variable, which in this case is the type and time of the engine fault. This makes the problem a supervised learning problem, where the goal is to learn a mapping from the input sequence (sensor readings) to the output sequence (engine faults). By using an RNN model, the manufacturer can leverage the temporal information in the data and detect patterns that indicate when an engine might need maintenance for a certain fault.
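
A minimal Keras sketch of such a sequence model using an LSTM (a common RNN variant); the window length, feature count, number of fault types, and loss choice are illustrative assumptions:

import tensorflow as tf

# Windows of sensor readings; labels indicate whether a given fault occurs within the horizon
timesteps, n_features, n_fault_types = 60, 8, 5
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(timesteps, n_features)),
    tf.keras.layers.Dense(n_fault_types, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])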

Reference: Recurrent Neural Networks – Amazon SageMaker

Use Amazon SageMaker Built-in Algorithms or Pre-trained Models

Recurrent Neural Network Definition | DeepAI

What are Recurrent Neural Networks? An Ultimate Guide for Newbies!

Lee and Carter go Machine Learning: Recurrent Neural Networks – SSRN

Question #25

A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.

Which tool should be used to improve the validation accuracy?

  • A . Amazon Comprehend syntax analysis and entity detection
  • B . Amazon SageMaker BlazingText allow mode
  • C . Natural Language Toolkit (NLTK) stemming and stop word removal
  • D . Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizers

Correct Answer: D

Explanation:

Term frequency-inverse document frequency (TF-IDF) is a technique that assigns a weight to each word in a document based on how important it is to the meaning of the document. The term frequency (TF) measures how often a word appears in a document, while the inverse document frequency (IDF) measures how rare a word is across a collection of documents. The TF-IDF weight is the product of the TF and IDF values, and it is high for words that are frequent in a specific document but rare in the overall corpus. TF-IDF can help improve the validation accuracy of a sentiment analysis model by reducing the impact of common words that have little or no sentiment value, such as “the”, “a”, “and”, etc. Scikit-learn is a popular Python library for machine learning that provides a TF-IDF vectorizer class that can transform a collection of text documents into a matrix of TF-IDF features. By using this tool, the Data Scientist can create a more informative and discriminative feature representation for the sentiment analysis task.
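
For illustration, a minimal scikit-learn sketch of the TF-IDF vectorizer on a few made-up reviews:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the service was great", "terrible support and slow response", "great response time"]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # sparse TF-IDF feature matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())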

Reference: TfidfVectorizer – scikit-learn

Text feature extraction – scikit-learn

TF-IDF for Beginners | by Jana Schmidt | Towards Data Science

Sentiment Analysis: Concept, Analysis and Applications | by Susan Li | Towards Data Science

Question #26

A Machine Learning Specialist is developing a recommendation engine for a photography blog. Given a picture, the recommendation engine should show a picture that captures similar objects. The Specialist would like to create a numerical representation feature to perform nearest-neighbor searches.

What actions would allow the Specialist to get relevant numerical representations?

  • A . Reduce image resolution and use reduced resolution pixel values as features
  • B . Use Amazon Mechanical Turk to label image content and create a one-hot representation indicating the presence of specific labels
  • C . Run images through a neural network pre-trained on ImageNet, and collect the feature vectors from the penultimate layer
  • D . Average colors by channel to obtain three-dimensional representations of images.

Correct Answer: C

Explanation:

A neural network pre-trained on ImageNet is a deep learning model that has been trained on a large dataset of images containing 1000 classes of objects. The model can learn to extract high-level features from the images that capture the semantic and visual information of the objects. The penultimate layer of the model is the layer before the final output layer, and it contains a feature vector that represents the input image in a lower-dimensional space. By running images through a pre-trained neural network and collecting the feature vectors from the penultimate layer, the Specialist can obtain relevant numerical representations that can be used for nearest-neighbor searches. The feature vectors can capture the similarity between images based on the presence and appearance of similar objects, and they can be compared using distance metrics such as Euclidean distance or cosine similarity. This approach can enable the recommendation engine to show a picture that captures similar objects to a given picture.
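
A minimal Keras sketch of extracting such feature vectors; the use of ResNet50 and a random array standing in for a real photo are illustrative assumptions:

import numpy as np
import tensorflow as tf

# ResNet50 pre-trained on ImageNet, truncated before the classification head; the pooled
# output acts as the penultimate-layer feature vector for nearest-neighbor search.
model = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")

image = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a real photo
image = tf.keras.applications.resnet50.preprocess_input(image * 255.0)
features = model.predict(image)     # shape (1, 2048) feature vector
print(features.shape)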

Reference: ImageNet – Wikipedia

How to use a pre-trained neural network to extract features from images | by Rishabh Anand | Analytics Vidhya | Medium

Image Similarity using Deep Ranking | by Aditya Oke | Towards Data Science

Question #27

A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.

The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns. Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.

Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)

  • A . Add more deep trees to the random forest to enable the model to learn more features.
  • B . Include a copy of the samples in the test dataset in the training dataset
  • C . Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
  • D . Change the cost function so that false negatives have a higher impact on the cost value than false positives
  • E . Change the cost function so that false positives have a higher impact on the cost value than false negatives

Correct Answer: C, D

Explanation:

The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:

C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a technique called data augmentation, which can help increase the size and diversity of the training data for the minority class. This can help the random forest model learn more features and patterns from the positive class and reduce the imbalance ratio.

D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user is actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class.
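
A hedged sketch combining both ideas, C and D; the array sizes, noise level, and class weights are illustrative assumptions (a class-weighted random forest is used here as one way to make false negatives costlier):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_pos, X_neg = np.random.rand(1000, 200), np.random.rand(99000, 200)   # illustrative scale

# (C) Oversample the minority class by duplicating positives with a little Gaussian noise
X_pos_aug = np.vstack([X_pos] + [X_pos + np.random.normal(0, 0.01, X_pos.shape) for _ in range(9)])
X = np.vstack([X_pos_aug, X_neg])
y = np.concatenate([np.ones(len(X_pos_aug)), np.zeros(len(X_neg))])

# (D) Weight the positive class more heavily so misclassifying paid users costs more
clf = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 10}, n_jobs=-1)
clf.fit(X, y)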

Reference: Bagging and Random Forest for Imbalanced Classification

Surviving in a Random Forest with Imbalanced Datasets

machine learning – random forest for imbalanced data? – Cross Validated

Biased Random Forest For Dealing With the Class Imbalance Problem

Question #28

While reviewing the histogram of residuals on regression evaluation data, a Machine Learning Specialist notices that the residuals do not form a zero-centered bell shape, as shown.

What does this mean?

  • A . The model might have prediction errors over a range of target values.
  • B . The dataset cannot be accurately represented using the regression model
  • C . There are too many variables in the model
  • D . The model is predicting its target values perfectly.

Correct Answer: A

Explanation:

Residuals are the differences between the actual and predicted values of the target variable in a regression model. A histogram of residuals is a graphical tool that can help evaluate the performance and assumptions of the model. Ideally, the histogram of residuals should have a zero-centered bell shape, which indicates that the residuals are normally distributed with a mean of zero and a constant variance. This means that the model has captured the true relationship between the input and output variables, and that the errors are random and unbiased. However, if the histogram of residuals does not have a zero-centered bell shape, as shown in the image, this means that the model might have prediction errors over a range of target values. This is because the residuals do not form a symmetrical and homogeneous distribution around zero, which implies that the model has some systematic bias or heteroscedasticity. This can affect the accuracy and validity of the model, and indicate that the model needs to be improved or modified.

Reference: Residual Analysis in Regression – Statistics By Jim

How to Check Residual Plots for Regression Analysis – dummies

Histogram of Residuals – Statistics How To

Question #29

During mini-batch training of a neural network for a classification problem, a Data Scientist notices that training accuracy oscillates.

What is the MOST likely cause of this issue?

  • A . The class distribution in the dataset is imbalanced
  • B . Dataset shuffling is disabled
  • C . The batch size is too big
  • D . The learning rate is very high

Correct Answer: D

Explanation:

Mini-batch gradient descent is a variant of gradient descent that updates the model parameters using a subset of the training data (called a mini-batch) at each iteration. The learning rate is a hyperparameter that controls how much the model parameters change in response to the gradient. If the learning rate is very high, the model parameters may overshoot the optimal values and oscillate around the minimum of the cost function. This can cause the training accuracy to fluctuate and prevent the model from converging to a stable solution. To avoid this issue, the learning rate should be chosen carefully, such as by using a learning rate decay schedule or an adaptive learning rate algorithm. Alternatively, the batch size can be increased to reduce the variance of the gradient estimates. However, the batch size should not be too big, as this can slow down the training process and reduce the generalization ability of the model. Dataset shuffling and class distribution are not likely to cause oscillations in training accuracy, as they do not affect the gradient updates directly. Dataset shuffling can help avoid getting stuck in local minima and improve the convergence speed of mini-batch gradient descent. Class distribution can affect the performance and fairness of the model, especially if the dataset is imbalanced, but it does not necessarily cause fluctuations in training accuracy.
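
For illustration, a short Keras sketch of a decaying learning rate schedule, which helps keep mini-batch updates from overshooting and oscillating; the initial rate and decay settings are illustrative assumptions:

import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)  # pass to model.compile(...)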

Question #30

A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset, 2 TB in size, and is using the SageMaker k-means algorithm. The observed issues include the unacceptable length of time it takes before the training job launches and poor I/O throughput while training the model.

What should the Specialist do to address the performance issues with the current solution?

  • A . Use the SageMaker batch transform feature
  • B . Compress the training data into Apache Parquet format.
  • C . Ensure that the input mode for the training job is set to Pipe.
  • D . Copy the training dataset to an Amazon EFS volume mounted on the SageMaker instance.

Correct Answer: C

Explanation:

The input mode for the training job determines how the training data is transferred from Amazon S3 to the SageMaker instance. There are two input modes: File and Pipe. File mode copies the entire training dataset from S3 to the local file system of the instance before starting the training job. This can cause a long delay before the training job launches, especially if the dataset is large. Pipe mode streams the data from S3 to the instance as the training job runs. This can reduce the startup time and improve the I/O throughput, as the data is read in smaller batches. Therefore, to address the performance issues with the current solution, the Specialist should ensure that the input mode for the training job is set to Pipe. This can be done by using the SageMaker Python SDK and setting the input_mode parameter to Pipe when creating the estimator or calling the fit method. Alternatively, this can be done by using the AWS CLI and setting the InputMode parameter to Pipe when creating the training job.
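
A hedged SageMaker Python SDK sketch of setting Pipe input mode; the role ARN, bucket paths, instance sizes, and content type are illustrative assumptions:

import sagemaker
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # assumed role ARN

kmeans_image = sagemaker.image_uris.retrieve("kmeans", session.boto_region_name)
estimator = sagemaker.estimator.Estimator(
    image_uri=kmeans_image,
    role=role,
    instance_count=2,
    instance_type="ml.m5.2xlarge",
    input_mode="Pipe",                          # stream data from S3 instead of copying it first
    output_path="s3://example-bucket/output/",  # assumed bucket
)
estimator.fit({"train": TrainingInput("s3://example-bucket/train/",
                                      content_type="application/x-recordio-protobuf")})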

Reference: Access Training Data – Amazon SageMaker

Choosing Data Input Mode Using the SageMaker Python SDK – Amazon SageMaker

CreateTrainingJob – Amazon SageMaker Service

Question #31

A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes.

Which function will produce the desired output?

  • A . Dropout
  • B . Smooth L1 loss
  • C . Softmax
  • D . Rectified linear units (ReLU)

Correct Answer: C

Explanation:

The softmax function is a function that can transform a vector of arbitrary real values into a vector of real values in the range (0,1) that sum to 1. This means that the softmax function can produce a valid probability distribution over multiple classes. The softmax function is often used as the activation function of the output layer in a neural network, especially for multi-class classification problems. The softmax function can assign higher probabilities to the classes with higher scores, which allows the network to make predictions based on the most likely class. In this case, the Machine Learning Specialist wants to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes of animals. Therefore, the softmax function is the most suitable function to produce the desired output.
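
For illustration, a short NumPy sketch of the softmax calculation on a made-up vector of 10 class scores:

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1, -1.2, 0.5, 0.0, 3.3, 1.1, -0.4, 0.7])  # 10 class scores
probs = softmax(scores)
print(probs, probs.sum())   # probabilities over the 10 animal classes, summing to 1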

Reference: Softmax Activation Function for Deep Learning: A Complete Guide

What is Softmax in Machine Learning? – reason.town

machine learning – Why is the softmax function often used as activation …

Multi-Class Neural Networks: Softmax | Machine Learning | Google for …

Question #32

A Machine Learning Specialist is building a model that will perform time series forecasting using Amazon SageMaker. The Specialist has finished training the model and is now planning to perform load testing on the endpoint so they can configure Auto Scaling for the model variant.

Which approach will allow the Specialist to review the latency, memory utilization, and CPU utilization during the load test?

  • A . Review SageMaker logs that have been written to Amazon S3 by leveraging Amazon Athena and Amazon QuickSight to visualize logs as they are being produced
  • B . Generate an Amazon CloudWatch dashboard to create a single view for the latency, memory utilization, and CPU utilization metrics that are outputted by Amazon SageMaker
  • C . Build custom Amazon CloudWatch Logs and then leverage Amazon ES and Kibana to query and visualize the data as it is generated by Amazon SageMaker
  • D . Send Amazon CloudWatch Logs that were generated by Amazon SageMaker to Amazon ES and use Kibana to query and visualize the log data.

Correct Answer: B

Explanation:

Amazon CloudWatch is a service that can monitor and collect various metrics and logs from AWS resources, such as Amazon SageMaker. Amazon CloudWatch can also generate dashboards to create a single view for the metrics and logs that are of interest. By using Amazon CloudWatch, the Machine Learning Specialist can review the latency, memory utilization, and CPU utilization during the load test, as these are some of the metrics that are outputted by Amazon SageMaker. The Specialist can create a custom dashboard that displays these metrics in different widgets, such as graphs, tables, or text. The dashboard can also be configured to refresh automatically and show the latest data as the load test is running. This approach will allow the Specialist to monitor the performance and resource utilization of the model variant and adjust the Auto Scaling configuration accordingly.
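
A rough sketch of such a dashboard created with boto3 is shown below; the endpoint name, variant name, and region are placeholders, and the namespaces used (AWS/SageMaker for invocation metrics, /aws/sagemaker/Endpoints for instance utilization) reflect the metrics SageMaker publishes for endpoints.

import json
import boto3

cloudwatch = boto3.client("cloudwatch")

endpoint = "forecast-endpoint"  # placeholder endpoint name
variant = "AllTraffic"          # placeholder production variant name

dashboard_body = {
    "widgets": [
        {
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Model latency",
                "metrics": [["AWS/SageMaker", "ModelLatency",
                             "EndpointName", endpoint, "VariantName", variant]],
                "stat": "Average", "period": 60, "region": "us-east-1",
            },
        },
        {
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "CPU and memory utilization",
                "metrics": [
                    ["/aws/sagemaker/Endpoints", "CPUUtilization",
                     "EndpointName", endpoint, "VariantName", variant],
                    ["/aws/sagemaker/Endpoints", "MemoryUtilization",
                     "EndpointName", endpoint, "VariantName", variant],
                ],
                "stat": "Average", "period": 60, "region": "us-east-1",
            },
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="sagemaker-load-test",
    DashboardBody=json.dumps(dashboard_body),
)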

Reference: Monitoring Amazon SageMaker with Amazon CloudWatch – Amazon SageMaker

Using Amazon CloudWatch Dashboards – Amazon CloudWatch

Create a CloudWatch Dashboard – Amazon CloudWatch

Question #33

An Amazon SageMaker notebook instance is launched into Amazon VPC. The SageMaker notebook references data contained in an Amazon S3 bucket in another account. The bucket is encrypted using SSE-KMS. The instance returns an access denied error when trying to access data in Amazon S3.

Which of the following are required to access the bucket and avoid the access denied error? (Select THREE)

  • A . An AWS KMS key policy that allows access to the customer master key (CMK)
  • B . A SageMaker notebook security group that allows access to Amazon S3
  • C . An IAM role that allows access to the specific S3 bucket
  • D . A permissive S3 bucket policy
  • E . An S3 bucket owner that matches the notebook owner
  • F . A SageMaker notebook subnet ACL that allows traffic to Amazon S3.

Correct Answer: A, B, C

Explanation:

To access an Amazon S3 bucket in another account that is encrypted using SSE-KMS, the following are required:

A) An AWS KMS key policy that allows access to the customer master key (CMK). The CMK is the encryption key that is used to encrypt and decrypt the data in the S3 bucket. The KMS key policy defines who can use and manage the CMK. To allow access to the CMK from another account, the key policy must include a statement that grants the necessary permissions (such as kms:Decrypt) to the principal from the other account (such as the SageMaker notebook IAM role).

B) A SageMaker notebook security group that allows access to Amazon S3. A security group is a virtual firewall that controls the inbound and outbound traffic for the SageMaker notebook instance. To allow the notebook instance to access the S3 bucket, the security group must have a rule that allows outbound traffic to the S3 endpoint on port 443 (HTTPS).

C) An IAM role that allows access to the specific S3 bucket. An IAM role is an identity that can be assumed by the SageMaker notebook instance to access AWS resources. The IAM role must have a policy that grants the necessary permissions (such as s3:GetObject) to access the specific S3 bucket. The policy must also include a condition that allows access to the CMK in the other account.
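
The two policy pieces (A and C) can be sketched as Python dictionaries as shown below; every account ID, role name, bucket name, and key ARN is a placeholder. The security group rule from option B is configured on the notebook's VPC security group (outbound HTTPS to S3) and is not shown here.

notebook_role_arn = "arn:aws:iam::111111111111:role/SageMakerNotebookRole"  # placeholder

# A: statement added to the KMS key policy in the bucket owner's account (222222222222)
kms_key_policy_statement = {
    "Sid": "AllowNotebookRoleToUseKey",
    "Effect": "Allow",
    "Principal": {"AWS": notebook_role_arn},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}

# C: policy attached to the notebook's IAM role in account 111111111111
notebook_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::shared-training-data",
                "arn:aws:s3:::shared-training-data/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:DescribeKey"],
            "Resource": "arn:aws:kms:us-east-1:222222222222:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        },
    ],
}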

The following are not required or correct:

D) A permissive S3 bucket policy. A bucket policy is a resource-based policy that defines who can access the S3 bucket and what actions they can perform. A permissive bucket policy is not required and not recommended, as it can expose the bucket to unauthorized access. A bucket policy should follow the principle of least privilege and grant the minimum permissions necessary to the specific principals that need access.

E) An S3 bucket owner that matches the notebook owner. The S3 bucket owner and the notebook owner do not need to match, as long as the bucket owner grants cross-account access to the notebook owner through the KMS key policy and the bucket policy (if applicable).

F) A SageMaker notebook subnet ACL that allows traffic to Amazon S3. A subnet ACL is a network access control list that acts as an optional layer of security for the SageMaker notebook instance’s subnet. A subnet ACL is not required to access the S3 bucket, as the security group is sufficient to control the traffic. However, if a subnet ACL is used, it must not block the traffic to the S3 endpoint.

Question #34

A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance.

How should the records be stored in Amazon S3 to improve query performance?

  • A . CSV files
  • B . Parquet files
  • C . Compressed JSON
  • D . RecordIO

Correct Answer: B

Explanation:

Parquet is a columnar storage format that can store data in a compressed and efficient way. Parquet files can improve query performance by reducing the amount of data that needs to be scanned, as only the relevant columns are read from the files. Parquet files can also support predicate pushdown, which means that the filtering conditions are applied at the storage level, further reducing the data that needs to be processed. Parquet files are compatible with Amazon Athena, which can leverage the benefits of the columnar format and provide faster and cheaper queries. Therefore, the records should be stored in Parquet files in Amazon S3 to improve query performance.
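
As a small illustration of the conversion step (column names and paths are hypothetical, and the real 1 TB-per-minute workload would run on a distributed engine such as AWS Glue or Amazon EMR rather than pandas), metrics records can be written as partitioned, compressed Parquet:

import pandas as pd

# Hypothetical batch of metric records
df = pd.read_csv("metrics_batch.csv", parse_dates=["timestamp"])
df["dt"] = df["timestamp"].dt.date.astype(str)

# Snappy-compressed Parquet partitioned by date, so Athena scans only relevant partitions.
df.to_parquet(
    "s3://metrics-datalake/parquet/",  # writing to S3 requires pyarrow and s3fs
    engine="pyarrow",
    compression="snappy",
    partition_cols=["dt"],
)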

Reference: Columnar Storage Formats – Amazon Athena

Parquet SerDe – Amazon Athena

Optimizing Amazon Athena Queries – Amazon Athena

Parquet – Apache Software Foundation

Question #35

A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data.

Which type of data repository is the MOST cost-effective solution?

  • A . An Amazon EBS-backed Amazon EC2 instance with hourly directories
  • B . An Amazon RDS database with hourly table partitions
  • C . An Amazon S3 data lake with hourly object prefixes
  • D . An Amazon EMR cluster with hourly hive partitions on Amazon EBS volumes

Correct Answer: C

Explanation:

An Amazon S3 data lake is a cost-effective solution for storing and analyzing large amounts of time-based training data for a new model. Amazon S3 is a highly scalable, durable, and secure object storage service that can store any amount of data in any format. Amazon S3 also offers low-cost storage classes, such as S3 Standard-IA and S3 One Zone-IA, that can reduce the storage costs for infrequently accessed data. By using hourly object prefixes, the Machine Learning Specialist can organize the data into logical partitions based on the time of ingestion. This can enable efficient data access and management, as well as support incremental updates and deletes. The Specialist can also use Amazon S3 lifecycle policies to automatically transition the data to lower-cost storage classes or delete the data after a certain period of time. This way, the Specialist can always train on the last 24 hours of the data and optimize the storage costs.
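
A minimal sketch of selecting the last 24 hours of objects is shown below; the bucket name and the records/YYYY/MM/DD/HH/ prefix layout are assumptions made for illustration.

from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
bucket = "training-data-lake"  # placeholder bucket name
now = datetime.now(timezone.utc)

# One prefix per hour for the last 24 hours, e.g. records/2024/01/01/13/
prefixes = [(now - timedelta(hours=h)).strftime("records/%Y/%m/%d/%H/") for h in range(24)]

training_keys = []
paginator = s3.get_paginator("list_objects_v2")
for prefix in prefixes:
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        training_keys.extend(obj["Key"] for obj in page.get("Contents", []))

print(f"{len(training_keys)} objects found for the last 24 hours of training data")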

Reference: What is a data lake? – Amazon Web Services

Amazon S3 Storage Classes – Amazon Simple Storage Service

Managing your storage lifecycle – Amazon Simple Storage Service

Best Practices Design Patterns: Optimizing Amazon S3 Performance

Question #36

A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily.

Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?

  • A . Require that the stores switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation
  • B . Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3
  • C . Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3.
  • D . Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.

Correct Answer: D

Explanation:

Amazon Kinesis Data Analytics is a service that can analyze streaming data in real time using SQL or Apache Flink applications. It can also use machine learning algorithms, such as Random Cut Forest (RCF), to perform anomaly detection on streaming data. By inserting a Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream, the retail chain can transform the raw record attributes into simple transformed values using SQL queries. This can be done without changing the existing data ingestion process or deploying additional resources. The transformed records can then be outputted to another Kinesis Data Firehose stream that delivers them to Amazon S3 for training the machine learning model. This approach will require the least amount of development effort, as it leverages the existing Kinesis Data Firehose stream and the built-in SQL capabilities of Kinesis Data Analytics.

Reference: Amazon Kinesis Data Analytics – Amazon Web Services

Anomaly Detection with Amazon Kinesis Data Analytics – Amazon Web Services

Amazon Kinesis Data Firehose – Amazon Web Services

Amazon S3 – Amazon Web Services

Question #37

A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality in parts per million of contaminants for the next 2 days in the city. As this is a prototype, only daily data from the last year is available.

Which model is MOST likely to provide the best results in Amazon SageMaker?

  • A . Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
  • B . Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.
  • C . Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
  • D . Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier.

Correct Answer: A

Explanation:

The Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm is a supervised learning algorithm that can perform both classification and regression tasks. It can also handle time series data, such as the air quality data in this case. The kNN algorithm works by finding the k most similar instances in the training data to a given query instance, and then predicting the output based on the average or majority of the outputs of the k nearest neighbors. The kNN algorithm can be configured to use different distance metrics, such as Euclidean or cosine, to measure the similarity between instances.

To use the kNN algorithm on the single time series consisting of the full year of data, the Machine Learning Specialist needs to set the predictor_type parameter to regressor, as the output variable (air quality in parts per million of contaminates) is a continuous value. The kNN algorithm can then forecast the air quality for the next 2 days by finding the k most similar days in the past year and averaging their air quality values.
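
A minimal sketch of this configuration with the SageMaker Python SDK follows; the role ARN, bucket paths, and hyperparameter values are placeholders, and it assumes the daily series has already been converted into (lag features, target) records in a format the algorithm accepts.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

knn = Estimator(
    image_uri=image_uris.retrieve("knn", region),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/knn-air-quality/",
    sagemaker_session=session,
)

knn.set_hyperparameters(
    predictor_type="regressor",  # parts per million is a continuous target
    feature_dim=7,               # e.g. the previous 7 daily readings as lag features
    k=10,
    sample_size=365,             # the full year of daily observations
)

knn.fit({"train": "s3://my-bucket/knn-air-quality/train/"})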

Reference: Amazon SageMaker k-Nearest-Neighbors (kNN) Algorithm – Amazon SageMaker

Time Series Forecasting using k-Nearest Neighbors (kNN) in Python | by …

Time Series Forecasting with k-Nearest Neighbors | by Nishant Malik …

Question #38

For the given confusion matrix, what is the recall and precision of the model?

  • A . Recall = 0.92 Precision = 0.84
  • B . Recall = 0.84 Precision = 0.8
  • C . Recall = 0.92 Precision = 0.8
  • D . Recall = 0.8 Precision = 0.92

Correct Answer: C

Explanation:

Recall and precision are two metrics that can be used to evaluate the performance of a classification model. Recall is the ratio of true positives to the total number of actual positives, which measures how well the model can identify all the relevant cases. Precision is the ratio of true positives to the total number of predicted positives, which measures how accurate the model is when it makes a positive prediction.

Based on the confusion matrix in the image, we can calculate the recall and precision as follows:

Recall = TP / (TP + FN) = 12 / (12 + 1) = 0.92

Precision = TP / (TP + FP) = 12 / (12 + 3) = 0.8

Where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. Therefore, the recall and precision of the model are 0.92 and 0.8, respectively.
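
The same arithmetic in a couple of lines of Python, using the counts from the explanation above:

# TP = 12, FN = 1, FP = 3 (taken from the confusion matrix described above)
tp, fn, fp = 12, 1, 3

recall = tp / (tp + fn)     # 12 / 13 ≈ 0.92
precision = tp / (tp + fp)  # 12 / 15 = 0.80

print(f"recall={recall:.2f}, precision={precision:.2f}")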

Question #39

A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company’s website. The company is using random forests to classify how popular an article will be before it is published.

A sample of the data being used is below.

Given the dataset, the Specialist wants to convert the Day-Of_Week column to binary values.

What technique should be used to convert this column to binary values?

  • A . Binarization
  • B . One-hot encoding
  • C . Tokenization
  • D . Normalization transformation

Correct Answer: B

Explanation:

One-hot encoding is a technique that can be used to convert a categorical variable, such as the Day-Of_Week column, to binary values. One-hot encoding creates a new binary column for each unique value in the original column, and assigns a value of 1 to the column that corresponds to the value in the original column, and 0 to the rest. For example, if the original column has values Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday, one-hot encoding will create seven new columns, each representing one day of the week. If the value in the original column is Tuesday, then the column for Tuesday will have a value of 1, and the other columns will have a value of 0.

One-hot encoding can help improve the performance of machine learning models, as it eliminates the ordinal relationship between the values and creates a more informative and sparse representation of the data.
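
A short pandas sketch of this encoding is shown below; the sample rows and the Clicks column are made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "Day-Of_Week": ["Monday", "Tuesday", "Sunday", "Tuesday"],
    "Clicks": [1200, 860, 1500, 910],
})

# One binary column per unique day; each row has exactly one 1 across those columns.
encoded = pd.get_dummies(df, columns=["Day-Of_Week"], dtype=int)
print(encoded)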

Reference: One-Hot Encoding – Amazon SageMaker

One-Hot Encoding: A Simple Guide for Beginners | by Jana Schmidt …

One-Hot Encoding in Machine Learning | by Nishant Malik | Towards …

Question #40

A company has raw user and transaction data stored in Amazon S3, a MySQL database, and Amazon Redshift. A Data Scientist needs to perform an analysis by joining the three datasets from Amazon S3, MySQL, and Amazon Redshift, and then calculating the average of a few selected columns from the joined data.

Which AWS service should the Data Scientist use?

  • A . Amazon Athena
  • B . Amazon Redshift Spectrum
  • C . AWS Glue
  • D . Amazon QuickSight

Correct Answer: A

Explanation:

Amazon Athena is a serverless interactive query service that can analyze data in Amazon S3 using standard SQL. Amazon Athena can also query data from other sources, such as MySQL and Amazon Redshift, by using federated queries. Federated queries allow Amazon Athena to run SQL queries across data sources, such as relational and non-relational databases, data warehouses, and data lakes. By using Amazon Athena, the Data Scientist can perform an analysis by joining the three datasets from Amazon S3, MySQL, and Amazon Redshift, and then calculating the average of a few selected columns from the joined data. Amazon Athena can also integrate with other AWS services, such as AWS Glue and Amazon QuickSight, to provide additional features, such as data cataloging and visualization.
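
A rough sketch of submitting such a query with boto3 is shown below; the mysql_catalog and redshift_catalog data sources exist only after the corresponding Athena federated query connectors have been set up, and every database, table, and column name here is hypothetical.

import boto3

athena = boto3.client("athena")

query = """
SELECT AVG(t.amount) AS avg_amount,
       AVG(u.age)    AS avg_age
FROM   "AwsDataCatalog"."datalake"."transactions_s3" t
JOIN   "mysql_catalog"."appdb"."users"               u ON u.user_id = t.user_id
JOIN   "redshift_catalog"."dw"."user_segments"       s ON s.user_id = u.user_id
"""

response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])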

Reference:

What is Amazon Athena? – Amazon Athena

Federated Query Overview – Amazon Athena

Querying Data from Amazon S3 – Amazon Athena

Querying Data from MySQL – Amazon Athena

[Querying Data from Amazon Redshift – Amazon Athena]

Question #41

A Mobile Network Operator is building an analytics platform to analyze and optimize a company’s operations using Amazon Athena and Amazon S3.

The source systems send data in CSV format in real time. The Data Engineering team wants to transform the data to the Apache Parquet format before storing it on Amazon S3.

Which solution takes the LEAST effort to implement?

  • A . Ingest .CSV data using Apache Kafka Streams on Amazon EC2 instances and use Kafka Connect S3 to serialize data as Parquet
  • B . Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Glue to convert data into Parquet.
  • C . Ingest .CSV data using Apache Spark Structured Streaming in an Amazon EMR cluster and use Apache Spark to convert data into Parquet.
  • D . Ingest .CSV data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert data into Parquet.

Correct Answer: D

Explanation:

Amazon Kinesis Data Streams is a service that can capture, store, and process streaming data in real time. Amazon Kinesis Data Firehose is a service that can deliver streaming data to various destinations, such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Amazon Kinesis Data Firehose can also transform the data before delivering it, such as converting the data format, compressing the data, or encrypting the data. One of the supported data formats that Amazon Kinesis Data Firehose can convert to is Apache Parquet, which is a columnar storage format that can improve the performance and cost-efficiency of analytics queries. By using Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose, the Mobile Network Operator can ingest the .CSV data from the source systems and use Amazon Kinesis Data Firehose to convert the data into Parquet before storing it on Amazon S3. This solution takes the least effort to implement, as it does not require any additional resources, such as Amazon EC2 instances, Amazon EMR clusters, or Amazon Glue jobs. The solution can also leverage the built-in features of Amazon Kinesis Data Firehose, such as data buffering, batching, retry, and error handling.

Reference:

Amazon Kinesis Data Streams – Amazon Web Services

Amazon Kinesis Data Firehose – Amazon Web Services

Data Transformation – Amazon Kinesis Data Firehose

Apache Parquet – Amazon Athena

Question #42

An e-commerce company needs a customized training model to classify images of its shirts and pants products. The company needs a proof of concept in 2 to 3 days with good accuracy.

Which compute choice should the Machine Learning Specialist select to train and achieve good accuracy on the model quickly?

  • A . m5.4xlarge (general purpose)
  • B . r5.2xlarge (memory optimized)
  • C . p3.2xlarge (GPU accelerated computing)
  • D . p3.8xlarge (GPU accelerated computing)

Correct Answer: C

Explanation:

Image classification is a machine learning task that involves assigning labels to images based on their content. Image classification can be performed using various algorithms, such as convolutional neural networks (CNNs), which are a type of deep learning model that can learn to extract high-level features from images. To train a customized image classification model, the e-commerce company needs a compute choice that can support the high computational demands of deep learning and provide good accuracy on the model quickly. A GPU accelerated computing instance, such as p3.2xlarge, is a suitable choice for this task, as it can leverage the parallel processing power of GPUs to speed up the training process and reduce the training time. A p3.2xlarge instance has one NVIDIA Tesla V100 GPU, which can provide up to 125 teraflops of mixed-precision performance and 16 GB of GPU memory. A p3.2xlarge instance can also use various deep learning frameworks, such as TensorFlow, PyTorch, MXNet, etc., to build and train the image classification model. A p3.2xlarge instance is also more cost-effective than a p3.8xlarge instance, which has four NVIDIA Tesla V100 GPUs, as the latter may not be necessary for a proof of concept with a small dataset. Therefore, the Machine Learning Specialist should select p3.2xlarge as the compute choice to train and achieve good accuracy on the model quickly.
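
A minimal sketch of training the built-in image classification algorithm on a single ml.p3.2xlarge instance follows; the role ARN, bucket paths, and hyperparameter values are placeholders.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

classifier = Estimator(
    image_uri=image_uris.retrieve("image-classification", region),
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # single V100 GPU for the 2-3 day proof of concept
    output_path="s3://my-bucket/shirt-pants-poc/",
    sagemaker_session=session,
)

classifier.set_hyperparameters(
    num_classes=2,             # shirts vs. pants
    num_training_samples=20000,
    use_pretrained_model=1,    # fine-tune a pretrained network to reach good accuracy quickly
    epochs=10,
    mini_batch_size=64,
)

classifier.fit({
    "train": "s3://my-bucket/shirt-pants-poc/train/",
    "validation": "s3://my-bucket/shirt-pants-poc/validation/",
})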

Reference:

Amazon EC2 P3 Instances – Amazon Web Services

Image Classification – Amazon SageMaker

Convolutional Neural Networks – Amazon SageMaker

Deep Learning AMIs – Amazon Web Services

Question #43

A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers.

Currently, the company has the following data in Amazon Aurora:

• Profiles for all past and existing customers

• Profiles for all past and existing insured pets

• Policy-level information

• Premiums received

• Claims paid

What steps should be taken to implement a machine learning model to identify potential new customers on social media?

  • A . Use regression on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
  • B . Use clustering on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
  • C . Use a recommendation engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media
  • D . Use a decision tree classifier engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media

Correct Answer: B

Explanation:

Clustering is a machine learning technique that can group data points into clusters based on their similarity or proximity. Clustering can help discover the underlying structure and patterns in the data, as well as identify outliers or anomalies. Clustering can also be used for customer segmentation, which is the process of dividing customers into groups based on their characteristics, behaviors, preferences, or needs. Customer segmentation can help understand the key features and needs of different customer segments, as well as design and implement targeted marketing campaigns for each segment. In this case, the Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. To do this, the Manager can use clustering on customer profile data to understand the key characteristics of consumer segments, such as their demographics, pet types, policy preferences, premiums paid, claims made, etc. The Manager can then find similar profiles on social media, such as Facebook, Twitter, Instagram, etc., by using the cluster features as filters or keywords. The Manager can then target these potential new customers with personalized and relevant ads or offers that match their segment’s needs and interests. This way, the Manager can implement a machine learning model to identify potential new customers on social media.
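
A small scikit-learn sketch of the clustering step is shown below; the profile features and values are made up, and in practice the segments would be profiled in more depth before being translated into social media targeting criteria.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer-profile features extracted from Amazon Aurora
profiles = pd.DataFrame({
    "age": [34, 52, 28, 45, 61, 39],
    "num_pets": [1, 2, 1, 3, 2, 1],
    "annual_premium": [420, 780, 390, 950, 810, 520],
    "claims_paid": [0, 2, 0, 3, 1, 1],
})

# Standardize so no single feature dominates the distance calculation.
features = StandardScaler().fit_transform(profiles)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
profiles["segment"] = kmeans.fit_predict(features)

# Per-segment averages describe the key characteristics used to find similar profiles.
print(profiles.groupby("segment").mean())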

Question #44

A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet.

How should the company set up the job?

  • A . Launch the notebook instances in a public subnet and access the data through the public S3 endpoint
  • B . Launch the notebook instances in a private subnet and access the data through a NAT gateway
  • C . Launch the notebook instances in a public subnet and access the data through a NAT gateway
  • D . Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.

Correct Answer: D

Explanation:

A private subnet is a subnet that does not have a route to the internet gateway, which means that the resources in the private subnet cannot access the internet or be accessed from the internet. An S3 VPC endpoint is a gateway endpoint that allows the resources in the VPC to access the S3 service without going through the internet. By launching the notebook instances in a private subnet and accessing the data through an S3 VPC endpoint, the company can set up the job in a secure and compliant way, as the data never leaves the AWS network and is not exposed to the internet. This can also improve the performance and reliability of the data transfer, as the traffic does not depend on the internet bandwidth or availability.
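
A minimal sketch of creating the gateway endpoint with boto3 follows; the region, VPC ID, and route table ID are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",  # S3 gateway endpoint for the region
    RouteTableIds=["rtb-0123456789abcdef0"],   # route table of the private subnet
)
print(response["VpcEndpoint"]["VpcEndpointId"])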

Reference:

Amazon VPC Endpoints – Amazon Virtual Private Cloud

Endpoints for Amazon S3 – Amazon Virtual Private Cloud

Connect to SageMaker Within your VPC – Amazon SageMaker

Working with VPCs and Subnets – Amazon Virtual Private Cloud

Question #45

A Machine Learning Specialist is preparing data for training on Amazon SageMaker. The data has been transformed into a numpy.array, which appears to be negatively affecting the speed of the training.

What should the Specialist do to optimize the data for training on SageMaker?

  • A . Use the SageMaker batch transform feature to transform the training data into a DataFrame
  • B . Use AWS Glue to compress the data into the Apache Parquet format
  • C . Transform the dataset into the Recordio protobuf format
  • D . Use the SageMaker hyperparameter optimization feature to automatically optimize the data

Correct Answer: C

Explanation:

The Recordio protobuf format is a binary data format that is optimized for training on SageMaker. It allows faster data loading and lower memory usage compared to other formats such as CSV or numpy arrays. The Recordio protobuf format also supports features such as sparse input, variable-length input, and label embedding. To use the Recordio protobuf format, the data needs to be serialized and deserialized using the appropriate libraries. Some of the built-in algorithms in SageMaker support the Recordio protobuf format as a content type for training and inference.
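
A minimal sketch of the conversion using the SageMaker Python SDK helper is shown below; the data is randomly generated and the bucket and key names are placeholders.

import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Made-up training data: 1,000 examples with 50 features each
features = np.random.rand(1000, 50).astype("float32")
labels = np.random.randint(0, 2, size=1000).astype("float32")

buffer = io.BytesIO()
write_numpy_to_dense_tensor(buffer, features, labels)  # serialize to recordIO-protobuf
buffer.seek(0)

boto3.client("s3").upload_fileobj(buffer, "my-training-bucket", "train/recordio/data.pbr")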

Reference: Common Data Formats for Training

Using RecordIO Format

Content Types Supported by Built-in Algorithms

Question #46

A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models.

What should the Specialist do to initialize the model to re-train it with the custom data?

  • A . Initialize the model with random weights in all layers including the last fully connected layer
  • B . Initialize the model with pre-trained weights in all layers and replace the last fully connected layer.
  • C . Initialize the model with random weights in all layers and replace the last fully connected layer
  • D . Initialize the model with pre-trained weights in all layers including the last fully connected layer

Correct Answer: B

Explanation:

Transfer learning is a technique that allows us to use a model trained for a certain task as a starting point for a machine learning model for a different task. For image classification, a common practice is to use a pre-trained model that was trained on a large and general dataset, such as ImageNet, and then customize it for the specific task. One way to customize the model is to replace the last fully connected layer, which is responsible for the final classification, with a new layer that has the same number of units as the number of classes in the new task. This way, the model can leverage the features learned by the previous layers, which are generic and useful for many image recognition tasks, and learn to map them to the new classes. The new layer can be initialized with random weights, and the rest of the model can be initialized with the pre-trained weights. This method is also known as feature extraction, as it extracts meaningful features from the pre-trained model and uses them for the new task.
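
A minimal Keras sketch of this initialization is shown below; the number of vehicle classes and the input size are placeholders, and any deep learning framework could be used in the same way.

import tensorflow as tf

NUM_VEHICLE_CLASSES = 25  # placeholder number of make/model classes

# Pre-trained ImageNet weights in all layers except the top; the original classifier is dropped.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # can be unfrozen later for fine-tuning

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    # Replacement fully connected layer, randomly initialized for the new vehicle classes.
    tf.keras.layers.Dense(NUM_VEHICLE_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)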

Reference: Transfer learning and fine-tuning

Deep transfer learning for image classification: a survey

Question #47

A Machine Learning Specialist is developing a custom video recommendation model for an application. The dataset used to train this model is very large, with millions of data points, and is hosted in an Amazon S3 bucket. The Specialist wants to avoid loading all of this data onto an Amazon SageMaker notebook instance because it would take hours to move and will exceed the attached 5 GB Amazon EBS volume on the notebook instance.

Which approach allows the Specialist to use all the data to train the model?

  • A . Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
  • B . Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset
  • C . Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job using the full dataset from the S3 bucket using Pipe input mode.
  • D . Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train the full dataset.

Correct Answer: A

Explanation:

Pipe input mode is a feature of Amazon SageMaker that allows streaming large datasets from Amazon S3 directly to the training algorithm without downloading them to the local disk. This reduces the startup time, disk space, and cost of training jobs. Pipe input mode is supported by most of the built-in algorithms and can also be used with custom training algorithms.

To use Pipe input mode, the data needs to be in a binary format such as protobuf recordIO or TFRecord. The training code needs to use the PipeModeDataset class to read the data from the named pipe provided by SageMaker. To verify that the training code and the model parameters are working as expected, it is recommended to train locally on a smaller subset of the data before launching a full-scale training job on SageMaker. This approach is faster and more efficient than the other options, which involve either downloading the full dataset to an EC2 instance or using AWS Glue, which is not designed for training machine learning models.

Reference: Using Pipe input mode for Amazon SageMaker algorithms

Using Pipe Mode with Your Own Algorithms

PipeModeDataset Class

Question #48

A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions.

Here is an example from the dataset:

"The quck BROWN FOX jumps over the lazy dog "

Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Select THREE)

  • A . Perform part-of-speech tagging and keep the action verb and the nouns only
  • B . Normalize all words by making the sentence lowercase
  • C . Remove stop words using an English stopword dictionary.
  • D . Correct the typography on "quck" to "quick."
  • E . One-hot encode all words in the sentence
  • F . Tokenize the sentence into words.

Correct Answer: B, C, F

Explanation:

To prepare the data for Word2Vec, the Specialist needs to perform some preprocessing steps that can help reduce the noise and complexity of the data, as well as improve the quality of the embeddings.

Some of the common preprocessing steps for Word2Vec are:

Normalizing all words by making the sentence lowercase: This can help reduce the vocabulary size and treat words with different capitalizations as the same word. For example, “Fox” and “fox” should be considered as the same word, not two different words.

Removing stop words using an English stopword dictionary: Stop words are words that are very common and do not carry much semantic meaning, such as “the”, “a”, “and”, etc. Removing them can help focus on the words that are more relevant and informative for the task.

Tokenizing the sentence into words: Tokenization is the process of splitting a sentence into smaller units, such as words or subwords. This is necessary for Word2Vec, as it operates on the word level and requires a list of words as input.
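
A small pure-Python sketch of these three steps on the example sentence follows; the stopword set here is a tiny illustrative subset rather than a full English stopword dictionary.

import re

# Tiny illustrative stopword set; a real pipeline would load a complete
# English stopword dictionary (for example from NLTK).
STOPWORDS = {"the", "a", "an", "over", "and", "of"}

def preprocess(sentence):
    sentence = sentence.lower()                        # normalize case
    tokens = re.findall(r"[a-z']+", sentence)          # tokenize into words
    return [t for t in tokens if t not in STOPWORDS]   # remove stop words

print(preprocess("The quck BROWN FOX jumps over the lazy dog"))
# ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']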

The other options are not necessary or appropriate for Word2Vec:

Performing part-of-speech tagging and keeping the action verb and the nouns only: Part-of-speech tagging is the process of assigning a grammatical category to each word, such as noun, verb, adjective, etc. This can be useful for some natural language processing tasks, but not for Word2Vec,

as it can lose some important information and context by discarding other words.

Correcting the typography on “quck” to “quick”: Typo correction can be helpful for some tasks, but not for Word2Vec, as it can introduce errors and inconsistencies in the data. For example, if the typo is intentional or part of a dialect, correcting it can change the meaning or style of the sentence. Moreover, Word2Vec can learn to handle typos and variations in spelling by learning similar embeddings for them.

One-hot encoding all words in the sentence: One-hot encoding is a way of representing words as vectors of 0s and 1s, where only one element is 1 and the rest are 0. The index of the 1 element corresponds to the word’s position in the vocabulary. For example, if the vocabulary is [“cat”, “dog”, “fox”], then “cat” can be encoded as [1, 0, 0], “dog” as [0, 1, 0], and “fox” as [0, 0, 1]. This can be useful for some machine learning models, but not for Word2Vec, as it does not capture the semantic similarity and relationship between words. Word2Vec aims to learn dense and low-dimensional embeddings for words, where similar words have similar vectors.

Question #49

This graph shows the training and validation loss against the epochs for a neural network.

The network being trained is as follows:

• Two dense layers, one output neuron

• 100 neurons in each layer

• 100 epochs

• Random initialization of weights

Which technique can be used to improve model performance in terms of accuracy in the validation set?

  • A . Early stopping
  • B . Random initialization of weights with appropriate seed
  • C . Increasing the number of epochs
  • D . Adding another layer with the 100 neurons

Correct Answer: A

Explanation:

Early stopping is a technique that can be used to prevent overfitting and improve model performance on the validation set. Overfitting occurs when the model learns the training data too well and fails to generalize to new and unseen data. This can be seen in the graph, where the training loss keeps decreasing, but the validation loss starts to increase after some point. This means that the model is fitting the noise and patterns in the training data that are not relevant for the validation data. Early stopping is a way of stopping the training process before the model overfits the training data. It works by monitoring the validation loss and stopping the training when the validation loss stops decreasing or starts increasing. This way, the model is saved at the point where it has the best performance on the validation set. Early stopping can also save time and resources by reducing the number of epochs needed for training.
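
A minimal Keras sketch of early stopping on a network like the one described (two dense layers of 100 neurons, one output neuron, up to 100 epochs) is shown below; the data is synthetic and the patience value is arbitrary.

import numpy as np
import tensorflow as tf

# Synthetic stand-in data for illustration only
x_train, y_train = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
x_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when the validation loss stops improving
    patience=5,                  # allow 5 epochs without improvement before stopping
    restore_best_weights=True,   # keep the weights from the best validation epoch
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=100,
    callbacks=[early_stop],
)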

Reference: Early Stopping

How to Stop Training Deep Neural Networks At the Right Time Using Early Stopping

Question #50

A manufacturing company asks its Machine Learning Specialist to develop a model that classifies defective parts into one of eight defect types. The company has provided roughly 100,000 images per defect type for training. During the initial training of the image classification model, the Specialist notices that the validation accuracy is 80%, while the training accuracy is 90%. It is known that human-level performance for this type of image classification is around 90%.

What should the Specialist consider to fix this issue?

  • A . A longer training time
  • B . Making the network larger
  • C . Using a different optimizer
  • D . Using some form of regularization

Correct Answer: D

Explanation:

Regularization is a technique that can be used to prevent overfitting and improve model performance on unseen data. Overfitting occurs when the model learns the training data too well and fails to generalize to new and unseen data. This can be seen in the question, where the validation accuracy is lower than the training accuracy, and both are lower than the human-level performance. Regularization is a way of adding some constraints or penalties to the model to reduce its complexity and prevent it from memorizing the training data.

Some common forms of regularization for image classification are:

Weight decay: Adding a term to the loss function that penalizes large weights in the model. This can help reduce the variance and noise in the model and make it more robust to small changes in the input.

Dropout: Randomly dropping out some units or connections in the model during training. This can help reduce the co-dependency among the units and make the model more resilient to missing or corrupted features.

Data augmentation: Artificially increasing the size and diversity of the training data by applying random transformations, such as cropping, flipping, rotating, scaling, etc. This can help the model learn more invariant and generalizable features and reduce the risk of overfitting to specific patterns in the training data.
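
A minimal Keras sketch combining the three forms of regularization is shown below; the input size, regularization rates, and architecture are placeholders rather than a tuned design.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

NUM_DEFECT_TYPES = 8

model = tf.keras.Sequential([
    # Data augmentation: random transformations applied only during training
    layers.RandomFlip("horizontal", input_shape=(128, 128, 3)),
    layers.RandomRotation(0.1),

    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # weight decay
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D(),

    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                                      # dropout
    layers.Dense(NUM_DEFECT_TYPES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])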

The other options are not likely to fix the issue of overfitting, and may even worsen it:

A longer training time: This can lead to more overfitting, as the model will have more chances to fit the noise and details in the training data that are not relevant for the validation data.

Making the network larger: This can increase the model capacity and complexity, which can also lead to more overfitting, as the model will have more parameters to learn and adjust to the training data.

Using a different optimizer: This can affect the speed and stability of the training process, but not necessarily the generalization ability of the model. The choice of optimizer depends on the characteristics of the data and the model, and there is no guarantee that a different optimizer will prevent overfitting.

Reference:

Regularization (machine learning)

Image Classification: Regularization

How to Reduce Overfitting With Dropout Regularization in Keras

Question #51

Example Corp has an annual sale event from October to December. The company has sequential sales data from the past 15 years and wants to use Amazon ML to predict the sales for this year’s upcoming event.

Which method should Example Corp use to split the data into a training dataset and evaluation dataset?

  • A . Pre-split the data before uploading to Amazon S3
  • B . Have Amazon ML split the data randomly.
  • C . Have Amazon ML split the data sequentially.
  • D . Perform custom cross-validation on the data

Correct Answer: C

Explanation:

A sequential split is a method of splitting data into training and evaluation datasets while preserving the order of the data records. This method is useful when the data has a temporal or sequential structure, and the order of the data matters for the prediction task. For example, if the data contains sales data for different months or years, and the goal is to predict the sales for the next month or year, a sequential split can ensure that the training data comes from the earlier period and the evaluation data comes from the later period. This can help avoid data leakage, which occurs when the training data contains information from the future that is not available at the time of prediction. A sequential split can also help evaluate the model performance on the most recent data, which may be more relevant and representative of the future data.

In this question, Example Corp has sequential sales data from the past 15 years and wants to use Amazon ML to predict the sales for this year’s upcoming annual sale event. A sequential split is the most appropriate method for splitting the data, as it can preserve the order of the data and prevent data leakage. For example, Example Corp can use the data from the first 14 years as the training dataset, and the data from the last year as the evaluation dataset. This way, the model can learn from the historical data and be tested on the most recent data.

Amazon ML provides an option to split the data sequentially when creating the training and evaluation data sources. To use this option, Example Corp can specify the percentage of the data to use for training and evaluation, and Amazon ML will use the first part of the data for training and the remaining part of the data for evaluation. For more information, see Splitting Your Data – Amazon Machine Learning.
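
A small pandas sketch of a sequential split is shown below; the column names and values are made up, but the idea is the same: the first 14 years train the model and the most recent year evaluates it.

import pandas as pd

# Hypothetical sales records ordered by year (2009 through 2023, three rows per year)
sales = pd.DataFrame({
    "event_year": [2009 + i for i in range(15) for _ in range(3)],
    "daily_sales": range(45),
}).sort_values("event_year")

# Sequential split: no future information leaks into the training set.
train = sales[sales["event_year"] < 2023]
evaluation = sales[sales["event_year"] == 2023]

print(len(train), "training rows,", len(evaluation), "evaluation rows")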

Question #52

A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team.

Which solution requires the LEAST coding effort?

  • A . Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3 Give the Business team read-only access to S3
  • B . Generate daily precision-recall data in Amazon QuickSight, and publish the results in a dashboard shared with the Business team
  • C . Run a daily Amazon EMR workflow to generate precision-recall data, and save the results in Amazon S3 Visualize the arrays in Amazon QuickSight, and publish them in a dashboard shared with the Business team
  • D . Generate daily precision-recall data in Amazon ES, and publish the results in a dashboard shared with the Business team.

Correct Answer: C

Explanation:

A precision-recall curve is a plot that shows the trade-off between the precision and recall of a binary classifier as the decision threshold is varied. It is a useful tool for evaluating and comparing the performance of different models. To generate a precision-recall curve, the following steps are needed:

Calculate the precision and recall values for different threshold values using the predictions and the true labels of the data.

Plot the precision values on the y-axis and the recall values on the x-axis for each threshold value. Optionally, calculate the area under the curve (AUC) as a summary metric of the model performance.

Among the four options, option C requires the least coding effort to generate and share a visualization of the daily precision-recall curve from the predictions.

This option involves the following steps:

Run a daily Amazon EMR workflow to generate precision-recall data: Amazon EMR is a service that allows running big data frameworks, such as Apache Spark, on a managed cluster of EC2 instances. Amazon EMR can handle large-scale data processing and analysis, such as calculating the precision and recall values for different threshold values from 100 TB of predictions. Amazon EMR supports various languages, such as Python, Scala, and R, for writing the code to perform the calculations.

Amazon EMR also supports scheduling workflows using Apache Airflow or AWS Step Functions, which can automate the daily execution of the code.

Save the results in Amazon S3: Amazon S3 is a service that provides scalable, durable, and secure object storage. Amazon S3 can store the precision-recall data generated by Amazon EMR in a cost-effective and accessible way. Amazon S3 supports various data formats, such as CSV, JSON, or Parquet, for storing the data. Amazon S3 also integrates with other AWS services, such as Amazon QuickSight, for further processing and visualization of the data.

Visualize the arrays in Amazon QuickSight: Amazon QuickSight is a service that provides fast, easy-to-use, and interactive business intelligence and data visualization. Amazon QuickSight can connect to Amazon S3 as a data source and import the precision-recall data into a dataset. Amazon QuickSight can then create a line chart to plot the precision-recall curve from the dataset. Amazon QuickSight also supports calculating the AUC and adding it as an annotation to the chart.

Publish them in a dashboard shared with the Business team: Amazon QuickSight allows creating and publishing dashboards that contain one or more visualizations from the datasets. Amazon QuickSight also allows sharing the dashboards with other users or groups within the same AWS account or across different AWS accounts. The Business team can access the dashboard with read-only permissions and view the daily precision-recall curve from the predictions.
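
As a rough sketch of the precision-recall data that the daily EMR job would write to Amazon S3 for QuickSight, the computation can be expressed with scikit-learn; the predictions here are synthetic and far smaller than the real 100 TB workload, which is why a distributed engine is used in practice.

import numpy as np
import pandas as pd
from sklearn.metrics import auc, precision_recall_curve

# Synthetic stand-in for one day of predictions (true label, predicted score)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.3, size=10_000), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("AUC-PR:", auc(recall, precision))

# The real job would write this CSV to Amazon S3 so that Amazon QuickSight can
# import it and plot recall (x-axis) against precision (y-axis).
pd.DataFrame({"recall": recall, "precision": precision}).to_csv(
    "precision_recall_2024-01-01.csv", index=False
)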

The other options require more coding effort than option C for the following reasons:

Option A: This option requires writing code to plot the precision-recall curve from the data stored in Amazon S3, as well as creating a mechanism to share the plot with the Business team. This can involve using additional libraries or tools, such as matplotlib, seaborn, or plotly, for creating the plot, and using email, web, or cloud services, such as AWS Lambda or Amazon SNS, for sharing the plot.

Option B: This option requires transforming the predictions into a format that Amazon QuickSight can recognize and import as a data source, such as CSV, JSON, or Parquet. This can involve writing code to process and convert the predictions, as well as uploading them to a storage service, such as Amazon S3 or Amazon Redshift, that Amazon QuickSight can connect to.

Option D: This option requires writing code to generate precision-recall data in Amazon ES, as well as creating a dashboard to visualize the data. Amazon ES is a service that provides a fully managed Elasticsearch cluster, which is mainly used for search and analytics purposes. Amazon ES is not designed for generating precision-recall data, and it requires using a specific data format, such as JSON, for storing the data. Amazon ES also requires using a tool, such as Kibana, for creating and sharing the dashboard, which can involve additional configuration and customization steps.

Reference:

Precision-Recall

What Is Amazon EMR?

What Is Amazon S3?

[What Is Amazon QuickSight?]

[What Is Amazon Elasticsearch Service?]

Question #53

A Machine Learning Specialist has built a model using Amazon SageMaker built-in algorithms and is not getting the expected accuracy. The Specialist wants to use hyperparameter optimization to increase the model’s accuracy.

Which method is the MOST repeatable and requires the LEAST amount of effort to achieve this?

  • A . Launch multiple training jobs in parallel with different hyperparameters
  • B . Create an AWS Step Functions workflow that monitors the accuracy in Amazon CloudWatch Logs and relaunches the training job with a defined list of hyperparameters
  • C . Create a hyperparameter tuning job and set the accuracy as an objective metric.
  • D . Create a random walk in the parameter space to iterate through a range of values that should be used for each individual hyperparameter

Correct Answer: C

Explanation:

A hyperparameter tuning job is a feature of Amazon SageMaker that allows automatically finding the best combination of hyperparameters for a machine learning model. Hyperparameters are high-level parameters that influence the learning process and the performance of the model, such as the learning rate, the number of layers, the regularization factor, etc. A hyperparameter tuning job works by launching multiple training jobs with different hyperparameters, evaluating the results using an objective metric, and choosing the next set of hyperparameters to try based on a search strategy. The objective metric is a measure of the quality of the model, such as accuracy, precision, recall, etc. The search strategy is a method of exploring the hyperparameter space, such as random search, grid search, or Bayesian optimization.

Among the four options, option C is the most repeatable and requires the least amount of effort to use hyperparameter optimization to increase the model’s accuracy. This option involves the following steps:

Create a hyperparameter tuning job: Amazon SageMaker provides an easy-to-use interface for creating a hyperparameter tuning job, either through the AWS Management Console, the AWS CLI, or the AWS SDKs. To create a hyperparameter tuning job, the Machine Learning Specialist needs to specify the following information:

The name and type of the algorithm to use, either a built-in algorithm or a custom algorithm.

The ranges and types of the hyperparameters to tune, such as categorical, continuous, or integer.

The name and type of the objective metric to optimize, such as accuracy, and whether to maximize or minimize it.

The resource limits for the tuning job, such as the maximum number of training jobs and the maximum parallel training jobs.

The input data channels and the output data location for the training jobs.

The configuration of the training instances, such as the instance type, the instance count, the volume size, etc.

Set the accuracy as an objective metric: For a SageMaker built-in algorithm, the objective metric can be chosen from the metrics that the algorithm already emits (for example, a validation accuracy metric). For a custom algorithm, the Specialist defines the metric name and a regular expression in the tuning job configuration and ensures that the training code prints the accuracy value to stdout or stderr, for example a line such as accuracy=0.92 at the end of each epoch.

Amazon SageMaker reads the accuracy value from the training logs and uses it to evaluate and compare the training jobs.
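
A minimal sketch of such a tuning job with the SageMaker Python SDK follows; the training image, role ARN, bucket paths, metric name, regular expression, and hyperparameter ranges are all placeholders for a custom container that prints a line such as accuracy=0.92.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/tuning-output/",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    # Regex applied to the training logs; assumes the code prints "accuracy=<value>"
    metric_definitions=[{"Name": "validation:accuracy", "Regex": "accuracy=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "num_layers": IntegerParameter(2, 8),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})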

The other options are not as repeatable and require more effort than option C for the following reasons:

Option A: This option requires manually launching multiple training jobs in parallel with different hyperparameters, which can be tedious and error-prone. It also requires manually monitoring and comparing the results of the training jobs, which can be time-consuming and subjective.

Option B: This option requires writing code to create an AWS Step Functions workflow that monitors the accuracy in Amazon CloudWatch Logs and relaunches the training job with a defined list of hyperparameters, which can be complex and challenging. It also requires maintaining and updating the list of hyperparameters, which can be inefficient and suboptimal.

Option D: This option requires writing code to create a random walk in the parameter space to iterate through a range of values that should be used for each individual hyperparameter, which can be unreliable and unpredictable. It also requires defining and implementing a stopping criterion, which can be arbitrary and inconsistent.

Reference: Automatic Model Tuning – Amazon SageMaker

Define Metrics to Monitor Model Performance


Question #54

IT leadership wants to transition a company’s existing machine learning data storage environment to AWS as a temporary ad hoc solution. The company currently uses a custom software process that heavily leverages SQL as a query language and exclusively stores generated .csv documents for machine learning.

The ideal state for the company would be a solution that allows it to continue to use the current workforce of SQL experts. The solution must also support the storage of .csv and JSON files, and be able to query over semi-structured data.

The following are high priorities for the company:

• Solution simplicity

• Fast development time

• Low cost

• High flexibility

What technologies meet the company’s requirements?

  • A . Amazon S3 and Amazon Athena
  • B . Amazon Redshift and AWS Glue
  • C . Amazon DynamoDB and DynamoDB Accelerator (DAX)
  • D . Amazon RDS and Amazon ES

Correct Answer: A

Explanation:

Amazon S3 and Amazon Athena are technologies that meet the company’s requirements for a temporary ad hoc solution for machine learning data storage and query. Amazon S3 and Amazon Athena have the following features and benefits:

Amazon S3 is a service that provides scalable, durable, and secure object storage for any type of data. Amazon S3 can store csv and JSON files, as well as other formats, and can handle large volumes of data with high availability and performance. Amazon S3 also integrates with other AWS services, such as Amazon Athena, for further processing and analysis of the data.

Amazon Athena is a service that allows querying data stored in Amazon S3 using standard SQL. Amazon Athena can query over semi-structured data, such as JSON, as well as structured data, such as csv, without requiring any loading or transformation. Amazon Athena is serverless, meaning that there is no infrastructure to manage and users only pay for the queries they run. Amazon Athena also supports the use of AWS Glue Data Catalog, which is a centralized metadata repository that can store and manage the schema and partition information of the data in Amazon S3.

Using Amazon S3 and Amazon Athena, the company can achieve the following high priorities:

Solution simplicity: Amazon S3 and Amazon Athena are easy to use and require minimal configuration and maintenance. The company can simply upload the csv and JSON files to Amazon S3 and use Amazon Athena to query them using SQL. The company does not need to worry about provisioning, scaling, or managing any servers or clusters.

Fast development time: Amazon S3 and Amazon Athena can enable the company to quickly access and analyze the data without any data preparation or loading. The company can use the existing workforce of SQL experts to write and run queries on Amazon Athena and get results in seconds or minutes.

Low cost: Amazon S3 and Amazon Athena are cost-effective and offer pay-as-you-go pricing models. Amazon S3 charges based on the amount of storage used and the number of requests made. Amazon Athena charges based on the amount of data scanned by the queries. The company can also reduce the costs by using compression, encryption, and partitioning techniques to optimize the data storage and query performance.

High flexibility: Amazon S3 and Amazon Athena are flexible and can support various data types, formats, and sources. The company can store and query any type of data in Amazon S3, such as csv, JSON, Parquet, ORC, etc. The company can also query data from multiple sources in Amazon S3, such as data lakes, data warehouses, log files, etc.
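
As a rough illustration of how little code the Amazon S3 and Amazon Athena approach requires, the sketch below submits a standard SQL query against files in Amazon S3 using boto3. The bucket, database, and table names are hypothetical, and the table is assumed to have already been registered in the AWS Glue Data Catalog (for example, by a crawler).

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database/table defined over csv and JSON files in S3.
QUERY = """
SELECT customer_id, COUNT(*) AS orders
FROM ml_catalog.transactions
GROUP BY customer_id
ORDER BY orders DESC
LIMIT 10
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ml_catalog"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```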

The other options are not as suitable as option A for the company’s requirements for the following reasons:

Option B: Amazon Redshift and AWS Glue are technologies that can be used for data warehousing and data integration, but they are not ideal for a temporary ad hoc solution. Amazon Redshift is a service that provides a fully managed, petabyte-scale data warehouse that can run complex analytical queries using SQL. AWS Glue is a service that provides a fully managed extract, transform, and load (ETL) service that can prepare and load data for analytics. However, using Amazon Redshift and AWS Glue would require more effort and cost than using Amazon S3 and Amazon Athena. The company would need to load the data from Amazon S3 to Amazon Redshift using AWS Glue, which can take time and incur additional charges. The company would also need to manage the capacity and performance of the Amazon Redshift cluster, which can be complex and expensive.

Option C: Amazon DynamoDB and DynamoDB Accelerator (DAX) are technologies that can be used for fast and scalable NoSQL database and caching, but they are not suitable for the company’s data storage and query needs. Amazon DynamoDB is a service that provides a fully managed, key-value and document database that can deliver single-digit millisecond performance at any scale.

DynamoDB Accelerator (DAX) is a service that provides a fully managed, in-memory cache for DynamoDB that can improve the read performance by up to 10 times. However, using Amazon DynamoDB and DAX would not allow the company to continue to use SQL as a query language, as Amazon DynamoDB does not support SQL. The company would need to use the DynamoDB API or the AWS SDKs to access and query the data, which can require more coding and learning effort. The company would also need to transform the csv and JSON files into DynamoDB items, which can involve additional processing and complexity.

Option D: Amazon RDS and Amazon ES are technologies that can be used for relational database and search and analytics, but they are not optimal for the company’s data storage and query scenario. Amazon RDS is a service that provides a fully managed, relational database that supports various database engines, such as MySQL, PostgreSQL, Oracle, etc. Amazon ES is a service that provides a fully managed, Elasticsearch cluster, which is mainly used for search and analytics purposes. However, using Amazon RDS and Amazon ES would not be as simple and cost-effective as using Amazon S3 and Amazon Athena. The company would need to load the data from Amazon S3 to Amazon RDS, which can take time and incur additional charges. The company would also need to manage the capacity and performance of the Amazon RDS and Amazon ES clusters, which can be complex and expensive. Moreover, Amazon RDS and Amazon ES are not designed to handle semi-structured data, such as JSON, as well as Amazon S3 and Amazon Athena.

Reference:

Amazon S3

Amazon Athena

Amazon Redshift

AWS Glue

Amazon DynamoDB

DynamoDB Accelerator (DAX)

Amazon RDS

Amazon ES

Question #55

A Machine Learning Specialist is working for a credit card processing company and receives an unbalanced dataset containing credit card transactions. It contains 99,000 valid transactions and 1,000 fraudulent transactions. The Specialist is asked to score a model that was run against the dataset. The Specialist has been advised that identifying valid transactions is equally as important as identifying fraudulent transactions.

What metric is BEST suited to score the model?

  • A . Precision
  • B . Recall
  • C . Area Under the ROC Curve (AUC)
  • D . Root Mean Square Error (RMSE)

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Area Under the ROC Curve (AUC) is a metric that is best suited to score the model for the given scenario. AUC is a measure of the performance of a binary classifier, such as a model that predicts whether a credit card transaction is valid or fraudulent. AUC is calculated based on the Receiver Operating Characteristic (ROC) curve, which is a plot that shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of the classifier as the decision threshold is varied. The TPR, also known as recall or sensitivity, is the proportion of actual positive cases (fraudulent transactions) that are correctly predicted as positive by the classifier. The FPR, also known as the fall-out, is the proportion of actual negative cases (valid transactions) that are incorrectly predicted as positive by the classifier. The ROC curve illustrates how well the classifier can distinguish between the two classes, regardless of the class distribution or the error costs. A perfect classifier would have a TPR of 1 and an FPR of 0 for all thresholds, resulting in a ROC curve that goes from the bottom left to the top left and then to the top right of the plot. A random classifier would have a TPR and an FPR that are equal for all thresholds, resulting in a ROC curve that goes from the bottom left to the top right of the plot along the diagonal line. AUC is the area under the ROC curve, and it ranges from 0 to 1. A higher AUC indicates a better classifier, as it means that the classifier has a higher TPR and a lower FPR for all thresholds. AUC is a useful metric for imbalanced classification problems, such as the credit card transaction dataset, because it is insensitive to the class imbalance and the error costs. AUC can capture the overall performance of the classifier across all possible scenarios, and it can be used to compare different classifiers based on their ROC curves.

The other options are not as suitable as AUC for the given scenario for the following reasons:

Precision: Precision is the proportion of predicted positive cases (fraudulent transactions) that are actually positive. Precision is a useful metric when the cost of a false positive is high, such as in spam detection or medical diagnosis. However, precision alone is not a good metric for this scenario, because it is evaluated at a single decision threshold and does not summarize performance on both classes. For example, a classifier that predicts every transaction as valid makes no positive predictions at all, so its precision is undefined even though its accuracy is 99%. Precision is also dependent on the decision threshold and the error costs, which may vary for different scenarios.

Recall: Recall is the same as the TPR, and it is the proportion of actual positive cases (fraudulent transactions) that are correctly predicted as positive by the classifier. Recall is a useful metric when the cost of a false negative is high, such as in fraud detection or cancer diagnosis. However, recall alone is not a good metric for this scenario, because it can be trivially maximized at the expense of the valid transactions. For example, a classifier that predicts every transaction as fraudulent would have a recall of 1, but an accuracy of only 1%. Recall is also dependent on the decision threshold and the error costs, which may vary for different scenarios.

Root Mean Square Error (RMSE): RMSE is a metric that measures the average difference between the predicted and the actual values. RMSE is a useful metric for regression problems, where the goal is to predict a continuous value, such as the price of a house or the temperature of a city. However, RMSE is not a good metric for classification problems, where the goal is to predict a discrete value, such as the class label of a transaction. RMSE is not meaningful for classification problems, because it does not capture the accuracy or the error costs of the predictions.
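
The contrast between these metrics can be seen in a small scikit-learn sketch; the synthetic dataset below simply mirrors the roughly 1% fraud rate from the question, and the model and threshold are illustrative assumptions rather than part of the exam scenario.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 99,000 valid and 1,000 fraudulent transactions (~1% positives).
X, y = make_classification(n_samples=100_000, n_features=20, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# AUC is computed from predicted probabilities, so it does not depend on a threshold.
scores = model.predict_proba(X_test)[:, 1]
print("AUC:      ", roc_auc_score(y_test, scores))

# Precision and recall are tied to one specific threshold (0.5 here).
preds = (scores >= 0.5).astype(int)
print("Precision:", precision_score(y_test, preds, zero_division=0))
print("Recall:   ", recall_score(y_test, preds))
```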

Reference:

ROC Curve and AUC

How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python

Precision-Recall

Root Mean Squared Error

Question #56

A bank’s Machine Learning team is developing an approach for credit card fraud detection. The company has a large dataset of historical data labeled as fraudulent. The goal is to build a model to take the information from new transactions and predict whether each transaction is fraudulent or not.

Which built-in Amazon SageMaker machine learning algorithm should be used for modeling this problem?

  • A . Seq2seq
  • B . XGBoost
  • C . K-means
  • D . Random Cut Forest (RCF)

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

XGBoost is a built-in Amazon SageMaker machine learning algorithm that should be used for modeling the credit card fraud detection problem. XGBoost is an algorithm that implements a scalable and distributed gradient boosting framework, which is a popular and effective technique for supervised learning problems. Gradient boosting is a method of combining multiple weak learners, such as decision trees, into a strong learner, by iteratively fitting new models to the residual errors of the previous models and adding them to the ensemble. XGBoost can handle various types of data, such as numerical, categorical, or text, and can perform both regression and classification tasks. XGBoost also supports various features and optimizations, such as regularization, missing value handling, parallelization, and cross-validation, that can improve the performance and efficiency of the algorithm.

XGBoost is suitable for the credit card fraud detection problem for the following reasons:

The problem is a binary classification problem, where the goal is to predict whether a transaction is fraudulent or not, based on the information from new transactions. XGBoost can perform binary classification by using a logistic regression objective function and outputting the probability of the positive class (fraudulent) for each transaction.

The problem involves a large and imbalanced dataset of historical data labeled as fraudulent. XGBoost can handle large-scale and imbalanced data by using distributed and parallel computing, as well as techniques such as weighted sampling, class weighting, or stratified sampling, to balance the classes and reduce the bias towards the majority class (non-fraudulent).

The problem requires a high accuracy and precision for detecting fraudulent transactions, as well as a low false positive rate for avoiding false alarms. XGBoost can achieve high accuracy and precision by using gradient boosting, which can learn complex and non-linear patterns from the data and reduce the variance and overfitting of the model. XGBoost can also achieve a low false positive rate by using regularization, which can reduce the complexity and noise of the model and prevent it from fitting spurious signals in the data.
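
A minimal sketch of how the built-in XGBoost algorithm could be configured for this problem with the SageMaker Python SDK is shown below; the IAM role, S3 paths, and hyperparameter values are hypothetical placeholders, not prescribed settings.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role ARN

# Registry path of the built-in XGBoost container for the current region.
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/fraud-model/output",  # hypothetical bucket
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    objective="binary:logistic",  # binary classification: fraudulent vs. valid
    eval_metric="auc",
    scale_pos_weight=99,          # roughly the valid/fraud ratio, to offset class imbalance
    max_depth=5,
    num_round=200,
)

xgb.fit({
    "train": TrainingInput("s3://example-bucket/fraud-model/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://example-bucket/fraud-model/validation.csv", content_type="text/csv"),
})
```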

The other options are not as suitable as XGBoost for the credit card fraud detection problem for the following reasons:

Seq2seq: Seq2seq is an algorithm that implements a sequence-to-sequence model, which is a type of neural network model that can map an input sequence to an output sequence. Seq2seq is mainly used for natural language processing tasks, such as machine translation, text summarization, or dialogue generation. Seq2seq is not suitable for the credit card fraud detection problem, because the problem is not a sequence-to-sequence task, but a binary classification task. The input and output of the problem are not sequences of words or tokens, but vectors of features and labels.

K-means: K-means is an algorithm that implements a clustering technique, which is a type of unsupervised learning method that can group similar data points into clusters. K-means is mainly used for exploratory data analysis, dimensionality reduction, or anomaly detection. K-means is not suitable for the credit card fraud detection problem, because the problem is not a clustering task, but a classification task. The problem requires using the labeled data to train a model that can predict the labels of new data, not finding the optimal number of clusters or the cluster memberships of the data.

Random Cut Forest (RCF): RCF is an algorithm that implements an anomaly detection technique, which is a type of unsupervised learning method that can identify data points that deviate from the normal behavior or distribution of the data. RCF is mainly used for detecting outliers, frauds, or faults in the data. RCF is not suitable for the credit card fraud detection problem, because the problem is not an anomaly detection task, but a classification task. The problem requires using the labeled data to train a model that can predict the labels of new data, not finding the anomaly scores or the anomalous data points in the data.

Reference:

XGBoost Algorithm

Use XGBoost for Binary Classification with Amazon SageMaker

Seq2seq Algorithm

K-means Algorithm

Random Cut Forest Algorithm

Question #57

While working on a neural network project, a Machine Learning Specialist discovers that some features in the data have very high magnitude, resulting in this data being weighted more in the cost function.

What should the Specialist do to ensure better convergence during backpropagation?

  • A . Dimensionality reduction
  • B . Data normalization
  • C . Model regularization
  • D . Data augmentation for the minority class

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

Data normalization is a data preprocessing technique that scales the features to a common range, such as [0, 1] or [-1, 1]. This helps reduce the impact of features with high magnitude on the cost function and improves the convergence during backpropagation. Data normalization can be done using different methods, such as min-max scaling, z-score standardization, or unit vector normalization. Data normalization is different from dimensionality reduction, which reduces the number of features; model regularization, which adds a penalty term to the cost function to prevent overfitting; and data augmentation, which increases the amount of data by creating synthetic samples.
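
A minimal scikit-learn sketch of the two most common approaches is shown below; the feature values are made up for illustration, and in practice the scaler would be fit on the training split only and then applied to the validation and test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales, e.g. annual income vs. a ratio in [0, 1].
X = np.array([
    [52_000.0, 0.31],
    [87_000.0, 0.55],
    [31_000.0, 0.12],
    [120_000.0, 0.78],
])

# Min-max scaling maps each feature to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# Z-score standardization gives each feature zero mean and unit variance.
print(StandardScaler().fit_transform(X))
```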

Reference: Data processing options for AI/ML | AWS Machine Learning Blog

Data preprocessing – Machine Learning Lens

How to Normalize Data Using scikit-learn in Python

Normalization | Machine Learning | Google for Developers

Question #58

An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data.

Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?

  • A . Listwise deletion
  • B . Last observation carried forward
  • C . Multiple imputation
  • D . Mean substitution

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Multiple imputation is a technique that uses machine learning to generate multiple plausible values for each missing value in a dataset, based on the observed data and the relationships among the variables. Multiple imputation preserves the integrity of the dataset by accounting for the uncertainty and variability of the missing data, and avoids the bias and loss of information that may result from other methods, such as listwise deletion, last observation carried forward, or mean substitution. Multiple imputation can improve the accuracy and validity of statistical analysis and machine learning models that use the imputed dataset.
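
The sketch below shows one way to approximate multiple imputation (in the MICE style) with scikit-learn's IterativeImputer; the synthetic dataset and the choice of five imputations are illustrative assumptions.

```python
import numpy as np
# IterativeImputer is still flagged as experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))
X[:, 3] = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1_000)  # column 3 depends on the others
X[rng.random(1_000) < 0.3, 3] = np.nan                               # 30% of column 3 is missing

# Each missing value is modeled from the other columns; sample_posterior adds the
# randomness needed to draw several plausible completed datasets (MICE-style).
imputed_datasets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Downstream analysis would be run on each completed dataset and the results pooled.
print([round(d[:, 3].mean(), 3) for d in imputed_datasets])
```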

Reference: Managing missing values in your target and related datasets with automated imputation support in Amazon Forecast

Imputation by feature importance (IBFI): A methodology to impute missing data in large datasets

Multiple Imputation by Chained Equations (MICE) Explained

Question #59

A Machine Learning Specialist discovers the following statistics while experimenting on a model.

What can the Specialist conclude from the experiments?

  • A . The model in Experiment 1 had a high variance error that was reduced in Experiment 3 by regularization. Experiment 2 shows that there is minimal bias error in Experiment 1.
  • B . The model in Experiment 1 had a high bias error that was reduced in Experiment 3 by regularization. Experiment 2 shows that there is minimal variance error in Experiment 1.
  • C . The model in Experiment 1 had a high bias error and a high variance error that were reduced in Experiment 3 by regularization. Experiment 2 shows that high bias cannot be reduced by increasing layers and neurons in the model.
  • D . The model in Experiment 1 had a high random noise error that was reduced in Experiment 3 by regularization. Experiment 2 shows that random noise cannot be reduced by increasing layers and neurons in the model.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

The model in Experiment 1 had a high variance error because it performed well on the training data (train error = 5%) but poorly on the test data (test error = 8%). This indicates that the model was overfitting the training data and not generalizing well to new data. The model in Experiment 3 had a lower variance error because it performed similarly on the training data (train error = 5.1%) and the test data (test error = 5.4%). This indicates that the model was more robust and less sensitive to fluctuations in the training data. Experiment 3 achieved this improvement by applying regularization, a technique that reduces the complexity of the model and prevents overfitting by adding a penalty term to the loss function. Experiment 2 shows that the model in Experiment 1 had minimal bias error: in Experiment 2 the number of layers and neurons was increased, which raises the complexity and flexibility of the model, yet the training error (5.2%) and test error (5.7%) remained close to those of Experiment 1. If the original model had been underfitting, the extra capacity would have improved performance noticeably. This shows that increasing the complexity of the model does not address a high variance error, and may even make it worse if the model becomes too complex for the data.

Reference: Bias Variance Tradeoff – Clearly Explained – Machine Learning Plus

The Bias-Variance Trade-off in Machine Learning – Stack Abuse

Question #60

A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis.

Which of the following services would both ingest and store this data in the correct format?

  • A . AWSDMS
  • B . Amazon Kinesis Data Streams
  • C . Amazon Kinesis Data Firehose
  • D . Amazon Kinesis Data Analytics

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Amazon Kinesis Data Firehose is a service that can ingest streaming data and store it in various destinations, including Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. Amazon Kinesis Data Firehose can also convert the incoming data to Apache Parquet or Apache ORC format before storing it in Amazon S3. This can reduce the storage cost and improve the performance of analytical queries on the data. Amazon Kinesis Data Firehose supports various data sources, such as Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka, AWS IoT, and custom applications. Amazon Kinesis Data Firehose can also apply data transformation and compression using AWS Lambda functions.
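
A sketch of creating such a delivery stream with boto3 is shown below; all names, ARNs, and the AWS Glue database and table that supply the target schema are hypothetical, and the exact buffering and IAM settings would depend on the workload.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-analytics-bucket",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming records are JSON ...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ... and are written out as Parquet before landing in S3.
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # The target schema comes from an existing AWS Glue Data Catalog table.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "analytics",
                "TableName": "clickstream",
                "Region": "us-east-1",
            },
        },
    },
)
```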

AWSDMS is not a valid service name. AWS Database Migration Service (AWS DMS) is a service that can migrate data from various sources to various targets, but it does not support streaming data or Parquet format.

Amazon Kinesis Data Streams is a service that can ingest and process streaming data in real time, but it does not store the data in any destination. Amazon Kinesis Data Streams can be integrated with Amazon Kinesis Data Firehose to store the data in Parquet format.

Amazon Kinesis Data Analytics is a service that can analyze streaming data using SQL or Apache Flink, but it does not store the data in any destination. Amazon Kinesis Data Analytics can be integrated with Amazon Kinesis Data Firehose to store the data in Parquet format.

Reference: Amazon Kinesis Data Firehose – Amazon Web Services

What Is Amazon Kinesis Data Firehose? – Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose FAQs – Amazon Web Services

Question #61

A Machine Learning Specialist needs to move and transform data in preparation for training. Some of the data needs to be processed in near-real time and other data can be moved hourly. There are existing Amazon EMR MapReduce jobs to clean the data and perform feature engineering on it.

Which of the following services can feed data to the MapReduce jobs? (Select TWO)

  • A . AWSDMS
  • B . Amazon Kinesis
  • C . AWS Data Pipeline
  • D . Amazon Athena
  • E . Amazon ES

Reveal Solution Hide Solution

Correct Answer: B, C
B, C

Explanation:

Amazon Kinesis and AWS Data Pipeline are two services that can feed data to the Amazon EMR MapReduce jobs. Amazon Kinesis is a service that can ingest, process, and analyze streaming data in real time. Amazon Kinesis can be integrated with Amazon EMR to run MapReduce jobs on streaming data sources, such as web logs, social media, IoT devices, and clickstreams. Amazon Kinesis can handle data that needs to be processed in near-real time, such as for anomaly detection, fraud detection, or dashboarding. AWS Data Pipeline is a service that can orchestrate and automate data movement and transformation across various AWS services and on-premises data sources. AWS Data Pipeline can be integrated with Amazon EMR to run MapReduce jobs on batch data sources, such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Redshift. AWS Data Pipeline can handle data that can be moved hourly, such as for data warehousing, reporting, or machine learning. AWSDMS is not a valid service name. AWS Database Migration Service (AWS DMS) is a service that can migrate data from various sources to various targets, but it does not support streaming data or MapReduce jobs.

Amazon Athena is a service that can query data stored in Amazon S3 using standard SQL, but it does not feed data to Amazon EMR or run MapReduce jobs.

Amazon ES is a service that provides a fully managed Elasticsearch cluster, which can be used for search, analytics, and visualization, but it does not feed data to Amazon EMR or run MapReduce jobs.

Reference: Using Amazon Kinesis with Amazon EMR – Amazon EMR

AWS Data Pipeline – Amazon Web Services

Using AWS Data Pipeline to Run Amazon EMR Jobs – AWS Data Pipeline

Question #62

An insurance company is developing a new device for vehicles that uses a camera to observe drivers’ behavior and alert them when they appear distracted. The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models.

During the model evaluation the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images.

Which of the following should be used to resolve this issue? (Select TWO)

  • A . Add vanishing gradient to the model
  • B . Perform data augmentation on the training data
  • C . Make the neural network architecture complex.
  • D . Use gradient checking in the model
  • E . Add L2 regularization to the model

Reveal Solution Hide Solution

Correct Answer: B, E
B, E

Explanation:

The issue described in the question is a sign of overfitting, which is a common problem in machine learning when the model learns the noise and details of the training data too well and fails to generalize to new and unseen data. Overfitting can result in a low training error rate but a high test error rate, which indicates poor performance and validity of the model. There are several techniques that can be used to prevent or reduce overfitting, such as data augmentation and regularization. Data augmentation is a technique that applies various transformations to the original training data, such as rotation, scaling, cropping, flipping, adding noise, changing brightness, etc., to create new and diverse data samples. Data augmentation can increase the size and diversity of the training data, which can help the model learn more features and patterns and reduce the variance of the model. Data augmentation is especially useful for image data, as it can simulate different scenarios and perspectives that the model may encounter in real life. For example, in the question, the device uses a camera to observe drivers’ behavior, so data augmentation can help the model deal with different lighting conditions, angles, distances, etc. Data augmentation can be done using various libraries and frameworks, such as TensorFlow, PyTorch, Keras, OpenCV, etc12
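
As a sketch of what data augmentation could look like for this image dataset, the following uses Keras's ImageDataGenerator; the directory layout, image size, and transformation ranges are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transformations simulate the lighting, angle, and framing differences the
# in-vehicle camera will encounter, without collecting new labeled images.
augmenter = ImageDataGenerator(
    rotation_range=15,             # small random rotations
    width_shift_range=0.1,         # horizontal shifts
    height_shift_range=0.1,        # vertical shifts
    zoom_range=0.1,                # zoom in and out
    brightness_range=(0.7, 1.3),   # lighting changes
    horizontal_flip=True,
    rescale=1.0 / 255,
)

train_generator = augmenter.flow_from_directory(
    "data/train",                  # hypothetical folder of labeled driver images
    target_size=(224, 224),
    batch_size=32,
    class_mode="binary",           # distracted vs. not distracted
)
```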

Regularization is a technique that adds a penalty term to the model’s objective function, which is typically based on the model’s parameters. Regularization can reduce the complexity and flexibility of the model, which can prevent overfitting by avoiding learning the noise and details of the training data. Regularization can also improve the stability and robustness of the model, as it can reduce the sensitivity of the model to small fluctuations in the data. There are different types of regularization, such as L1, L2, dropout, etc., but they all have the same goal of reducing overfitting. L2 regularization, also known as weight decay or ridge regression, is one of the most common and effective regularization techniques. L2 regularization adds the squared norm of the model’s parameters multiplied by a regularization parameter (lambda) to the model’s objective function. L2 regularization can shrink the model’s parameters towards zero, which can reduce the variance of the model and improve the generalization ability of the model. L2 regularization can be implemented using various libraries and frameworks, such as TensorFlow, PyTorch, Keras, Scikit-learn, etc34
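
A minimal Keras sketch of adding L2 regularization (alongside dropout) to a small convolutional network is shown below; the architecture and the regularization strength are illustrative assumptions rather than a prescribed solution.

```python
from tensorflow.keras import layers, models, regularizers

l2 = regularizers.l2(1e-4)  # the lambda term; larger values shrink the weights more aggressively

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2, input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),                    # dropout is another common regularizer
    layers.Dense(1, activation="sigmoid"),  # distracted vs. not distracted
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```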

The other options are not valid or relevant for resolving the issue of overfitting. Adding vanishing gradient to the model is not a technique, but a problem that occurs when the gradient of the model’s objective function becomes very small and the model stops learning. Making the neural network architecture complex is not a solution, but a possible cause of overfitting, as a complex model can have more parameters and more flexibility to fit the training data too well. Using gradient checking in the model is not a technique, but a debugging method that verifies the correctness of the gradient computation in the model. Gradient checking is not related to overfitting, but to the implementation of the model.

Question #63

The Chief Editor for a product catalog wants the Research and Development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company’s retail brand. The team has a set of training data.

Which machine learning algorithm should the researchers use that BEST meets their requirements?

  • A . Latent Dirichlet Allocation (LDA)
  • B . Recurrent neural network (RNN)
  • C . K-means
  • D . Convolutional neural network (CNN)

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

A convolutional neural network (CNN) is a type of machine learning algorithm that is suitable for image classification tasks. A CNN consists of multiple layers that can extract features from images and learn to recognize patterns and objects. A CNN can also use transfer learning to leverage pre-trained models that have been trained on large-scale image datasets, such as ImageNet, and fine-tune them for specific tasks, such as detecting the company’s retail brand. A CNN can achieve high accuracy and performance for image classification problems, as it can handle complex and diverse images and reduce the dimensionality and noise of the input data. A CNN can be implemented using various frameworks and libraries, such as TensorFlow, PyTorch, Keras, MXNet, etc12
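
A minimal transfer-learning sketch in Keras is shown below; the choice of MobileNetV2, the input size, and the classification head are illustrative assumptions rather than a prescribed architecture.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

# Reuse convolutional features learned on ImageNet and train only a small head
# that decides whether the company's brand appears in the image.
base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained feature extractor

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # wearing the brand vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.2, epochs=10)
```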

The other options are not valid or relevant for the image classification task. Latent Dirichlet Allocation (LDA) is a type of machine learning algorithm that is suitable for topic modeling tasks. LDA can discover the hidden topics and their proportions in a collection of text documents, such as news articles, tweets, reviews, etc. LDA is not applicable for image data, as it requires textual input and output. LDA can be implemented using various frameworks and libraries, such as Gensim, Scikit-learn, Mallet, etc34

Recurrent neural network (RNN) is a type of machine learning algorithm that is suitable for sequential data tasks. RNN can process and generate data that has temporal or sequential dependencies, such as natural language, speech, audio, video, etc. RNN is not optimal for image data, as it does not capture the spatial features and relationships of the pixels. RNN can be implemented using various frameworks and libraries, such as TensorFlow, PyTorch, Keras, MXNet, etc.

K-means is a type of machine learning algorithm that is suitable for clustering tasks. K-means can partition a set of data points into a predefined number of clusters, based on the similarity and distance between the data points. K-means is not suitable for image classification tasks, as it does not learn to label the images or detect the objects of interest. K-means can be implemented using various frameworks and libraries, such as Scikit-learn, TensorFlow, PyTorch, etc.

Question #64

A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours.

With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s).

Which visualization will accomplish this?

  • A . A histogram showing whether the most important input feature is Gaussian.
  • B . A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension.
  • C . A scatter plot showing the performance of the objective metric over each training iteration.
  • D . A scatter plot showing the correlation between maximum tree depth and the objective metric.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

A scatter plot showing the correlation between maximum tree depth and the objective metric is a visualization that can help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model. A scatter plot is a type of graph that displays the relationship between two variables using dots, where each dot represents one observation. A scatter plot can show the direction, strength, and shape of the correlation between the variables, as well as any outliers or clusters. In this case, the scatter plot can show how the maximum tree depth, which is a hyperparameter that controls the complexity and depth of the decision trees in the ensemble model, affects the AUC, which is the objective metric that measures the performance of the model in terms of the trade-off between true positive rate and false positive rate. By looking at the scatter plot, the Machine Learning Specialist can see if there is a positive, negative, or no correlation between the maximum tree depth and the AUC, and how strong or weak the correlation is. The Machine Learning Specialist can also see if there is an optimal value or range of values for the maximum tree depth that maximizes the AUC, or if there is a point of diminishing returns or overfitting where increasing the maximum tree depth does not improve or even worsens the AUC. Based on the scatter plot, the Machine Learning Specialist can reconfigure the input hyperparameter range(s) for the maximum tree depth to focus on the values that yield the best AUC, and avoid the values that result in poor AUC. This can decrease the amount of time and cost it takes to train the model, as the hyperparameter tuning job can explore fewer and more promising combinations of values. A scatter plot can be created using various tools and libraries, such as Matplotlib, Seaborn, Plotly, etc12
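
A sketch of producing such a plot from a completed tuning job with the SageMaker Python SDK and Matplotlib is shown below; the tuning job name is hypothetical, and it assumes max_depth was one of the tuned hyperparameters.

```python
import matplotlib.pyplot as plt
from sagemaker.analytics import HyperparameterTuningJobAnalytics

# "nightly-ctr-tuning" is a hypothetical tuning job name.
df = HyperparameterTuningJobAnalytics("nightly-ctr-tuning").dataframe()

# Each point is one training job: the max_depth it tried vs. the AUC it achieved.
plt.scatter(df["max_depth"].astype(float), df["FinalObjectiveValue"])
plt.xlabel("max_depth")
plt.ylabel("validation AUC")
plt.title("Objective metric vs. maximum tree depth")
plt.show()
```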

The other options are not valid or relevant for reconfiguring the input hyperparameter range(s) for the tree-based ensemble model. A histogram showing whether the most important input feature is Gaussian is a visualization that can help the Machine Learning Specialist understand the distribution and shape of the input data, but not the hyperparameters. A histogram is a type of graph that displays the frequency or count of values in a single variable using bars, where each bar represents a bin or interval of values. A histogram can show if the variable is symmetric, skewed, or multimodal, and if it follows a normal or Gaussian distribution, which is a bell-shaped curve that is often assumed by many machine learning algorithms. In this case, the histogram can show if the most important input feature, which is a variable that has the most influence or predictive power on the output variable, is Gaussian or not. However, this does not help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model, as the input feature is not a hyperparameter that can be tuned or optimized. A histogram can be created using various tools and libraries, such as Matplotlib, Seaborn, Plotly, etc34

A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension is a visualization that can help the Machine Learning Specialist understand the structure and clustering of the input data, but not the hyperparameters. t-SNE is a technique that can reduce the dimensionality of high-dimensional data, such as images, text, or gene expression, and project it onto a lower-dimensional space, such as two or three dimensions, while preserving the local similarities and distances between the data points. t-SNE can help visualize and explore the patterns and relationships in the data, such as the clusters, outliers, or separability of the classes. In this case, the scatter plot can show how the input variables, which are the features or predictors of the output variable, are mapped onto a two-dimensional space using t-SNE, and how the points are colored by the target variable, which is the output or response variable that the model tries to predict. However, this does not help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model, as the input variables and the target variable are not hyperparameters that can be tuned or optimized. A scatter plot with t-SNE can be created using various tools and libraries, such as Scikit-learn, TensorFlow, PyTorch, etc5

A scatter plot showing the performance of the objective metric over each training iteration is a visualization that can help the Machine Learning Specialist understand the learning curve and convergence of the model, but not the hyperparameters. A scatter plot is a type of graph that displays the relationship between two variables using dots, where each dot represents one observation. A scatter plot can show the direction, strength, and shape of the correlation between the variables, as well as any outliers or clusters. In this case, the scatter plot can show how the objective metric, which is the performance measure that the model tries to optimize, changes over each training iteration, which is the number of times that the model updates its parameters using a batch of data. A scatter plot can show if the objective metric improves, worsens, or stagnates over time, and if the model converges to a stable value or oscillates or diverges. However, this does not help the Machine Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble model, as the objective metric and the training iteration are not hyperparameters that can be tuned or optimized. A scatter plot can be created using various tools and libraries, such as Matplotlib, Seaborn, Plotly, etc.

Question #65

A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker.

When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?

  • A . Choose the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible
  • B . Specify a very large hyperparameter range to allow Amazon SageMaker to cover every possible value.
  • C . Use log-scaled hyperparameters to allow the hyperparameter space to be searched as quickly as possible
  • D . Execute only one hyperparameter tuning job at a time and improve tuning through successive rounds of experiments

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Using log-scaled hyperparameters is a guideline that can improve the automatic model tuning in Amazon SageMaker. Log-scaled hyperparameters are hyperparameters that have values that span several orders of magnitude, such as learning rate, regularization parameter, or number of hidden units. Log-scaled hyperparameters can be specified by using a log-uniform distribution, which assigns equal probability to each order of magnitude within a range. For example, a log-uniform distribution between 0.001 and 1000 can sample values such as 0.001, 0.01, 0.1, 1, 10, 100, or 1000 with equal probability. Using log-scaled hyperparameters can allow the hyperparameter optimization feature to search the hyperparameter space more efficiently and effectively, as it can explore different scales of values and avoid sampling values that are too small or too large. Using log-scaled hyperparameters can also help avoid numerical issues, such as underflow or overflow, that may occur when using linear-scaled hyperparameters. Using log-scaled hyperparameters can be done by setting the ScalingType parameter to Logarithmic when defining the hyperparameter ranges in Amazon SageMaker12
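
A sketch of defining log-scaled ranges with the SageMaker Python SDK is shown below; the estimator configuration, the specific hyperparameters and ranges, and the job counts are illustrative assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=200)

hyperparameter_ranges = {
    # These span several orders of magnitude, so search them on a log scale.
    "eta": ContinuousParameter(0.001, 0.5, scaling_type="Logarithmic"),
    "alpha": ContinuousParameter(0.0001, 100, scaling_type="Logarithmic"),
    # A narrow integer range can stay on the default linear scale.
    "max_depth": IntegerParameter(3, 10),
}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=4,
)
# tuner.fit({"train": train_input, "validation": validation_input})
```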

The other options are not valid or relevant guidelines for improving the automatic model tuning in Amazon SageMaker. Choosing the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible is not a good practice, as it can increase the time and cost of the tuning job and make it harder to find the optimal values. Amazon SageMaker supports up to 20 hyperparameters for tuning, but it is recommended to choose only the most important and influential hyperparameters for the model and algorithm, and use default or fixed values for the rest3 Specifying a very large hyperparameter range to allow Amazon SageMaker to cover every possible value is not a good practice, as it can result in sampling values that are irrelevant or impractical for the model and algorithm, and waste the tuning budget. It is recommended to specify a reasonable and realistic hyperparameter range based on the prior knowledge and experience of the model and algorithm, and use the results of the tuning job to refine the range if needed4 Executing only one hyperparameter tuning job at a time and improving tuning through successive rounds of experiments is not a good practice, as it can limit the exploration and exploitation of the hyperparameter space and make the tuning process slower and less efficient. It is recommended to use parallelism and concurrency to run multiple training jobs simultaneously and leverage the Bayesian optimization algorithm that Amazon SageMaker uses to guide the search for the best hyperparameter values5

Question #66

A large mobile network operating company is building a machine learning model to predict customers who are likely to unsubscribe from the service. The company plans to offer an incentive for these customers as the cost of churn is far greater than the cost of the incentive.

The model produces the following confusion matrix after evaluating on a test dataset of 100 customers:

Based on the model evaluation results, why is this a viable model for production?

  • A . The model is 86% accurate and the cost incurred by the company as a result of false negatives is less than the false positives.
  • B . The precision of the model is 86%, which is less than the accuracy of the model.
  • C . The model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives.
  • D . The precision of the model is 86%, which is greater than the accuracy of the model.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Based on the model evaluation results, this is a viable model for production because the model is 86% accurate and the cost incurred by the company as a result of false positives is less than the false negatives. The accuracy of the model is the proportion of correct predictions out of the total predictions, which can be calculated by adding the true positives and true negatives and dividing by the total number of observations. In this case, the accuracy of the model is (10 + 76) / 100 = 0.86, which means that the model correctly predicted 86% of the customers’ churn status. The cost incurred by the company as a result of false positives and false negatives is the loss or damage that the company suffers when the model makes incorrect predictions. A false positive is when the model predicts that a customer will churn, but the customer actually does not churn. A false negative is when the model predicts that a customer will not churn, but the customer actually churns. In this case, the cost of a false positive is the incentive that the company offers to the customer who is predicted to churn, which is a relatively low cost. The cost of a false negative is the revenue that the company loses when the customer churns, which is a relatively high cost. Therefore, the cost of a false positive is less than the cost of a false negative, and the company would prefer to have more false positives than false negatives. The model has 10 false positives and 4 false negatives, which means that the company’s cost is lower than if the model had more false negatives and fewer false positives.

Question #67

A Machine Learning Specialist is designing a system for improving sales for a company. The objective is to use the large amount of information the company has on users’ behavior and product preferences to predict which products users would like based on the users’ similarity to other users.

What should the Specialist do to meet this objective?

  • A . Build a content-based filtering recommendation engine with Apache Spark ML on Amazon EMR.
  • B . Build a collaborative filtering recommendation engine with Apache Spark ML on Amazon EMR.
  • C . Build a model-based filtering recommendation engine with Apache Spark ML on Amazon EMR.
  • D . Build a combinative filtering recommendation engine with Apache Spark ML on Amazon EMR.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

A collaborative filtering recommendation engine is a type of machine learning system that can improve sales for a company by using the large amount of information the company has on users’ behavior and product preferences to predict which products users would like based on the users’ similarity to other users. A collaborative filtering recommendation engine works by finding the users who have similar ratings or preferences for the products, and then recommending the products that the similar users have liked but the target user has not seen or rated. A collaborative filtering recommendation engine can leverage the collective wisdom of the users and discover the hidden patterns and associations among the products and the users. A collaborative filtering recommendation engine can be implemented using Apache Spark ML on Amazon EMR, which are two services that can handle large-scale data processing and machine learning tasks. Apache Spark ML is a library that provides various tools and algorithms for machine learning, such as classification, regression, clustering, recommendation, etc. Apache Spark ML can run on Amazon EMR, which is a service that provides a managed cluster platform that simplifies running big data frameworks, such as Apache Spark, on AWS. Apache Spark ML on Amazon EMR can build a collaborative filtering recommendation engine using the Alternating Least Squares (ALS) algorithm, which is a matrix factorization technique that can learn the latent factors that represent the users and the products, and then use them to predict the ratings or preferences of the users for the products. Apache Spark ML on Amazon EMR can also support both explicit feedback, such as ratings or reviews, and implicit feedback, such as views or clicks, for building a collaborative filtering recommendation engine12
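
A minimal PySpark sketch of training a collaborative filtering model with ALS is shown below; the toy interaction data and parameter values are illustrative assumptions.

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("product-recommendations").getOrCreate()

# Hypothetical interaction data: (user, product, implicit preference such as purchase count).
ratings = spark.createDataFrame(
    [(1, 10, 3.0), (1, 11, 1.0), (2, 10, 5.0), (2, 12, 2.0), (3, 11, 4.0)],
    ["user_id", "product_id", "rating"],
)

als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    rank=10,                   # number of latent factors per user and product
    regParam=0.1,
    implicitPrefs=True,        # treat the values as implicit feedback (views, purchases)
    coldStartStrategy="drop",  # skip users/items unseen at training time
)
model = als.fit(ratings)

# Top 5 product recommendations for every user.
model.recommendForAllUsers(5).show(truncate=False)
```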

Question #68

A Data Engineer needs to build a model using a dataset containing customer credit card information.

How can the Data Engineer ensure the data remains encrypted and the credit card information is secure?

  • A . Use a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC. Use the SageMaker DeepAR algorithm to randomize the credit card numbers.
  • B . Use an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers.
  • C . Use an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC. Use the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers.
  • D . Use AWS KMS to encrypt the data on Amazon S3 and Amazon SageMaker, and redact the credit card numbers from the customer data with AWS Glue.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

AWS KMS is a service that provides encryption and key management for data stored in AWS services and applications. AWS KMS can generate and manage encryption keys that are used to encrypt and decrypt data at rest and in transit. AWS KMS can also integrate with other AWS services, such as Amazon S3 and Amazon SageMaker, to enable encryption of data using the keys stored in AWS KMS. Amazon S3 is a service that provides object storage for data in the cloud. Amazon S3 can use AWS KMS to encrypt data at rest using server-side encryption with AWS KMS-managed keys (SSE-KMS). Amazon SageMaker is a service that provides a platform for building, training, and deploying machine learning models. Amazon SageMaker can use AWS KMS to encrypt data at rest on the SageMaker instances and volumes, as well as data in transit between SageMaker and other AWS services. AWS Glue is a service that provides a serverless data integration platform for data preparation and transformation. AWS Glue can use AWS KMS to encrypt data at rest on the Glue Data Catalog and Glue ETL jobs. AWS Glue can also use built-in or custom classifiers to identify and redact sensitive data, such as credit card numbers, from the customer data1234
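
A sketch of how these pieces could fit together with boto3 and the SageMaker Python SDK is shown below; the KMS key ARN, bucket, file name, and IAM role are hypothetical, and the credit card numbers are assumed to have already been redacted (for example, by an AWS Glue job) before the upload.

```python
import boto3
import sagemaker
from sagemaker.estimator import Estimator

KMS_KEY_ID = "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"  # hypothetical

# 1) Store the already-redacted training data in S3 with SSE-KMS encryption.
s3 = boto3.client("s3")
with open("redacted_customers.csv", "rb") as data:
    s3.put_object(
        Bucket="example-secure-bucket",
        Key="training/redacted_customers.csv",
        Body=data,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId=KMS_KEY_ID,
    )

# 2) Train with SageMaker while encrypting the attached volume and the model artifacts with the same key.
session = sagemaker.Session()
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_kms_key=KMS_KEY_ID,   # encrypts the training instance's attached storage
    output_kms_key=KMS_KEY_ID,   # encrypts model artifacts written back to S3
    output_path="s3://example-secure-bucket/models/",
    sagemaker_session=session,
)
```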

The other options are not valid or secure ways to encrypt the data and protect the credit card information. Using a custom encryption algorithm to encrypt the data and store the data on an Amazon SageMaker instance in a VPC is not a good practice, as custom encryption algorithms are not recommended for security and may have flaws or vulnerabilities. Using the SageMaker DeepAR algorithm to randomize the credit card numbers is not a good practice, as DeepAR is a forecasting algorithm that is not designed for data anonymization or encryption. Using an IAM policy to encrypt the data on the Amazon S3 bucket and Amazon Kinesis to automatically discard credit card numbers and insert fake credit card numbers is not a good practice, as IAM policies are not meant for data encryption, but for access control and authorization. Amazon Kinesis is a service that provides real-time data streaming and processing, but it does not have the capability to automatically discard or insert data values. Using an Amazon SageMaker launch configuration to encrypt the data once it is copied to the SageMaker instance in a VPC is not a good practice, as launch configurations are not meant for data encryption, but for specifying the instance type, security group, and user data for the SageMaker instance. Using the SageMaker principal component analysis (PCA) algorithm to reduce the length of the credit card numbers is not a good practice, as PCA is a dimensionality reduction algorithm that is not designed for data anonymization or encryption.

Question #69

A Machine Learning Specialist is using an Amazon SageMaker notebook instance in a private subnet of a corporate VPC. The ML Specialist has important data stored on the Amazon SageMaker notebook instance’s Amazon EBS volume, and needs to take a snapshot of that EBS volume. However, the ML Specialist cannot find the Amazon SageMaker notebook instance’s EBS volume or Amazon EC2 instance within the VPC.

Why is the ML Specialist not seeing the instance visible in the VPC?

  • A . Amazon SageMaker notebook instances are based on the EC2 instances within the customer account, but they run outside of VPCs.
  • B . Amazon SageMaker notebook instances are based on the Amazon ECS service within customer accounts.
  • C . Amazon SageMaker notebook instances are based on EC2 instances running within AWS service accounts.
  • D . Amazon SageMaker notebook instances are based on AWS ECS instances running within AWS service accounts.

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Amazon SageMaker notebook instances are fully managed environments that provide an integrated Jupyter notebook interface for data exploration, analysis, and machine learning. Amazon SageMaker notebook instances are based on EC2 instances that run within AWS service accounts, not within customer accounts. This means that the ML Specialist cannot find the Amazon SageMaker notebook instance’s EC2 instance or EBS volume within the VPC, as they are not visible or accessible to the customer, and therefore cannot take an EBS snapshot of that volume directly. Instead, the important data should be copied off the notebook instance, for example to an Amazon S3 bucket. The ML Specialist can also use VPC interface endpoints to securely connect the Amazon SageMaker notebook instance to resources in the VPC and to other AWS services, such as Amazon S3 buckets, Amazon EFS file systems, or Amazon RDS databases.

Question #70

A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.

Which solution requires the LEAST effort to be able to query this data?

  • A . Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
  • B . Use AWS Glue to catalogue the data and Amazon Athena to run queries.
  • C . Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.
  • D . Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

Using AWS Glue to catalogue the data and Amazon Athena to run queries is the solution that requires the least effort to be able to query the data stored in an Amazon S3 bucket using SQL. AWS Glue is a service that provides a serverless data integration platform for data preparation and transformation. AWS Glue can automatically discover, crawl, and catalogue the data stored in various sources, such as Amazon S3, Amazon RDS, Amazon Redshift, etc. AWS Glue can also use AWS KMS to encrypt the data at rest on the Glue Data Catalog and Glue ETL jobs. AWS Glue can handle both structured and unstructured data, and support various data formats, such as CSV, JSON, Parquet, etc. AWS Glue can also use built-in or custom classifiers to identify and parse the data schema and format1 Amazon Athena is a service that provides an interactive query engine that can run SQL queries directly on data stored in Amazon S3. Amazon Athena can integrate with AWS Glue to use the Glue Data Catalog as a central metadata repository for the data sources and tables. Amazon Athena can also use AWS KMS to encrypt the data at rest on Amazon S3 and the query results. Amazon Athena can query both structured and unstructured data, and support various data formats, such as CSV, JSON, Parquet, etc. Amazon Athena can also use partitions and compression to optimize the query performance and reduce the query cost23
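
A sketch of the Glue-plus-Athena workflow with boto3 is shown below; the crawler name, IAM role, database, table, column names, and S3 paths are all hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# The crawler infers the schema of the csv/JSON/Parquet files under the S3 prefix
# and registers the resulting tables in the Glue Data Catalog.
glue.create_crawler(
    Name="manufacturing-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="manufacturing",
    Targets={"S3Targets": [{"Path": "s3://example-manufacturing-bucket/raw/"}]},
)
glue.start_crawler(Name="manufacturing-data-crawler")

# Once the catalog is populated, Athena can query the tables with plain SQL.
athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString="SELECT machine_id, AVG(temperature) AS avg_temp "
                "FROM manufacturing.sensor_readings GROUP BY machine_id",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```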

The other options are not valid or require more effort to query the data stored in an Amazon S3 bucket using SQL. Using AWS Data Pipeline to transform the data and Amazon RDS to run queries is not a good option, as it involves moving the data from Amazon S3 to Amazon RDS, which can incur additional time and cost. AWS Data Pipeline is a service that can orchestrate and automate data movement and transformation across various AWS services and on-premises data sources. AWS Data Pipeline can be integrated with Amazon EMR to run ETL jobs on the data stored in Amazon S3. Amazon RDS is a service that provides a managed relational database service that can run various database engines, such as MySQL, PostgreSQL, Oracle, etc. Amazon RDS can use AWS KMS to encrypt the data at rest and in transit. Amazon RDS can run SQL queries on the data stored in the database tables45 Using AWS Batch to run ETL on the data and Amazon Aurora to run the queries is not a good option, as it also involves moving the data from Amazon S3 to Amazon Aurora, which can incur additional time and cost. AWS Batch is a service that can run batch computing workloads on AWS. AWS Batch can be integrated with AWS Lambda to trigger ETL jobs on the data stored in Amazon S3. Amazon Aurora is a service that provides a compatible and scalable relational database engine that can run MySQL or PostgreSQL. Amazon Aurora can use AWS KMS to encrypt the data at rest and in transit. Amazon Aurora can run SQL queries on the data stored in the database tables. Using AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries is not a good option, as it is not suitable for querying data stored in Amazon S3 using SQL. AWS Lambda is a service that can run serverless functions on AWS. AWS Lambda can be integrated with Amazon S3 to trigger data transformation functions on the data stored in Amazon S3. Amazon Kinesis Data Analytics is a service that can analyze streaming data using SQL or Apache Flink. Amazon Kinesis Data Analytics can be integrated with Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose to ingest streaming data sources, such as web logs, social media, IoT devices, etc. Amazon Kinesis Data Analytics is not designed for querying data stored in Amazon S3 using SQL.

Question #71

A Machine Learning Specialist receives customer data for an online shopping website. The data includes demographics, past visits, and locality information. The Specialist must develop a machine learning approach to identify the customer shopping patterns, preferences and trends to enhance the website for better service and smart recommendations.

Which solution should the Specialist recommend?

  • A . Latent Dirichlet Allocation (LDA) for the given collection of discrete data to identify patterns in the customer database.
  • B . A neural network with a minimum of three layers and random initial weights to identify patterns in the customer database
  • C . Collaborative filtering based on user interactions and correlations to identify patterns in the customer database
  • D . Random Cut Forest (RCF) over random subsamples to identify patterns in the customer database

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Collaborative filtering is a machine learning technique that recommends products or services to users based on the ratings or preferences of other users. This technique is well-suited for identifying customer shopping patterns and preferences because it takes into account the interactions between users and products.

Question #72

A Machine Learning Specialist is working with a large company to leverage machine learning within its products. The company wants to group its customers into categories based on which customers will and will not churn within the next 6 months. The company has labeled the data available to the Specialist.

Which machine learning model type should the Specialist use to accomplish this task?

  • A . Linear regression
  • B . Classification
  • C . Clustering
  • D . Reinforcement learning

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The goal of classification is to determine to which class or category a data point (a customer, in our case) belongs. For classification problems, data scientists would use historical data with predefined target variables, also known as labels (churner/non-churner), which are the answers to be predicted, to train an algorithm. With classification, businesses can answer the following questions:

Will this customer churn or not?

Will a customer renew their subscription?

Will a user downgrade a pricing plan?

Are there any signs of unusual customer behavior?

Reference: https://www.kdnuggets.com/2019/05/churn-prediction-machine-learning.html

Question #73

The displayed graph is from a forecasting model for testing a time series.

Considering the graph only, which conclusion should a Machine Learning Specialist make about the behavior of the model?

  • A . The model predicts both the trend and the seasonality well.
  • B . The model predicts the trend well, but not the seasonality.
  • C . The model predicts the seasonality well, but not the trend.
  • D . The model does not predict the trend or the seasonality well.

Reveal Solution Hide Solution

Correct Answer: D
Question #74

A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month.

The class distribution for these features is illustrated in the figure provided.

Based on this information which model would have the HIGHEST accuracy?

  • A . Long short-term memory (LSTM) model with scaled exponential linear unit (SELU)
  • B . Logistic regression
  • C . Support vector machine (SVM) with non-linear kernel
  • D . Single perceptron with tanh activation function

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

Based on the figure provided, the data is not linearly separable. Therefore, a non-linear model such as SVM with a non-linear kernel would be the best choice. SVMs are particularly effective in high-dimensional spaces and are versatile in that they can be used for both linear and non-linear data. Additionally, SVMs have a high level of accuracy and are less prone to overfitting1

Reference: 1: https://docs.aws.amazon.com/sagemaker/latest/dg/svm.html
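
To make the intuition concrete, here is a small scikit-learn sketch on synthetic, circularly separated data (a stand-in for the figure, which is not reproduced here); it simply compares a linear kernel against an RBF kernel:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# make_circles mimics the figure: one class enclosed by the other, so no straight
# line can separate them.
X, y = make_circles(n_samples=1000, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```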

Question #75

A Machine Learning Specialist at a company sensitive to security is preparing a dataset for model training. The dataset is stored in Amazon S3 and contains Personally Identifiable Information (PII).

The dataset:

* Must be accessible from a VPC only.

* Must not traverse the public internet.

How can these requirements be satisfied?

  • A . Create a VPC endpoint and apply a bucket access policy that restricts access to the given VPC endpoint and the VPC.
  • B . Create a VPC endpoint and apply a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance.
  • C . Create a VPC endpoint and use Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance.
  • D . Create a VPC endpoint and use security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

A VPC endpoint is a logical device that enables private connections between a VPC and supported AWS services. A VPC endpoint can be either a gateway endpoint or an interface endpoint. A gateway endpoint is a gateway that is a target for a specified route in the route table, used for traffic destined to a supported AWS service. An interface endpoint is an elastic network interface with a private IP address that serves as an entry point for traffic destined to a supported service1

In this case, the Machine Learning Specialist can create a gateway endpoint for Amazon S3, which is a supported service for gateway endpoints. A gateway endpoint for Amazon S3 enables the VPC to access Amazon S3 privately, without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. The traffic between the VPC and Amazon S3 does not leave the Amazon network2

To restrict access to the dataset stored in Amazon S3, the Machine Learning Specialist can apply a bucket access policy that allows access only from the given VPC endpoint and the VPC. A bucket access policy is a resource-based policy that defines who can access a bucket and what actions they can perform. A bucket access policy can use various conditions to control access, such as the source IP address, the source VPC, the source VPC endpoint, etc. In this case, the Machine Learning Specialist can use the aws:sourceVpce condition to specify the ID of the VPC endpoint, and the aws:sourceVpc condition to specify the ID of the VPC. This way, only the requests that originate from the VPC endpoint or the VPC can access the bucket that contains the dataset34
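
A minimal sketch of such a bucket policy applied with boto3 is shown below; the bucket name and VPC endpoint ID are placeholders, and a real policy would typically also be paired with statements that grant the specific principals the access they need:

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny all S3 actions on the bucket unless the request arrives via the VPC endpoint.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAccessExceptFromVpcEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::training-data-bucket",
            "arn:aws:s3:::training-data-bucket/*",
        ],
        "Condition": {
            "StringNotEquals": {"aws:sourceVpce": "vpce-0123456789abcdef0"}
        },
    }],
}

s3.put_bucket_policy(Bucket="training-data-bucket", Policy=json.dumps(policy))
```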

The other options are not valid or secure ways to satisfy the requirements.

Creating a VPC endpoint and applying a bucket access policy that allows access from the given VPC endpoint and an Amazon EC2 instance is not a good option, as it does not restrict access to the VPC. An Amazon EC2 instance is a virtual server that runs in the AWS cloud and can have a public or private IP address, depending on the network configuration. Allowing access from an Amazon EC2 instance does not guarantee that the instance is in the same VPC as the VPC endpoint, and may expose the dataset to unauthorized access.

Creating a VPC endpoint and using Network Access Control Lists (NACLs) to allow traffic between only the given VPC endpoint and an Amazon EC2 instance is not a good option, as it does not restrict access to the VPC. NACLs are stateless firewalls that control inbound and outbound traffic at the subnet level, using rules based on the protocol, port, and source or destination IP address. However, NACLs do not support VPC endpoints as a source or destination and cannot filter traffic based on the VPC endpoint ID or the VPC ID. Therefore, using NACLs does not guarantee that the traffic is from the VPC endpoint or the VPC, and may expose the dataset to unauthorized access.

Creating a VPC endpoint and using security groups to restrict access to the given VPC endpoint and an Amazon EC2 instance is not a good option, as it does not restrict access to the VPC. Security groups are stateful firewalls that control inbound and outbound traffic at the instance level, using rules based on the protocol, port, and source or destination. However, security groups do not support VPC endpoints as a source or destination and cannot filter traffic based on the VPC endpoint ID or the VPC ID. Therefore, using security groups does not guarantee that the traffic is from the VPC endpoint or the VPC, and may expose the dataset to unauthorized access.

Question #76

An employee found a video clip with audio on a company’s social media feed. The language used in the video is Spanish. English is the employee’s first language, and they do not understand Spanish. The employee wants to do a sentiment analysis.

What combination of services is the MOST efficient to accomplish the task?

  • A . Amazon Transcribe, Amazon Translate, and Amazon Comprehend
  • B . Amazon Transcribe, Amazon Comprehend, and Amazon SageMaker seq2seq
  • C . Amazon Transcribe, Amazon Translate, and Amazon SageMaker Neural Topic Model (NTM)
  • D . Amazon Transcribe, Amazon Translate, and Amazon SageMaker BlazingText

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Amazon Transcribe, Amazon Translate, and Amazon Comprehend are the most efficient combination of services to accomplish the task of sentiment analysis on a video clip with audio in Spanish.

Amazon Transcribe is a service that can convert speech to text using deep learning. It can transcribe audio from various sources, such as video files, audio files, or streaming audio, and can recognize multiple speakers, different languages, accents, dialects, and custom vocabularies. In this case, Amazon Transcribe can transcribe the audio from the video clip in Spanish to text in Spanish.

Amazon Translate is a service that can translate text from one language to another using neural machine translation. It can translate text from various sources, such as documents, web pages, and chat messages, and supports multiple languages, domains, and styles. In this case, Amazon Translate can translate the text from Spanish to English.

Amazon Comprehend is a service that can analyze and derive insights from text using natural language processing. It can perform tasks such as sentiment analysis, entity recognition, key phrase extraction, and topic modeling, and supports multiple languages and domains. In this case, Amazon Comprehend can perform sentiment analysis on the text in English and determine whether the feedback is positive, negative, neutral, or mixed.
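
A minimal boto3 sketch of this pipeline follows; the S3 URI, job name, and sample text are placeholders, and the step of polling for and downloading the finished transcript is omitted for brevity:

```python
import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")
comprehend = boto3.client("comprehend")

# 1) Speech (Spanish) -> text (Spanish); transcription jobs run asynchronously.
transcribe.start_transcription_job(
    TranscriptionJobName="social-clip-sentiment",
    Media={"MediaFileUri": "s3://my-bucket/clips/video.mp4"},
    MediaFormat="mp4",
    LanguageCode="es-ES",
)

# 2) Once the transcript text has been retrieved, translate Spanish -> English.
spanish_text = "El producto es excelente y el envío fue rápido."
english_text = translate.translate_text(
    Text=spanish_text, SourceLanguageCode="es", TargetLanguageCode="en"
)["TranslatedText"]

# 3) Sentiment analysis on the English text.
sentiment = comprehend.detect_sentiment(Text=english_text, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])
```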

The other options are less efficient for accomplishing the task of sentiment analysis on a video clip with audio in Spanish. Amazon SageMaker seq2seq, Amazon SageMaker Neural Topic Model (NTM), and Amazon SageMaker BlazingText are training algorithms, so each of those combinations would require building, training, and hosting a custom model instead of calling a managed service. More importantly, NTM and BlazingText do not perform sentiment analysis, which is the main goal of the task: NTM discovers topics in a corpus, and BlazingText trains text classification and word embedding models. Using Amazon Translate together with Amazon Comprehend covers both the translation and the sentiment analysis steps out of the box.

Question #77

A Machine Learning Specialist is packaging a custom ResNet model into a Docker container so the company can leverage Amazon SageMaker for training. The Specialist is using Amazon EC2 P3 instances to train the model and needs to properly configure the Docker container to leverage the NVIDIA GPUs.

What does the Specialist need to do?

  • A . Bundle the NVIDIA drivers with the Docker image.
  • B . Build the Docker container to be NVIDIA-Docker compatible.
  • C . Organize the Docker container’s file structure to execute on GPU instances.
  • D . Set the GPU flag in the Amazon SageMaker CreateTrainingJob request body

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

To leverage the NVIDIA GPUs on Amazon EC2 P3 instances for training a custom ResNet model using Amazon SageMaker, the Machine Learning Specialist needs to build the Docker container to be NVIDIA-Docker compatible. NVIDIA-Docker is a tool that enables GPU-accelerated containers to run on Docker. NVIDIA-Docker can automatically configure the Docker container with the necessary drivers, libraries, and environment variables to access the NVIDIA GPUs. NVIDIA-Docker can also isolate the GPU resources and ensure that each container has exclusive access to a GPU.

To build a Docker container that is NVIDIA-Docker compatible, the Machine Learning Specialist needs to follow these steps:

Install the NVIDIA Container Toolkit on the host machine that runs Docker. This toolkit includes the NVIDIA Container Runtime, which is a modified version of the Docker runtime that supports GPU hardware.

Use the base image provided by NVIDIA as the first line of the Dockerfile. The base image contains the NVIDIA drivers and CUDA toolkit that are required for GPU-accelerated applications. The base image can be specified as FROM nvcr.io/nvidia/cuda:tag, where tag is the version of CUDA and the operating system.

Install the required dependencies and frameworks for the ResNet model, such as PyTorch, torchvision, etc., in the Dockerfile.

Copy the ResNet model code and any other necessary files to the Docker container in the Dockerfile.

Build the Docker image using the docker build command.

Push the Docker image to a repository, such as Amazon Elastic Container Registry (Amazon ECR), using the docker push command.

Specify the Docker image URI and the instance type (ml.p3.xlarge) in the Amazon SageMaker CreateTrainingJob request body.
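
For the last step, a minimal sketch of the SageMaker Python SDK call (which issues the CreateTrainingJob request for the custom container) might look like the following; the image URI, role ARN, and S3 paths are placeholders, and the image is assumed to have been built from an NVIDIA CUDA base image and pushed to Amazon ECR:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/resnet-gpu:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",   # P3 instances expose NVIDIA V100 GPUs
    output_path="s3://my-bucket/resnet/output",
    sagemaker_session=session,
)

estimator.fit({"training": "s3://my-bucket/resnet/train"})
```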

The other options are not valid or sufficient for building a Docker container that can leverage the NVIDIA GPUs on Amazon EC2 P3 instances. Bundling the NVIDIA drivers with the Docker image is not a good option, as it can cause driver conflicts and compatibility issues with the host machine and the NVIDIA GPUs. Organizing the Docker container’s file structure to execute on GPU instances is not a good option, as it does not ensure that the Docker container can access the NVIDIA GPUs and the CUDA toolkit. Setting the GPU flag in the Amazon SageMaker CreateTrainingJob request body is not a good option, as it does not apply to custom Docker containers, but only to built-in algorithms and frameworks that support GPU instances.

Question #78

A Machine Learning Specialist is building a logistic regression model that will predict whether or not a person will order a pizza. The Specialist is trying to build the optimal model with an ideal classification threshold.

What model evaluation technique should the Specialist use to understand how different classification thresholds will impact the model’s performance?

  • A . Receiver operating characteristic (ROC) curve
  • B . Misclassification rate
  • C . Root Mean Square Error (RMSE)
  • D . L1 norm

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

A receiver operating characteristic (ROC) curve is a model evaluation technique that can be used to understand how different classification thresholds will impact the model’s performance. A ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for various values of the classification threshold. The TPR, also known as sensitivity or recall, is the proportion of positive instances that are correctly classified as positive. The FPR, also known as the fall-out, is the proportion of negative instances that are incorrectly classified as positive. A ROC curve can show the trade-off between the TPR and the FPR for different thresholds, and help the Machine Learning Specialist to select the optimal threshold that maximizes the TPR and minimizes the FPR. A ROC curve can also be used to compare the performance of different models by calculating the area under the curve (AUC), which is a measure of how well the model can distinguish between the positive and negative classes. A higher AUC indicates a better model
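
A short scikit-learn sketch of this evaluation is shown below; the dataset is synthetic, standing in for the pizza-ordering data, and simply illustrates how each candidate threshold maps to an FPR/TPR pair:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pizza-ordering dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# One (FPR, TPR) pair per candidate threshold, plus the summary AUC.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
for f, t, th in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```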

Question #79

An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget.

What should the Specialist do to meet these requirements?

  • A . Create one-hot word encoding vectors.
  • B . Produce a set of synonyms for every word using Amazon Mechanical Turk.
  • C . Create word embedding vectors that store edit distance with every other word.
  • D . Download word embeddings pre-trained on a large corpus.

Reveal Solution Hide Solution

Correct Answer: D
D

Explanation:

Word embeddings are a type of dense representation of words, which encode semantic meaning in a vector form. These embeddings are typically pre-trained on a large corpus of text data, such as a large set of books, news articles, or web pages, and capture the context in which words are used. Word embeddings can be used as features for a nearest neighbor model, which can be used to find words used in similar contexts. Downloading pre-trained word embeddings is a good way to get started quickly and leverage the strengths of these representations, which have been optimized on a large amount of data. This is likely to result in more accurate and reliable features than other options like one-hot encoding, edit distance, or using Amazon Mechanical Turk to produce synonyms.

Reference: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-object2vec-adds-new-features-that-support-automatic-negative-sampling-and-speed-up-training/
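
A quick sketch of the idea using gensim's downloader (the model name is one of the pre-trained GloVe sets exposed by gensim, and the example word is arbitrary) is shown below; the first call downloads the vectors, which takes a few minutes:

```python
# Requires: pip install gensim
import gensim.downloader as api

# GloVe vectors pre-trained on a large corpus; returns a KeyedVectors object.
vectors = api.load("glove-wiki-gigaword-50")

# Words used in similar contexts sit close together in embedding space,
# which is exactly what a nearest neighbor widget needs.
print(vectors.most_similar("dictionary", topn=5))
```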

Question #80

A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked.

Which services are integrated with Amazon SageMaker to track this information? (Select TWO.)

  • A . AWS CloudTrail
  • B . AWS Health
  • C . AWS Trusted Advisor
  • D . Amazon CloudWatch
  • E . AWS Config

Reveal Solution Hide Solution

Correct Answer: A, D
A, D

Explanation:

The services that are integrated with Amazon SageMaker to track the information that the Machine Learning Specialist needs are AWS CloudTrail and Amazon CloudWatch.

AWS CloudTrail is a service that records the API calls and events for AWS services, including Amazon SageMaker. AWS CloudTrail can track the actions performed by the Data Scientists, such as creating notebooks, training models, and deploying endpoints, and can provide information such as the identity of the user, the time of the action, the parameters used, and the response elements returned. AWS CloudTrail can help the Machine Learning Specialist monitor the usage and activity of Amazon SageMaker, as well as audit and troubleshoot any issues.

Amazon CloudWatch is a service that collects and analyzes the metrics and logs for AWS services, including Amazon SageMaker. Amazon CloudWatch can track the performance and utilization of the Amazon SageMaker endpoints, such as CPU and GPU utilization, inference latency, and the number of invocations. It can also track the errors and alarms that are generated when an endpoint is invoked, such as model errors, throttling errors, and HTTP errors. Amazon CloudWatch can help the Machine Learning Specialist optimize the operational performance and reliability of Amazon SageMaker, as well as set up notifications and actions based on the metrics and logs.
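
A minimal boto3 sketch of reading this information follows; the endpoint name, variant name, and time window are placeholders. Invocation metrics are published under the AWS/SageMaker namespace, while instance-level metrics such as CPUUtilization and GPUUtilization are published under the /aws/sagemaker/Endpoints namespace:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudtrail = boto3.client("cloudtrail")
now = datetime.datetime.utcnow()

# Invocation count for a deployed endpoint over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])

# Recent CreateEndpoint calls recorded by CloudTrail (deployment activity).
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "CreateEndpoint"}],
    StartTime=now - datetime.timedelta(days=7),
    EndTime=now,
)
print(len(events["Events"]), "CreateEndpoint calls in the last 7 days")
```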

Question #81

A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target.

What option can the Specialist use to determine whether it is overestimating or underestimating the target value?

  • A . Root Mean Square Error (RMSE)
  • B . Residual plots
  • C . Area under the curve
  • D . Confusion matrix

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

Residual plots are a model evaluation technique that can be used to understand whether a regression model is more frequently overestimating or underestimating the target. Residual plots are graphs that plot the residuals (the difference between the actual and predicted values) against the predicted values or other variables. Residual plots can help the Machine Learning Specialist to identify the patterns and trends in the residuals, such as the direction, shape, and distribution. Residual plots can also reveal the presence of outliers, heteroscedasticity, non-linearity, or other problems in the model12

To determine whether the model is overestimating or underestimating the target, the Machine Learning Specialist can use a residual plot that plots the residuals against the predicted values. This type of residual plot is also known as a prediction error plot. A prediction error plot can show the magnitude and direction of the errors made by the model. If the model is overestimating the target, the residuals will be negative, and the points will be below the zero line. If the model is underestimating the target, the residuals will be positive, and the points will be above the zero line. If the model is accurate, the residuals will be close to zero, and the points will be scattered around the zero line. A prediction error plot can also show the variance and bias of the model. If the model has high variance, the residuals will have a large spread, and the points will be far from the zero line. If the model has high bias, the residuals will have a systematic pattern, such as a curve or a slope, and the points will not be randomly distributed around the zero line. A prediction error plot can help the Machine Learning Specialist to optimize the model by adjusting the complexity, features, or parameters of the model34
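
As a rough illustration (not tied to the Specialist's actual model), the following matplotlib sketch builds a prediction error plot from synthetic predictions that were deliberately generated to overestimate the target, so the residuals sit mostly below the zero line:

```python
import matplotlib.pyplot as plt
import numpy as np

# y_true / y_pred would come from the trained regression model; synthetic here.
rng = np.random.default_rng(0)
y_true = rng.normal(100, 20, size=500)
y_pred = y_true + rng.normal(5, 10, size=500)   # model that tends to overestimate

residuals = y_true - y_pred

plt.scatter(y_pred, residuals, alpha=0.4)
plt.axhline(0, color="red")
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Prediction error plot")
plt.show()

# Mostly negative residuals -> the model is overestimating the target;
# mostly positive residuals -> it is underestimating.
print("Mean residual:", residuals.mean())
```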

The other options are not valid or suitable for determining whether the model is overestimating or underestimating the target. Root Mean Square Error (RMSE) is a model evaluation metric that measures the average magnitude of the errors made by the model. RMSE is the square root of the mean of the squared residuals. RMSE can indicate the overall accuracy and performance of the model, but it cannot show the direction or distribution of the errors. RMSE can also be influenced by outliers or extreme values, and it may not be comparable across different models or datasets5 Area under the curve (AUC) is a model evaluation metric that measures the ability of the model to distinguish between the positive and negative classes. AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate for various classification thresholds. AUC can indicate the overall quality and performance of the model, but it is only applicable for binary classification models, not regression models. AUC cannot show the magnitude or direction of the errors made by the model. Confusion matrix is a model evaluation technique that summarizes the number of correct and incorrect predictions made by the model for each class. A confusion matrix is a table that shows the counts of true positives, false positives, true negatives, and false negatives for each class. A confusion matrix can indicate the accuracy, precision, recall, and F1-score of the model for each class, but it is only applicable for classification models, not regression models. A confusion matrix cannot show the magnitude or direction of the errors made by the model.

Question #82

A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month.

The class distribution for these features is illustrated in the figure provided.

Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?

  • A . Decision tree
  • B . Linear support vector machine (SVM)
  • C . Naive Bayesian classifier
  • D . Single Perceptron with sigmoidal activation function

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

Based on the figure provided, a decision tree would have the highest recall with respect to the fraudulent class. Recall is a model evaluation metric that measures the proportion of actual positive instances that are correctly classified by the model. Recall is calculated as follows: Recall = True Positives / (True Positives + False Negatives)

A decision tree is a type of machine learning model that can perform classification tasks by splitting the data into smaller and purer subsets based on a series of rules or conditions. A decision tree can handle both linear and non-linear data, and can capture complex patterns and interactions among the features. A decision tree can also be easily visualized and interpreted1

In this case, the data is not linearly separable and has a clear pattern of seasonality. The fraudulent class forms a large circle in the center of the plot, while the normal class is scattered around the edges. A decision tree can use the transaction month and the age of account as the splitting criteria and create a circular boundary that separates the fraudulent class from the normal class. A decision tree can achieve a high recall for the fraudulent class, as it can correctly identify most of the black dots as positive instances and minimize the number of false negatives. A decision tree can also adjust the depth and complexity of the tree to balance the trade-off between recall and precision.

The other options are not valid or suitable for achieving a high recall for the fraudulent class.

A linear support vector machine (SVM) is a type of machine learning model that can perform classification tasks by finding a linear hyperplane that maximizes the margin between the classes. A linear SVM can handle linearly separable data, but not non-linear data. A linear SVM cannot capture the circular pattern of the fraudulent class and may misclassify many of the black dots as negative instances, resulting in a low recall.

A naive Bayesian classifier is a type of machine learning model that can perform classification tasks by applying the Bayes’ theorem and assuming conditional independence among the features. A naive Bayesian classifier can handle both linear and non-linear data and can incorporate prior knowledge and probabilities into the model. However, a naive Bayesian classifier may not perform well when the features are correlated or dependent, as in this case. It may not capture the circular pattern of the fraudulent class and may misclassify many of the black dots as negative instances, resulting in a low recall.

A single perceptron with sigmoidal activation function is a type of machine learning model that can perform classification tasks by applying a weighted linear combination of the features and a non-linear activation function. A single perceptron with sigmoidal activation function can handle linearly separable data, but not non-linear data. It cannot capture the circular pattern of the fraudulent class and may misclassify many of the black dots as negative instances, resulting in a low recall.
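
A small scikit-learn sketch of the recommended approach follows; the data is synthetic and generated so that the fraudulent class forms a central circular region, loosely mimicking the figure, and recall is computed with respect to the fraudulent class:

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, circularly separated data standing in for the figure:
# fraudulent points cluster in the middle, normal points around the edges.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(3000, 2))                  # scaled [age of account, transaction month]
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.4).astype(int)     # 1 = fraudulent

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)

# Recall = TP / (TP + FN) with respect to the fraudulent class (label 1).
print("Fraud recall:", recall_score(y_test, tree.predict(X_test), pos_label=1))
```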

Question #83

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Select THREE.)

  • A . The training channel identifying the location of training data on an Amazon S3 bucket.
  • B . The validation channel identifying the location of validation data on an Amazon S3 bucket.
  • C . The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users.
  • D . Hyperparameters in a JSON array as documented for the algorithm used.
  • E . The Amazon EC2 instance class specifying whether training will be run using CPU or GPU.
  • F . The output path specifying where on an Amazon S3 bucket the trained model will persist.

Reveal Solution Hide Solution

Correct Answer: A, C, F
A, C, F

Explanation:

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, the common parameters that must be specified are:

The training channel identifying the location of training data on an Amazon S3 bucket. This parameter tells SageMaker where to find the input data for the algorithm and what format it is in. For example, TrainingInputMode: File means that the input data is in files stored in S3.

The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users. This parameter grants SageMaker the necessary permissions to access the S3 buckets, ECR repositories, and other AWS resources needed for the training job. For example, RoleArn: arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20200303T150948 means that SageMaker will use the specified role to run the training job.

The output path specifying where on an Amazon S3 bucket the trained model will persist. This parameter tells SageMaker where to save the model artifacts, such as the model weights and parameters, after the training job is completed. For example, OutputDataConfig: {S3OutputPath: s3://my-bucket/my-training-job} means that SageMaker will store the model artifacts in the specified S3 location.

The validation channel identifying the location of validation data on an Amazon S3 bucket is an optional parameter that can be used to provide a separate dataset for evaluating the model performance during the training process. This parameter is not required for all algorithms and can be omitted if the validation data is not available or not needed.

The hyperparameters in a JSON array as documented for the algorithm used is another optional parameter that can be used to customize the behavior and performance of the algorithm. This parameter is specific to each algorithm and can be used to tune the model accuracy, speed, complexity, and other aspects. For example, HyperParameters: {num_round: "10", objective: "binary:logistic"} means that the XGBoost algorithm will use 10 boosting rounds and the logistic loss function for binary classification.

The Amazon EC2 instance class specifying whether training will be run using CPU or GPU is not one of the algorithm-level parameters listed in the answer. It is part of the ResourceConfig section of the request, which describes the training cluster rather than the algorithm. For example, ResourceConfig: {InstanceType: ml.m5.xlarge, InstanceCount: 1, VolumeSizeInGB: 10} means that SageMaker will use one ml.m5.xlarge instance with 10 GB of storage for training.
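
Pulling these pieces together, a hypothetical CreateTrainingJob call via boto3 might look like the sketch below; all names, ARNs, and the image URI are placeholders, and the image URI would be the regional registry path of the chosen built-in algorithm:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="churn-xgboost-001",
    AlgorithmSpecification={
        "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",          # required IAM role
    InputDataConfig=[{                                                        # training channel
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/churn/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
        "ContentType": "text/csv",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/churn/output/"},        # output path
    ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 10},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    HyperParameters={"num_round": "10", "objective": "binary:logistic"},
)
```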

Reference: Train a Model with Amazon SageMaker

Use Amazon SageMaker Built-in Algorithms or Pre-trained Models

CreateTrainingJob – Amazon SageMaker Service

Question #84

A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.

Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.

How should the Data Scientist correct this issue?

  • A . Drop all records from the dataset where age has been set to 0.
  • B . Replace the age field value for records with a value of 0 with the mean or median value from the dataset.
  • C . Drop the age feature from the dataset and train the model using the rest of the features.
  • D . Use k-means clustering to handle missing features.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The best way to handle the missing values in the patient age feature is to replace them with the mean or median value from the dataset. This is a common technique for imputing missing values that preserves the overall distribution of the data and avoids introducing bias or reducing the sample size. Dropping the records or the feature would result in losing valuable information and reducing the accuracy of the model. Using k-means clustering would not be appropriate for handling missing values in a single feature, as it is a method for grouping similar data points based on multiple features.
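
A tiny pandas sketch of this imputation follows; the dataframe is a made-up miniature of the patient dataset, with 0 standing in for the missing ages:

```python
import numpy as np
import pandas as pd

# Hypothetical patient data; in the real dataset 450 of the 4,000 ages are 0.
df = pd.DataFrame({"age": [72, 80, 0, 68, 0, 91],
                   "outcome": [1.2, 3.4, 2.2, 0.9, 1.8, 4.1]})

# Treat 0 as missing, then impute with the median of the valid ages.
df["age"] = df["age"].replace(0, np.nan)
median_age = df["age"].median()
df["age"] = df["age"].fillna(median_age)

print(median_age, df["age"].tolist())
```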

Reference: Effective Strategies to Handle Missing Values in Data Analysis

How To Handle Missing Values In Machine Learning Data With Weka

How to handle missing values in Python – Machine Learning Plus

Question #85

A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL.

Which storage scheme is MOST adapted to this scenario?

  • A . Store datasets as files in Amazon S3.
  • B . Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.
  • C . Store datasets as tables in a multi-node Amazon Redshift cluster.
  • D . Store datasets as global tables in Amazon DynamoDB.

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

The best storage scheme for this scenario is to store datasets as files in Amazon S3. Amazon S3 is a scalable, cost-effective, and durable object storage service that can store any amount and type of data. Amazon S3 also supports querying data using SQL with Amazon Athena, a serverless interactive query service that can analyze data directly in S3. This way, the Data Science team can easily explore and analyze their datasets without having to load them into a database or a compute instance (a minimal Athena query sketch follows the list of alternatives below).

The other options are not as suitable for this scenario because:

Storing datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance would limit the scalability and availability of the data, as EBS volumes are only accessible within a single availability zone and have a maximum size of 16 TiB. Also, EBS volumes are more expensive than S3 buckets and require provisioning and managing EC2 instances.

Storing datasets as tables in a multi-node Amazon Redshift cluster would incur higher costs and complexity than using S3 and Athena. Amazon Redshift is a data warehouse service that is optimized for analytical queries over structured or semi-structured data. However, it requires setting up and maintaining a cluster of nodes, loading data into tables, and choosing the right distribution and sort keys for optimal performance. Moreover, Amazon Redshift charges for both storage and compute, while S3 and Athena only charge for the amount of data stored and scanned, respectively.

Storing datasets as global tables in Amazon DynamoDB would not be feasible for large amounts of data, as DynamoDB is a key-value and document database service that is designed for fast and consistent performance at any scale. However, DynamoDB has a limit of 400 KB per item and 25 GB per partition key value, which may not be enough for storing large datasets. Also, DynamoDB does not support SQL queries natively, and would require using a service like Amazon EMR or AWS Glue to run SQL queries over DynamoDB data.
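
As noted above, here is a minimal, hypothetical sketch of querying the S3-resident datasets with Athena through boto3; the database, table, and result location are placeholders and would normally be defined through the AWS Glue Data Catalog:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) AS n FROM training_datasets.images GROUP BY label",
    QueryExecutionContext={"Database": "training_datasets"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])   # results can be fetched with get_query_results later
```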

Reference: Amazon S3 – Cloud Object Storage

Amazon Athena – Interactive SQL Queries for Data in Amazon S3

Amazon EBS – Amazon Elastic Block Store (EBS)

Amazon Redshift – Data Warehouse Solution – AWS

Amazon DynamoDB – NoSQL Cloud Database Service

Question #86

A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company’s Amazon S3-based data lake.

The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of:

• Real-time analytics

• Interactive analytics of historical data

• Clickstream analytics

• Product recommendations

Which services should the Specialist use?

  • A . AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations
  • B . Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for near-realtime data insights; Amazon Kinesis Data Firehose for clickstream analytics; AWS Glue to generate personalized product recommendations
  • C . AWS Glue as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for historical data insights; Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics; Amazon EMR to generate personalized product recommendations
  • D . Amazon Athena as the data catalog; Amazon Kinesis Data Streams and Amazon Kinesis Data
    Analytics for historical data insights; Amazon DynamoDB streams for clickstream analytics; AWS Glue to generate personalized product recommendations

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

The best services to use for building a data ingestion solution for the company’s Amazon S3-based data lake are:

AWS Glue as the data catalog: AWS Glue is a fully managed extract, transform, and load (ETL) service that can discover, crawl, and catalog data from various sources and formats, and make it available for analysis. AWS Glue can also generate ETL code in Python or Scala to transform, enrich, and join data using AWS Glue Data Catalog as the metadata repository. AWS Glue Data Catalog is a central metadata store that integrates with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, allowing users to create a unified view of their data across various sources and formats.

Amazon Kinesis Data Streams and Amazon Kinesis Data Analytics for real-time data insights: Amazon Kinesis Data Streams is a service that enables users to collect, process, and analyze real-time streaming data at any scale. Users can create data streams that can capture data from various sources, such as web and mobile applications, IoT devices, and social media platforms. Amazon Kinesis Data Analytics is a service that allows users to analyze streaming data using standard SQL queries or Apache Flink applications. Users can create real-time dashboards, metrics, and alerts based on the streaming data analysis results.

Amazon Kinesis Data Firehose for delivery to Amazon ES for clickstream analytics: Amazon Kinesis Data Firehose is a service that enables users to load streaming data into data lakes, data stores, and analytics services. Users can configure Kinesis Data Firehose to automatically deliver data to various destinations, such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and third-party solutions. For clickstream analytics, users can use Kinesis Data Firehose to deliver data to Amazon OpenSearch Service, a fully managed service that offers search and analytics capabilities for log data. Users can use Amazon OpenSearch Service to perform interactive analysis and visualization of clickstream data using Kibana, an open-source tool that is integrated with Amazon OpenSearch Service.

Amazon EMR to generate personalized product recommendations: Amazon EMR is a service that enables users to run distributed data processing frameworks, such as Apache Spark, Apache Hadoop, and Apache Hive, on scalable clusters of EC2 instances. Users can use Amazon EMR to perform advanced analytics, such as machine learning, on large and complex datasets stored in Amazon S3 or other sources. For product recommendations, users can use Amazon EMR to run Spark MLlib, a library that provides scalable machine learning algorithms, such as collaborative filtering, to generate personalized recommendations based on user behavior and preferences.
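
For example, a minimal sketch of pushing one clickstream event into Kinesis Data Streams with boto3 (the stream name and event fields are hypothetical) looks like this; Kinesis Data Analytics and Kinesis Data Firehose would then consume from this stream:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Each click event becomes one record; the partition key spreads load across shards.
event = {"user_id": "u-123", "page": "/product/456", "ts": "2024-01-01T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```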

Reference:

AWS Glue – Fully Managed ETL Service

Amazon Kinesis – Data Streaming Service

Amazon OpenSearch Service – Managed OpenSearch Service

Amazon EMR – Managed Hadoop Framework

Question #87

A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture.

Which of the following will accomplish this? (Select TWO.)

  • A . Customize the built-in image classification algorithm to use Inception and use this for model training.
  • B . Create a support case with the SageMaker team to change the default image classification algorithm to Inception.
  • C . Bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training.
  • D . Use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception network and use this for model training.
  • E . Download and apt-get install the inception network code into an Amazon EC2 instance and use this instance as a Jupyter notebook in Amazon SageMaker.

Reveal Solution Hide Solution

Correct Answer: C, D
C, D

Explanation:

The best options to use an Inception neural network architecture instead of a ResNet architecture for image classification in Amazon SageMaker are:

Bundle a Docker container with TensorFlow Estimator loaded with an Inception network and use this for model training. This option allows users to customize the training environment and use any TensorFlow model they want. Users can create a Docker image that contains the TensorFlow Estimator API and the Inception model from the TensorFlow Hub, and push it to Amazon ECR. Then, users can use the SageMaker Estimator class to train the model using the custom Docker image and the training data from Amazon S3.

Use custom code in Amazon SageMaker with TensorFlow Estimator to load the model with an Inception network and use this for model training. This option allows users to use the built-in TensorFlow container provided by SageMaker and write custom code to load and train the Inception model. Users can use the TensorFlow Estimator class to specify the custom code and the training data from Amazon S3. The custom code can use the TensorFlow Hub module to load the Inception model and fine-tune it on the training data.
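
A sketch of the second approach is shown below; the entry point script name, role ARN, S3 paths, and framework versions are placeholders, and the script itself would load an Inception model (for example from TensorFlow Hub) and fine-tune it on the training channel:

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train_inception.py",          # custom code that builds the Inception model
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
    hyperparameters={"epochs": 10, "batch_size": 64},
)

estimator.fit({"training": "s3://my-bucket/images/train"})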

The other options are not feasible for this scenario because:

Customize the built-in image classification algorithm to use Inception and use this for model training. This option is not possible because the built-in image classification algorithm in SageMaker does not support customizing the neural network architecture. The built-in algorithm only supports ResNet models with different depths and widths.

Create a support case with the SageMaker team to change the default image classification algorithm to Inception. This option is not realistic because the SageMaker team does not provide such a service. Users cannot request the SageMaker team to change the default algorithm or add new algorithms to the built-in ones.

Download and apt-get install the inception network code into an Amazon EC2 instance and use this instance as a Jupyter notebook in Amazon SageMaker. This option is not advisable because it does not leverage the benefits of SageMaker, such as managed training and deployment, distributed training, and automatic model tuning. Users would have to manually install and configure the Inception network code and the TensorFlow framework on the EC2 instance, and run the training and inference code on the same instance, which may not be optimal for performance and scalability.

Reference: Use Your Own Algorithms or Models with Amazon SageMaker Use the SageMaker TensorFlow Serving Container

TensorFlow Hub

Question #88

A Machine Learning Specialist built an image classification deep learning model. However the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%r respectively.

How should the Specialist address this issue and what is the reason behind it?

  • A . The learning rate should be increased because the optimization process was trapped at a local minimum.
  • B . The dropout rate at the flatten layer should be increased because the model is not generalized enough.
  • C . The dimensionality of dense layer next to the flatten layer should be increased because the model is not complex enough.
  • D . The epoch number should be increased because the optimization process was terminated before it reached the global minimum.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

The best way to address the overfitting problem in image classification is to increase the dropout rate at the flatten layer because the model is not generalized enough. Dropout is a regularization technique that randomly drops out some units from the neural network during training, reducing the co-adaptation of features and preventing overfitting. The flatten layer is the layer that converts the output of the convolutional layers into a one-dimensional vector that can be fed into the dense layers. Increasing the dropout rate at the flatten layer means that more features from the convolutional layers will be ignored, forcing the model to learn more robust and generalizable representations from the remaining features.
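
A minimal Keras sketch of this change is shown below (the framework and layer sizes are illustrative, not the Specialist's actual architecture); the relevant part is the higher dropout rate applied right after the Flatten layer:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),          # increased from e.g. 0.2 to regularize harder
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```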

The other options are not correct for this scenario because:

Increasing the learning rate would not help with the overfitting problem, as it would make the optimization process more unstable and prone to overshooting the global minimum. A high learning rate can also cause the model to diverge or oscillate around the optimal solution, resulting in poor performance and accuracy.

Increasing the dimensionality of the dense layer next to the flatten layer would not help with the overfitting problem, as it would make the model more complex and increase the number of parameters to be learned. A more complex model can fit the training data better, but it can also memorize the noise and irrelevant details in the data, leading to overfitting and poor generalization.

Increasing the epoch number would not help with the overfitting problem, as it would make the model train longer and more likely to overfit the training data. A high epoch number can cause the model to converge to the global minimum, but it can also cause the model to over-optimize the training data and lose the ability to generalize to new data.

Reference: Dropout: A Simple Way to Prevent Neural Networks from Overfitting

How to Reduce Overfitting With Dropout Regularization in Keras

How to Control the Stability of Training Neural Networks With the Learning Rate

How to Choose the Number of Hidden Layers and Nodes in a Feedforward Neural Network?

How to decide the optimal number of epochs to train a neural network?

Question #89

A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting. Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls.

What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps?

  • A . Implement an AWS Lambda function to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting.
  • B . Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting.
  • C . Implement an AWS Lambda function to log Amazon SageMaker API calls to AWS CloudTrail. Add code to push a custom metric to Amazon CloudWatch. Create an alarm in CloudWatch with Amazon SNS to receive a notification when the model is overfitting.
  • D . Use AWS CloudTrail to log Amazon SageMaker API calls to Amazon S3. Set up Amazon SNS to receive a notification when the model is overfitting.

Reveal Solution Hide Solution

Correct Answer: B
B

Explanation:

To log Amazon SageMaker API calls, the team can use AWS CloudTrail, which is a service that provides a record of actions taken by a user, role, or an AWS service in SageMaker1. CloudTrail captures all API calls for SageMaker, with the exception of InvokeEndpoint and InvokeEndpointAsync, as events1. The calls captured include calls from the SageMaker console and code calls to the SageMaker API operations1. The team can create a trail to enable continuous delivery of CloudTrail events to an Amazon S3 bucket, and configure other AWS services to further analyze and act upon the event data collected in CloudTrail logs1. The auditors can view the CloudTrail log activity report in the CloudTrail console or download the log files from the S3 bucket1.

To receive a notification when the model is overfitting, the team can add code to push a custom metric to Amazon CloudWatch, which is a service that provides monitoring and observability for AWS resources and applications2. The team can use the MXNet metric API to define and compute the custom metric, such as the validation accuracy or the validation loss, and use the boto3 CloudWatch client to put the metric data to CloudWatch3. The team can then create an alarm in CloudWatch with Amazon SNS to receive a notification when the custom metric crosses a threshold that indicates overfitting. For example, the team can set the alarm to trigger when the validation loss increases for a certain number of consecutive periods, which means the model is learning the noise in the training data and not generalizing well to the validation data.
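
A boto3 sketch of the CloudWatch side is shown below; the metric name, dimension values, threshold, and SNS topic ARN are placeholders, and the validation-loss value itself would be computed in the training script with the MXNet metric API as described above:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Push a custom validation-loss metric from the training script.
cloudwatch.put_metric_data(
    Namespace="MXNetTraining",
    MetricData=[{
        "MetricName": "ValidationLoss",
        "Dimensions": [{"Name": "TrainingJobName", "Value": "digit-classifier-001"}],
        "Value": 0.42,
        "Unit": "None",
    }],
)

# Alarm that notifies an SNS topic when validation loss stays high,
# a rough proxy for overfitting.
cloudwatch.put_metric_alarm(
    AlarmName="digit-classifier-overfitting",
    Namespace="MXNetTraining",
    MetricName="ValidationLoss",
    Dimensions=[{"Name": "TrainingJobName", "Value": "digit-classifier-001"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.5,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-alerts"],
)
```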

Reference:

1: Log Amazon SageMaker API Calls with AWS CloudTrail – Amazon SageMaker

2: What Is Amazon CloudWatch? – Amazon CloudWatch

3: Metric API ― Apache MXNet documentation

CloudWatch – Boto 3 Docs 1.20.21 documentation

Creating Amazon CloudWatch Alarms – Amazon CloudWatch

What is Amazon Simple Notification Service? – Amazon Simple Notification Service

Overfitting and Underfitting – Machine Learning Crash Course

Question #90

A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes.

Which prior probability distribution should the ML Specialist use for this variable?

  • A . Poisson distribution
  • B . Uniform distribution
  • C . Normal distribution
  • D . Binomial distribution

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

The prior probability distribution for the discrete random variable that represents the number of minutes New Yorkers wait for a bus is a Poisson distribution. A Poisson distribution is suitable for modeling the number of events that occur in a fixed interval of time or space, given a known average rate of occurrence. In this case, the event is waiting for a bus, the interval is 10 minutes, and the average rate is 3 minutes. The Poisson distribution can capture the variability of the waiting time, which can range from 0 to 10 minutes, with different probabilities.
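
A quick SciPy sketch of this prior follows; it simply evaluates a Poisson distribution with mean 3 over the 0–10 minute range used in the question (the bounded range is the question's framing, not a property of the Poisson itself):

```python
from scipy.stats import poisson

mu = 3  # mean waiting time in minutes

# Probability of waiting exactly k minutes, for k = 0..10 (buses cycle every 10 minutes).
for k in range(0, 11):
    print(f"P(wait = {k} min) = {poisson.pmf(k, mu):.3f}")

print("P(wait <= 5 min) =", round(poisson.cdf(5, mu), 3))
```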

Reference:

1: Poisson Distribution – Amazon SageMaker

2: Poisson Distribution – Wikipedia

Question #91

A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy. The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network.

How should the Data Science team configure the notebook instance placement to meet these requirements?

  • A . Associate the Amazon SageMaker notebook with a private subnet in a VPC. Place the Amazon SageMaker endpoint and S3 buckets within the same VPC.
  • B . Associate the Amazon SageMaker notebook with a private subnet in a VPC. Use IAM policies to grant access to Amazon S3 and Amazon SageMaker.
  • C . Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it.
  • D . Associate the Amazon SageMaker notebook with a private subnet in a VPC. Ensure the VPC has a NAT gateway and an associated security group allowing only outbound connections to Amazon S3
    and Amazon SageMaker

Reveal Solution Hide Solution

Correct Answer: C
C

Explanation:

To configure the notebook instance placement to meet the requirements, the Data Science team should associate the Amazon SageMaker notebook with a private subnet in a VPC. A VPC is a virtual network that is logically isolated from other networks in AWS. A private subnet is a subnet that has no internet gateway attached to it, and therefore cannot communicate with the internet. By placing the notebook instance in a private subnet, the team can ensure that it stays within a secured VPC with no internet access.

However, to access data stored in Amazon S3 buckets and other AWS services, the team needs to ensure that the VPC has S3 VPC endpoints and Amazon SageMaker VPC endpoints attached to it. A VPC endpoint is a gateway that enables private connections between the VPC and supported AWS services. A VPC endpoint does not require an internet gateway, a NAT device, or a VPN connection, and ensures that the traffic between the VPC and the AWS service does not leave the AWS network. By using VPC endpoints, the team can access Amazon S3 and Amazon SageMaker from the notebook instance without compromising data privacy or security.
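
A boto3 sketch of creating these endpoints is shown below; the VPC, route table, subnet, and security group IDs are placeholders, and us-east-1 is an assumed region (the service names are region-qualified):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for Amazon S3.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)

# Interface endpoints for the SageMaker API and runtime.
for service in ("com.amazonaws.us-east-1.sagemaker.api",
                "com.amazonaws.us-east-1.sagemaker.runtime"):
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service,
        VpcEndpointType="Interface",
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
    )
```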

Reference: What Is Amazon VPC? – Amazon Virtual Private Cloud

Subnet Routing – Amazon Virtual Private Cloud

VPC Endpoints – Amazon Virtual Private Cloud

Question #92

A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data.

Which of the following methods should the Specialist consider using to correct this? (Select THREE.)

  • A . Decrease regularization.
  • B . Increase regularization.
  • C . Increase dropout.
  • D . Decrease dropout.
  • E . Increase feature combinations.
  • F . Decrease feature combinations.

Reveal Solution Hide Solution

Correct Answer: B, C, F
B, C, F

Explanation:

The problem of poor performance on the test data is a sign of overfitting, which means the model has learned the training data too well and failed to generalize to new and unseen data. To correct this, the Machine Learning Specialist should consider using methods that reduce the complexity of the model and increase its ability to generalize.

Some of these methods are:

Increase regularization: Regularization is a technique that adds a penalty term to the loss function of the model, which reduces the magnitude of the model weights and prevents overfitting. There are different types of regularization, such as L1, L2, and elastic net, that apply different penalties to the weights1.

Increase dropout: Dropout is a technique that randomly drops out some units or connections in the neural network during training, which reduces the co-dependency of the units and prevents overfitting. Dropout can be applied to different layers of the network, and the dropout rate can be tuned to control the amount of dropout2.

Decrease feature combinations: Feature combinations are the interactions between different input features that can be used to create new features for the model. However, too many feature combinations can increase the complexity of the model and cause overfitting. Therefore, the Specialist should decrease the number of feature combinations and select only the most relevant and informative ones for the model3.

Reference:

1: Regularization for Deep Learning – Amazon SageMaker

2: Dropout – Amazon SageMaker

3: Feature Engineering – Amazon SageMaker

Question #93

A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.

The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards.

Which solution should the Data Scientist build to satisfy the requirements?

  • A . Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • B . Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • C . Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
  • D . Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.

Correct Answer: A

Explanation:

To create a serverless ingestion and analytics solution for high-velocity, real-time streaming data, the Data Scientist should use the following AWS services:

AWS Glue Data Catalog: This is a managed service that acts as a central metadata repository for data assets across AWS and on-premises data sources. The Data Scientist can use AWS Glue Data Catalog to create a schema of the incoming data format, which defines the structure, format, and data types of the JSON records. The schema can be used by other AWS services to understand and process the data1.

Amazon Kinesis Data Firehose: This is a fully managed service that delivers real-time streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. The Data Scientist can use Amazon Kinesis Data Firehose to stream the data from the source and transform the data to a query-optimized, columnar format such as Apache Parquet or ORC using the AWS Glue Data Catalog before delivering to Amazon S3. This enables efficient compression, partitioning, and fast analytics on the data2.

Amazon S3: This is an object storage service that offers high durability, availability, and scalability. The Data Scientist can use Amazon S3 as the output datastore for the transformed data, which can be organized into buckets and prefixes according to the desired partitioning scheme. Amazon S3 also integrates with other AWS services such as Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum for analytics3.

Amazon Athena: This is a serverless interactive query service that allows users to analyze data in Amazon S3 using standard SQL. The Data Scientist can use Amazon Athena to run SQL queries against the data in Amazon S3 and connect to existing business intelligence dashboards using the Athena Java Database Connectivity (JDBC) connector. Amazon Athena leverages the AWS Glue Data Catalog to access the schema information and supports formats such as Parquet and ORC for fast and cost-effective queries4.
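
A minimal boto3 sketch of the Kinesis Data Firehose piece is shown below; the stream name, bucket, role ARNs, and the Glue database and table are placeholder assumptions, and the buffering values would be tuned to the workload.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-parquet",   # placeholder stream name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::analytics-data-lake",
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 128},
        # Convert incoming JSON records to Parquet using the schema registered in Glue.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "DatabaseName": "streaming_db",      # Glue Data Catalog database (placeholder)
                "TableName": "clickstream_events",   # table describing the JSON schema
                "RoleARN": "arn:aws:iam::111122223333:role/FirehoseDeliveryRole",
            },
        },
    },
)
```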

Reference:

1: What Is the AWS Glue Data Catalog? – AWS Glue

2: What Is Amazon Kinesis Data Firehose? – Amazon Kinesis Data Firehose

3: What Is Amazon S3? – Amazon Simple Storage Service

4: What Is Amazon Athena? – Amazon Athena

Question #94

A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet.

How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?

  • A . Create a NAT gateway within the corporate VPC.
  • B . Route Amazon SageMaker traffic through an on-premises network.
  • C . Create Amazon SageMaker VPC interface endpoints within the corporate VPC.
  • D . Create VPC peering with Amazon VPC hosting Amazon SageMaker.

Correct Answer: C

Explanation:

To enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances, the company should create Amazon SageMaker VPC interface endpoints within the corporate VPC. A VPC interface endpoint is an elastic network interface in the VPC that enables private connections to supported AWS services without requiring an internet gateway, a NAT device, a VPN connection, or an AWS Direct Connect connection. The instances in the VPC do not need to connect to the public internet in order to communicate with the Amazon SageMaker service. The VPC interface endpoint connects the VPC directly to the Amazon SageMaker service using AWS PrivateLink, which ensures that the traffic between the VPC and the service does not leave the AWS network1.
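
As a sketch, the interface endpoints for the SageMaker API and runtime could be created with boto3 as follows; the VPC, subnet, and security group IDs and the Region are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# One interface endpoint each for the SageMaker control plane and the runtime.
for suffix in ("sagemaker.api", "sagemaker.runtime"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0abc1234",                          # corporate VPC (placeholder)
        ServiceName=f"com.amazonaws.us-east-1.{suffix}",
        SubnetIds=["subnet-0abc1234"],
        SecurityGroupIds=["sg-0abc1234"],
        PrivateDnsEnabled=True,   # resolve the public SageMaker hostnames to private IPs
    )
```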

Reference:

1: Connect to SageMaker Within your VPC – Amazon SageMaker

Question #95

An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES. The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time.

Which solution should the agency consider?

  • A . Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees, and alert when non-employees are detected.
  • B . Use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Image to detect faces from a collection of known employees and alert when non-employees are detected.
  • C . Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection on each stream, and alert when non-employees are detected.
  • D . Install AWS DeepLens cameras and use the DeepLens_Kinesis_Video module to stream video to Amazon Kinesis Video Streams for each camera. On each stream, run an AWS Lambda function to capture image fragments and then call Amazon Rekognition Image to detect faces from a collection of known employees, and alert when non-employees are detected.

Correct Answer: A

Explanation:

The solution that the agency should consider is to use a proxy server at each local office and for each camera, and stream the RTSP feed to a unique Amazon Kinesis Video Streams video stream. On each stream, use Amazon Rekognition Video and create a stream processor to detect faces from a collection of known employees, and alert when non-employees are detected.

This solution has the following advantages:

It can handle thousands of video cameras in real time, as Amazon Kinesis Video Streams can scale elastically to support any number of producers and consumers1.

It can leverage the Amazon Rekognition Video API, which is designed and optimized for video analysis, and can detect faces in challenging conditions such as low lighting, occlusions, and different poses2.

It can use a stream processor, which is a feature of Amazon Rekognition Video that allows you to create a persistent application that analyzes streaming video and stores the results in a Kinesis data stream3. The stream processor can compare the detected faces with a collection of known employees, which is a container for persisting faces that you want to search for in the input video stream4. The stream processor can also send notifications to Amazon Simple Notification Service (Amazon SNS) when non-employees are detected, which can trigger downstream actions such as sending alerts or storing the events in Amazon Elasticsearch Service (Amazon ES)3.
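
A stream processor for a single camera might be wired up roughly as follows; the stream ARNs, role, collection ID, and match threshold are illustrative assumptions.

```python
import boto3

rekognition = boto3.client("rekognition")

rekognition.create_stream_processor(
    Name="office-entrance-01",   # placeholder name; one processor per camera stream
    Input={"KinesisVideoStream": {
        "Arn": "arn:aws:kinesisvideo:us-east-1:111122223333:stream/office-entrance-01/1"}},
    Output={"KinesisDataStream": {
        "Arn": "arn:aws:kinesis:us-east-1:111122223333:stream/face-search-results"}},
    RoleArn="arn:aws:iam::111122223333:role/RekognitionStreamProcessorRole",
    Settings={"FaceSearch": {
        "CollectionId": "known-employees",   # collection of indexed employee faces
        "FaceMatchThreshold": 85.0,
    }},
)
rekognition.start_stream_processor(Name="office-entrance-01")
```

Downstream, a consumer of the Kinesis data stream would flag detections with no match in the collection and publish an alert, for example through Amazon SNS.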

Reference:

1: What Is Amazon Kinesis Video Streams? – Amazon Kinesis Video Streams

2: Detecting and Analyzing Faces – Amazon Rekognition

3: Using Amazon Rekognition Video Stream Processor – Amazon Rekognition

4: Working with Stored Faces – Amazon Rekognition

Question #96

A financial services company is building a robust serverless data lake on Amazon S3.

The data lake should be flexible and meet the following requirements:

* Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.

* Support event-driven ETL pipelines.

* Provide a quick and easy way to understand metadata.

Which approach meets these requirements?

  • A . Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data catalog to search and discover metadata.
  • B . Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata.
  • C . Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata.
  • D . Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.

Correct Answer: A

Explanation:

To build a robust serverless data lake on Amazon S3 that meets the requirements, the financial services company should use the following AWS services:

AWS Glue crawler: This is a service that connects to a data store, progresses through a prioritized list of classifiers to determine the schema for the data, and then creates metadata tables in the AWS Glue Data Catalog1. The company can use an AWS Glue crawler to crawl the S3 data and infer the schema, format, and partition structure of the data. The crawler can also detect schema changes and update the metadata tables accordingly. This enables the company to support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum, which are serverless interactive query services that use the AWS Glue Data Catalog as a central location for storing and retrieving table metadata23.

AWS Lambda function: This is a service that lets you run code without provisioning or managing servers. You pay only for the compute time you consume – there is no charge when your code is not running. You can also use AWS Lambda to create event-driven ETL pipelines, by triggering other AWS services based on events such as object creation or deletion in S3 buckets4. The company can use an AWS Lambda function to trigger an AWS Glue ETL job, which is a serverless way to extract, transform, and load data for analytics. The AWS Glue ETL job can perform various data processing tasks, such as converting data formats, filtering, aggregating, joining, and more.

AWS Glue Data Catalog: This is a managed service that acts as a central metadata repository for data assets across AWS and on-premises data sources. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data. The company can use the AWS Glue Data Catalog to search and discover metadata, such as table definitions, schemas, and partitions. The AWS Glue Data Catalog also integrates with Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL jobs, providing a consistent view of the data across different query and analysis services.
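
The event-driven piece can be as small as a Lambda handler that starts a Glue ETL job whenever new objects land in the raw data bucket; the job name and argument key below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by an S3 Put event notification; kick off the ETL job for each new object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="raw-to-parquet-etl",                      # placeholder Glue job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
    return {"status": "started"}
```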

Reference:

1: What Is a Crawler? – AWS Glue

2: What Is Amazon Athena? – Amazon Athena

3: Amazon Redshift Spectrum – Amazon Redshift

4: What is AWS Lambda? – AWS Lambda

AWS Glue ETL Jobs – AWS Glue

What Is the AWS Glue Data Catalog? – AWS Glue

Question #97

A company’s Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily.

The model accuracy is acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes.

What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?

  • A . Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up the training.
  • B . Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals.
  • C . Switch to using a built-in AWS SageMaker DeepAR model. Parallelize the training to as many machines as needed to achieve the business goals.
  • D . Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve the business goals.

Correct Answer: B

Explanation:

To improve the training speed of a time-series forecasting model using TensorFlow, the Machine Learning Specialist should change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Horovod is a free and open-source software framework for distributed deep learning training using TensorFlow, Keras, PyTorch, and Apache MXNet1. Horovod can scale up to hundreds of GPUs with upwards of 90% scaling efficiency2. Horovod is easy to use, as it requires only a few lines of Python code to modify an existing training script2. Horovod is also portable, as it runs the same for TensorFlow, Keras, PyTorch, and MXNet; on premise, in the cloud, and on Apache Spark2.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly3. Amazon SageMaker supports Horovod as a built-in distributed training framework, which means that the Machine Learning Specialist does not need to install or configure Horovod separately4. Amazon SageMaker also provides a number of features and tools to simplify and optimize the distributed training process, such as automatic scaling, debugging, profiling, and monitoring4. By using Amazon SageMaker, the Machine Learning Specialist can parallelize the training to as many machines as needed to achieve the business goals, while minimizing coding effort and infrastructure changes.
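
A sketch of the corresponding SageMaker Python SDK call is shown below, assuming the existing train.py has been given the standard Horovod initialization hooks; the instance type and count, framework version, and S3 path are illustrative.

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",             # existing script plus the usual Horovod hooks
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=4,                   # scale out across machines as data volume grows
    instance_type="ml.p3.2xlarge",      # one GPU per instance
    framework_version="2.11",
    py_version="py39",
    distribution={"mpi": {"enabled": True, "processes_per_host": 1}},
)
estimator.fit("s3://forecasting-training-data/daily/")   # placeholder S3 path
```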

Reference:

1: Horovod (machine learning) – Wikipedia

2: Home – Horovod

3: Amazon SageMaker – Machine Learning Service – AWS

4: Use Horovod with Amazon SageMaker – Amazon SageMaker

Question #98

A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier:

Total number of images available = 1,000
Test set images = 100 (constant test set)

The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners.

Which techniques can be used by the ML Specialist to improve this specific test error?

  • A . Increase the training data by adding variation in rotation for training images.
  • B . Increase the number of epochs for model training.
  • C . Increase the number of layers for the neural network.
  • D . Increase the dropout rate for the second-to-last layer.

Correct Answer: A

Explanation:

To improve the test error for the image classifier, the Machine Learning Specialist should use the technique of increasing the training data by adding variation in rotation for training images. This technique is called data augmentation, which is a way of artificially expanding the size and diversity of the training dataset by applying various transformations to the original images, such as rotation, flipping, cropping, scaling, etc. Data augmentation can help the model learn more robust features that are invariant to the orientation, position, and size of the objects in the images. This can improve the generalization ability of the model and reduce the test error, especially for cases where the images are not well-aligned or have different perspectives1.
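
A minimal tf.keras sketch of this kind of augmentation is shown below; the layer choices, rotation factor, and the network after the augmentation block are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation layers are active only during training; random flips and rotations
# expose the model to upside-down cats without collecting new images.
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.5),   # up to +/- 180 degrees of rotation
])

model = tf.keras.Sequential([
    augmentation,
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```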

Reference: 1: Image Augmentation – Amazon SageMaker

Question #99

A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations.

The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset.

The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives.

Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Select TWO.)

  • A . Change the XGBoost eval_metric parameter to optimize based on rmse instead of error.
  • B . Increase the XGBoost scale_pos_weight parameter to adjust the balance of positive and negative weights.
  • C . Increase the XGBoost max_depth parameter because the model is currently underfitting the data.
  • D . Change the XGBoost eval_metric parameter to optimize based on AUC instead of error.
  • E . Decrease the XGBoost max_depth parameter because the model is currently overfitting the data.

Correct Answer: B, D

Explanation:

The XGBoost algorithm is a popular machine learning technique for classification problems. It is based on the idea of boosting, which is to combine many weak learners (decision trees) into a strong learner (ensemble model).

The XGBoost algorithm can handle imbalanced data by using the scale_pos_weight parameter, which controls the balance of positive and negative weights in the objective function. A typical value to consider is the ratio of negative cases to positive cases in the data. By increasing this parameter, the algorithm will pay more attention to the minority class (positive) and reduce the number of false negatives.

The XGBoost algorithm can also use different evaluation metrics to optimize the model performance. The default metric is error, which is the misclassification rate. However, this metric can be misleading for imbalanced data, as it does not account for the different costs of false positives and false negatives. A better metric to use is AUC, which is the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate against the false positive rate for different threshold values. The AUC measures how well the model can distinguish between the two classes, regardless of the threshold. By changing the eval_metric parameter to AUC, the algorithm will try to maximize the AUC score and reduce the number of false negatives.

Therefore, the combination of steps that should be taken to reduce the number of false negatives are to increase the scale_pos_weight parameter and change the eval_metric parameter to AUC.
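
With the SageMaker built-in XGBoost algorithm, both changes are hyperparameters on the training job; the sketch below assumes CSV data in S3 and uses placeholder role, bucket, and version values.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://fraud-model-artifacts/output",               # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    scale_pos_weight=100,   # roughly 100,000 negatives / 1,000 positives
    eval_metric="auc",      # optimize ranking quality instead of raw error rate
    num_round=200,
)
xgb.fit({
    "train": TrainingInput("s3://fraud-data/train/", content_type="csv"),
    "validation": TrainingInput("s3://fraud-data/validation/", content_type="csv"),
})
```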

Reference: XGBoost Parameters

XGBoost for Imbalanced Classification

Question #100

A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access.

Which approach should the Specialist use to continue working?

  • A . Install Python 3 and boto3 on their laptop and continue the code development using that environment.
  • B . Download the TensorFlow Docker container used in Amazon SageMaker from GitHub to their local environment, and use the Amazon SageMaker Python SDK to test the code.
  • C . Download TensorFlow from tensorflow.org to emulate the TensorFlow kernel in the SageMaker environment.
  • D . Download the SageMaker notebook to their local environment then install Jupyter Notebooks on their laptop and continue the development in a local notebook.

Correct Answer: B

Explanation:

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. SageMaker provides a variety of tools and frameworks to support the entire machine learning workflow, from data preparation to model deployment.

One of the tools that SageMaker offers is the Amazon SageMaker Python SDK, which is a high-level library that simplifies the interaction with SageMaker APIs and services. The SageMaker Python SDK allows you to write code in Python and use popular frameworks such as TensorFlow, PyTorch, MXNet, and more. You can use the SageMaker Python SDK to create and manage SageMaker resources such as notebook instances, training jobs, endpoints, and feature store.

If you need to continue working on a TensorFlow project using SageMaker for training without Wi-Fi access, the best approach is to download the TensorFlow Docker container used in SageMaker from GitHub to your local environment, and use the SageMaker Python SDK to test the code. This way, you can ensure that your code is compatible with the SageMaker environment and avoid any potential issues when you upload your code to SageMaker and start the training job. You can also use the same code to deploy your model to a SageMaker endpoint when you have Wi-Fi access again.

To download the TensorFlow Docker container used in SageMaker, you can visit the SageMaker Docker GitHub repository and follow the instructions to build the image locally. You can also use the SageMaker Studio Image Build CLI to automate the process of building and pushing the Docker image to Amazon Elastic Container Registry (Amazon ECR). To use the SageMaker Python SDK to test the code, you can install the SDK on your local machine by following the installation guide. You can also refer to the TensorFlow documentation for more details on how to use the SageMaker Python SDK with TensorFlow.
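
In practice this amounts to running the SageMaker Python SDK in local mode, which points the estimator at the locally available SageMaker TensorFlow container instead of a remote training cluster; the script name, data path, framework version, and role ARN below are assumptions.

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                 # the project's training script (placeholder)
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_count=1,
    instance_type="local",                  # run in the already-downloaded Docker image
    framework_version="2.11",
    py_version="py39",
)

# file:// inputs keep everything offline; swap in s3:// paths once connectivity returns.
estimator.fit("file:///home/user/project/data/train")
```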

Reference:

SageMaker Docker GitHub repository

SageMaker Studio Image Build CLI

SageMaker Python SDK installation guide

SageMaker Python SDK TensorFlow documentation
