Professional Data Engineer on Google Cloud Platform
Last exam update: Dec 18, 2024
Page 1 out of 13. Viewing questions 1-15 out of 184
Question 1
You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future while keeping the original data for compliance reasons. What should you do?
A. Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.
B. Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.
C. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.
D. Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.
Answer: A
User Votes: C (2 votes), D (2 votes)
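For study purposes, a minimal Apache Beam (Python) sketch of the approach in option C, assuming the logs are CSV text files with a numeric value in the last field and using hypothetical bucket paths; the original files are left untouched and the corrected records are written to a new location:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

EXPECTED_MIN, EXPECTED_MAX = 0.0, 100.0  # assumed expected range

def apply_default(line):
    """Parse a CSV record and replace out-of-range values with a default."""
    fields = line.split(",")
    try:
        value = float(fields[-1])
    except ValueError:
        value = EXPECTED_MIN  # non-numeric values also get the default
    if not EXPECTED_MIN <= value <= EXPECTED_MAX:
        value = EXPECTED_MIN  # default for out-of-range values
    return ",".join(fields[:-1] + [str(value)])

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-logs/raw/*.csv")      # original data stays put
     | "FixRanges" >> beam.Map(apply_default)
     | "Write" >> beam.io.WriteToText("gs://my-logs/cleaned/part"))  # new dataset
```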
Question 2
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?
A. Grant the consultant the Viewer role on the project.
B. Grant the consultant the Cloud Dataflow Developer role on the project.
C. Create a service account and allow the consultant to log on with it.
D. Create an anonymized sample of the data for the consultant to work with in a different project.
Answer: C
User Votes: C (1 vote), D (4 votes)
Discussions
omoomisore (8 months, 2 weeks ago): Create an anonymized sample of the data for the consultant to work with in a different project.
Question 3
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?
A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Functions
D. Cloud Composer
Answer: A
User Votes: D (3 votes)
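Since Cloud Composer (option D) is managed Apache Airflow, a minimal DAG sketch with a schedule, ordered dependencies, and a fixed retry count may help; the DAG ID, schedule, shell script, and query are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    "retries": 3,                        # retry failed steps a fixed number of times
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_batch",
    schedule_interval="0 2 * * *",       # run on a schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    prepare = BashOperator(task_id="prepare", bash_command="./prepare.sh")
    bq_step = BigQueryInsertJobOperator(
        task_id="load_summary",
        configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},
    )
    prepare >> bq_step                   # enforce execution order between interdependent steps
```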
Question 4
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?
A. Use Google Stackdriver Audit Logs to review data access.
B. Get the identity and access management (IAM) policy of each table.
C. Use Stackdriver Monitoring to see the usage of BigQuery query slots.
D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
Answer: A
User Votes: A (1 vote), B (2 votes)
Question 5
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?
A. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
B. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
D. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
Answer: C
User Votes: C (2 votes)
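A minimal sketch of option C using the google-cloud-spanner Python client; the instance, database, table, and column names are assumptions:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Secondary index to optimize range queries on a non-key column.
op = database.update_ddl(
    ["CREATE INDEX OrdersByShipDate ON Orders(ShipDate)"]
)
op.result()  # wait for the schema change to complete
```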
Question 6
You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a storage, backup, and recovery strategy for this data that minimizes cost. How should you configure the BigQuery table?
A. Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
B. Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
C. Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
D. Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
Answer: B
User Votes: C (2 votes)
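The "copy to a time-suffixed table" backup idea in options B and D can also be driven from the BigQuery Python client; a rough sketch with assumed project, dataset, and table names:

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()
source = "my-project.analytics.events"
suffix = datetime.now(timezone.utc).strftime("%Y%m%d")
destination = f"my-project.analytics_backup.events_{suffix}"

job = client.copy_table(source, destination)  # server-side copy, no export needed
job.result()                                  # wait for the backup copy to complete
```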
Question 7
You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?
A. Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
B. Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
C. Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
D. Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
Answer: C
User Votes: A (2 votes)
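A minimal sketch of option A with the BigQuery Python client: the view derives FullName at query time, so no data is copied or rewritten. Project, dataset, and table names are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("my-project.hr.users_with_fullname")
view.view_query = """
    SELECT
      FirstName,
      LastName,
      CONCAT(FirstName, ' ', LastName) AS FullName
    FROM `my-project.hr.Users`
"""
client.create_table(view)  # no data is copied, so storage cost is unchanged
```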
Question 8
You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?
A. Store and process the entire dataset in BigQuery.
B. Store and process the entire dataset in Cloud Bigtable.
C. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.
D. Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.
Answer: D
User Votes: A (1 vote), C (1 vote)
Question 9
You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?
A. Deploy small Kafka clusters in your data centers to buffer events.
B. Have the data acquisition devices publish data to Cloud Pub/Sub.
C. Establish a Cloud Interconnect between all remote data centers and Google.
D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.
Answer: A
User Votes: B (2 votes), C (1 vote), D (1 vote)
Question 10
Your company needs to upload their historic data to Cloud Storage. The security rules don't allow access from external IPs to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every day. What should they do?
A. Execute gsutil rsync from the on-premises servers.
B. Use Cloud Dataflow and write the data to Cloud Storage.
C. Write a job template in Cloud Dataproc to perform the data transfer.
D. Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.
Answer: B
User Votes: A (2 votes)
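A minimal sketch of option A: a daily job running on the on-premises servers pushes only new or changed files with gsutil rsync, so no inbound access from external IPs is required. The local path and bucket name are assumptions:

```python
import subprocess

# Sync the local export directory to Cloud Storage; -m parallelizes, -r recurses.
subprocess.run(
    ["gsutil", "-m", "rsync", "-r", "/data/exports", "gs://company-historic-data/exports"],
    check=True,  # fail loudly so the daily scheduler can retry
)
```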
Question 11
You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?
A. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
B. Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
C. Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
D. Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
Answer: A
User Votes: A (1 vote), B (2 votes)
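A minimal sketch of the final load step in option A, using the BigQuery Python client to load an Avro export from Cloud Storage; the URI and table name are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://secure-staging-bucket/patients/*.avro",
    "my-project.clinical.patient_records",
    job_config=job_config,
)
load_job.result()  # Avro is self-describing, so no explicit schema is needed
```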
Question 12
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
A. Implement clustering in BigQuery on the ingest date column.
B. Implement clustering in BigQuery on the package-tracking ID column.
C. Tier older data onto Google Cloud Storage files and create a BigQuery table using GCS as an external data source.
D. Re-create the table using data partitioning on the package delivery date.
Answer: B
User Votes: B (2 votes)
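A rough sketch of re-creating the table with delivery-date partitioning and clustering on the tracking ID (combining the ideas in options B and D); column and table names are assumptions, and delivery_date is assumed to be a TIMESTAMP column:

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE TABLE `my-project.logistics.package_tracking_v2`
    PARTITION BY DATE(delivery_date)
    CLUSTER BY tracking_id AS
    SELECT * FROM `my-project.logistics.package_tracking`
""").result()  # wait for the table re-creation to finish
```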
Question 13
You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys. How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?
A. Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
B. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
C. Scan every table in BigQuery, and mask the data it finds that has PII.
D. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
Answer: B
User Votes: A (2 votes)
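A rough sketch of pseudonymization with the Cloud DLP Python client in the spirit of option D: a deterministic crypto transformation produces the same token for the same input, which preserves join keys. The key material, info types, sample content, and config layout are assumptions to verify against the current DLP documentation:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
info_types = [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}]

deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": info_types,
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    "crypto_key": {"unwrapped": {"key": b"0" * 32}},   # placeholder 256-bit key
                    "surrogate_info_type": {"name": "TOKEN"},
                },
            },
        }],
    },
}

response = dlp.deidentify_content(
    request={
        "parent": "projects/my-project",
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": info_types},
        "item": {"value": "Jane Doe <jane@example.com> ordered item 42"},
    }
)
print(response.item.value)  # PII replaced with deterministic tokens
```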
Question 14
You are updating the code for a subscriber to a Pub/Sub feed. You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss. Your subscriber is not set up to retain acknowledged messages. What should you do to ensure that you can recover from errors after deployment?
A. Set up the Pub/Sub emulator on your local machine. Validate the behavior of your new subscriber logic before deploying it to production.
B. Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.
C. Use Cloud Build for your deployment. If an error occurs after deployment, use a Seek operation to locate a timestamp logged by Cloud Build at the start of the deployment.
D. Enable dead-lettering on the Pub/Sub topic to capture messages that aren't successfully acknowledged. If an error occurs after deployment, re-deliver any messages captured by the dead-letter queue.
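A minimal sketch of option B with the google-cloud-pubsub client: snapshot the subscription before deploying, then seek back to it if the new subscriber acknowledges messages it should not have. Project, subscription, and snapshot names are assumptions:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "orders-sub")
snapshot_path = subscriber.snapshot_path("my-project", "pre-deploy-snapshot")

# Before deployment: capture the subscription's current acknowledgment state.
subscriber.create_snapshot(
    request={"name": snapshot_path, "subscription": subscription_path}
)

# After a bad deployment: replay messages from the snapshot onward.
subscriber.seek(request={"subscription": subscription_path, "snapshot": snapshot_path})
```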
Question 15
You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?
A. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
C. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
D. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination
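A rough sketch of the source-side alert (growth in subscription/num_undelivered_messages) using the Cloud Monitoring Python client; the threshold, duration, and names are assumptions, and notification channels are omitted:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow pipeline is not draining the Pub/Sub backlog",
    combiner="AND",
    conditions=[{
        "display_name": "Undelivered messages above threshold",
        "condition_threshold": {
            "filter": (
                'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
                'AND resource.type="pubsub_subscription"'
            ),
            "comparison": "COMPARISON_GT",
            "threshold_value": 1000,                          # assumed backlog threshold
            "duration": duration_pb2.Duration(seconds=600),   # sustained for 10 minutes
        },
    }],
)

client.create_alert_policy(name="projects/my-project", alert_policy=policy)
```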