Clusters and masters
As you use Red Hat OpenShift on IBM Cloud, consider these techniques for general troubleshooting and debugging your cluster and cluster master.
General ways to resolve issues
- Keep the cluster environment up to date.
- Check monthly for available security and operating system patches to update your worker nodes.
- Update your cluster to the latest default version for OpenShift.
- Make sure that your command line tools are up to date.
- In the terminal, you are notified when updates to the ibmcloud CLI and plug-ins are available. Be sure to keep your CLI up to date so that you can use all available commands and flags.
- Make sure that your oc CLI client matches the same Kubernetes version as your cluster server. Kubernetes does not support oc client versions that are 2 or more versions apart from the server version (n +/- 2).
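For example, you can compare your client and server versions before you troubleshoot further; the exact flag support and output format vary by oc release.
oc version --short
If the client version is more than two minor versions away from the server version, install an oc client that matches your cluster's Kubernetes version.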
Reviewing issues and status
- To see whether IBM Cloud is available, check the IBM Cloud status page.
- Filter for the Kubernetes Service component.
Running tests with the Diagnostics and Debug Tool
While you troubleshoot, you can use the IBM Cloud Kubernetes Service Diagnostics and Debug Tool to run tests and gather pertinent information from your cluster.
Infrastructure provider:
Classic
VPC Generation 2 compute
Before beginning: If you previously installed the debug tool by using Helm, first uninstall the ibmcloud-iks-debug Helm chart.
Find the installation name of your Helm chart.
helm list -n <project> | grep ibmcloud-iks-debug
Example output:
<helm_chart_name> 1 Thu Sep 13 16:41:44 2019 DEPLOYED ibmcloud-iks-debug-1.0.0 default
Uninstall the debug tool installation by deleting the Helm chart.
helm uninstall <helm_chart_name> -n <project>
Verify that the debug tool pods are removed. When the uninstallation is complete, no pods are returned by the following command.
oc get pod --all-namespaces | grep ibmcloud-iks-debug
To enable and use the Diagnostics and Debug Tool add-on:
In your cluster dashboard, click the name of the cluster where you want to install the debug tool add-on.
Click the Add-ons tab.
On the Diagnostics and Debug Tool card, click Install.
In the dialog box, click Install. Note that it can take a few minutes for the add-on to be installed.
On the Diagnostics and Debug Tool card, click Dashboard.
In the debug tool dashboard, select individual tests or a group of tests to run. Some tests check for potential warnings, errors, or issues, and some tests only gather information that you can reference while you troubleshoot. For more information about the function of each test, click the information icon next to the test's name.
Click Run.
Check the results of each test.
- If any test fails, click the information icon next to the test's name in the left column for information about how to resolve the issue.
- You can also use the results of tests to gather information, such as complete YAMLs, that can help you debug your cluster in the following sections.
Debugging clusters
Review the options to debug your clusters and find the root causes for failures.
Infrastructure provider:
Classic
VPC Generation 2 compute
List your cluster and find the State of the cluster.
ibmcloud oc cluster ls
Review the State of your cluster. If your cluster is in a Critical, Delete failed, or Warning state, or is stuck in the Pending state for a long time, start debugging the worker nodes.
You can view the current cluster state by running the ibmcloud oc cluster ls command and locating the State field.
- Aborted: The deletion of the cluster is requested by the user before the Kubernetes master is deployed. After the deletion of the cluster is completed, the cluster is removed from your dashboard. If your cluster is stuck in this state for a long time, open an IBM Cloud support case.
- Critical: The Kubernetes master cannot be reached or all worker nodes in the cluster are down. If you enabled IBM Key Protect in your cluster, the Key Protect container might fail to encrypt or decrypt your cluster secrets. If so, you can view an error with more information when you run oc get secrets.
- Delete failed: The Kubernetes master or at least one worker node cannot be deleted. List worker nodes by running ibmcloud oc worker ls --cluster <cluster_name_or_ID>. If worker nodes are listed, see Unable to create or delete worker nodes. If no workers are listed, open an IBM Cloud support case.
- Deleted: The cluster is deleted but not yet removed from your dashboard. If your cluster is stuck in this state for a long time, open an IBM Cloud support case.
- Deleting: The cluster is being deleted and cluster infrastructure is being dismantled. You cannot access the cluster.
- Deploy failed: The deployment of the Kubernetes master could not be completed. You cannot resolve this state. Contact IBM Cloud support by opening an IBM Cloud support case.
- Deploying: The Kubernetes master is not fully deployed yet. You cannot access your cluster. Wait until your cluster is fully deployed to review the health of your cluster.
- Normal: All worker nodes in a cluster are up and running. You can access the cluster and deploy apps to the cluster. This state is considered healthy and does not require an action from you. Although the worker nodes might be normal, other infrastructure resources, such as networking and storage, might still need attention. If you just created the cluster, some parts of the cluster that are used by other services, such as Ingress secrets or registry image pull secrets, might still be in process.
- Pending: The Kubernetes master is deployed. The worker nodes are being provisioned and are not available in the cluster yet. You can access the cluster, but you cannot deploy apps to the cluster.
- Requested: A request to create the cluster and order the infrastructure for the Kubernetes master and worker nodes is sent. When the deployment of the cluster starts, the cluster state changes to Deploying. If your cluster is stuck in the Requested state for a long time, open an IBM Cloud support case.
- Updating: The Kubernetes API server that runs in your Kubernetes master is being updated to a new Kubernetes API version. During the update, you cannot access or change the cluster. Worker nodes, apps, and resources that the user deployed are not modified and continue to run. Wait for the update to complete to review the health of your cluster.
- Unsupported: The Kubernetes version that the cluster runs is no longer supported. Your cluster's health is no longer actively monitored or reported. Additionally, you cannot add or reload worker nodes. To continue receiving important security updates and support, you must update your cluster. Review the version update preparation actions, then update your cluster to a supported Kubernetes version.
- Warning: One or more of the following conditions applies.
- At least one worker node in the cluster is not available, but other worker nodes are available and can take over the workload. Try to reload the unavailable worker nodes.
- Your cluster has zero worker nodes, such as if you created a cluster without any worker nodes or manually removed all the worker nodes from the cluster. Resize your worker pool to add worker nodes to recover from a Warning state.
- A control plane operation for your cluster failed. View the cluster in the console or run ibmcloud oc cluster get --cluster <cluster_name_or_ID> to check the Master Status for further debugging.
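For example, a quick first pass when the cluster is not in a Normal state might look like the following; the cluster name is a placeholder.
ibmcloud oc cluster get --cluster <cluster_name_or_ID>
ibmcloud oc worker ls --cluster <cluster_name_or_ID>
Review the Master Status field in the first output and the State and Status columns for each worker node in the second output.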
The OpenShift master is the main component that keeps your cluster up and running. The master stores cluster resources and their configurations in the etcd database that serves as the single point of truth for your cluster. The OpenShift API server is the main entry point for all cluster management requests from the worker nodes to the master, or when you want to interact with your cluster resources.
If a master failure occurs, the workloads continue to run on the worker nodes, but you cannot use oc commands to work with your cluster resources or view the cluster health until the OpenShift API server in the master is back up. If a pod goes down during the master outage, the pod cannot be rescheduled until the worker node can reach the OpenShift API server again.
During a master outage, you can still run ibmcloud oc commands against the IBM Cloud Kubernetes Service API to work with your infrastructure resources, such as worker nodes or VLANs. If you change the current cluster configuration, such as by adding worker nodes to or removing worker nodes from the cluster, your changes do not happen until the master is back up.
Reviewing master health
Infrastructure provider:
Classic
VPC Generation 2 compute
Your Red Hat OpenShift on IBM Cloud cluster includes an IBM-managed master with highly available replicas, automatic security patch updates applied for you, and automation in place to recover in case of an incident. You can check the health, status, and state of the cluster master by running ibmcloud oc cluster get --cluster <cluster_name_or_ID>.
Master Health
The Master Health reflects the state of master components and notifies you if something needs your attention. The health might be one of the following:
- error: The master is not operational. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is normal. You can also open an IBM Cloud support case.
- normal: The master is operational and healthy. No action is required.
- unavailable: The master might not be accessible, which means some actions such as resizing a worker pool are temporarily unavailable. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is normal.
- unsupported: The master runs an unsupported version of Kubernetes. You must update your cluster to return the master to normal health.
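For example, to review only the master fields from the cluster details, you can filter the output; this is a sketch that assumes the field labels in your CLI version begin with Master.
ibmcloud oc cluster get --cluster <cluster_name_or_ID> | grep Master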
Master Status and State
The Master Status provides details of what operation from the master state is in progress. The status includes a timestamp of how long the master has been in the same state, such as Ready (1 month ago). The Master State reflects the lifecycle of possible operations that can be performed on the master, such as deploying, updating, and deleting. Each state is described in the following table.
- deployed: The master is successfully deployed. Check the status to verify that the master is Ready or to see whether an update is available.
- deploying: The master is currently deploying. Wait for the state to become deployed before working with your cluster, such as adding worker nodes.
- deploy_failed: The master failed to deploy. IBM Support is notified and works to resolve the issue. Check the Master Status field for more information, or wait for the state to become deployed.
- deleting: The master is currently deleting because you deleted the cluster. You cannot undo a deletion. After the cluster is deleted, you can no longer check the master state because the cluster is completely removed.
- delete_failed: The master failed to delete. IBM Support is notified and works to resolve the issue. You cannot resolve the issue by trying to delete the cluster again. Instead, check the Master Status field for more information, or wait for the cluster to delete. You can also open an IBM Cloud support case.
- updating: The master is updating its Kubernetes version. The update might be a patch update that is automatically applied, or a minor or major version that you applied by updating the cluster. During the update, your highly available master can continue processing requests, and your app workloads and worker nodes continue to run. After the master update is complete, you can update your worker nodes. If the update is unsuccessful, the master returns to a deployed state and continues running the previous version. IBM Support is notified and works to resolve the issue. You can check whether the update failed in the Master Status field.
- update_cancelled: The master update is canceled because the cluster was not in a healthy state at the time of the update. Your master remains in this state until your cluster is healthy and you manually update the master. To update the master, use the ibmcloud oc cluster master update command. If you do not want to update the master to the default major.minor version during the update, include the --version flag and specify the latest patch version that is available for the major.minor version that you want, such as 1.18.9. To list available versions, run ibmcloud oc versions.
- update_failed: The master update failed. IBM Support is notified and works to resolve the issue. You can continue to monitor the health of the master until the master reaches a normal state. If the master remains in this state for more than 1 day, open an IBM Cloud support case. IBM Support might identify other issues in your cluster that must be fixed before the master can be updated.
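For example, to resume a canceled update after the cluster is healthy again, you can list the available versions and then update the master to the version that you want; replace <version> with a value from the versions output, such as the 4.4_openshift value used later in this topic.
ibmcloud oc versions
ibmcloud oc cluster master update --cluster <cluster_name_or_ID> --version <version>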
Debugging OpenShift web console, OperatorHub, internal registry, and other components
Infrastructure provider:
Classic
VPC Generation 2 compute
OpenShift clusters have many built-in components that work together to simplify the developer experience. For example, you can use the OpenShift web console to manage and deploy your cluster workloads, or enable third-party operators from the OperatorHub to enhance your cluster with a service mesh and other capabilities.
Commonly used components include the following:
- OpenShift web console in the openshift-console project
- OperatorHub in the openshift-marketplace project
- Internal registry in the openshift-image-registry project
If these components fail, review the following debug steps.
Some components, such as the OperatorHub, are available only in clusters that run OpenShift version 4, or run in different projects in version 3.11. You can still troubleshoot OpenShift components in 3.11 clusters, but the project and resource names might vary.
- Check that your IBM Cloud account is set up properly. Some common scenarios that can prevent the default components from running properly include the following:
- If you have a firewall, make sure that you open the required ports and IP addresses in the firewall so that you do not block any ingress or egress traffic for the OperatorHub or other OpenShift components.
- If your cluster has multiple zones, or if you have a VPC cluster, make sure that you enable VRF or VLAN spanning. To check whether VRF is already enabled, run ibmcloud account show. To check whether VLAN spanning is enabled, run ibmcloud oc vlan-spanning get.
- Make sure that your account does not use multifactor authentication (MFA). For more information, see Disabling required MFA for all users in your account.
VPC clusters: Check that a public gateway is enabled on each VPC subnet that your cluster is attached to. Public gateways are required for default components such as the web console and OperatorHub to use a secure, public connection to complete actions such as pulling images from remote, private registries.
- Use the IBM Cloud console or CLI to ensure that a public gateway is enabled on each subnet that your cluster is attached to.
- Restart the components for the Developer catalog in the web console.
- Edit the configmap for the samples operator.
oc edit configs.samples.operator.openshift.io/cluster
- Change the value of managementState from Removed to Managed.
- Save and close the config map. Your changes are automatically applied.
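If you prefer not to open an interactive editor, the same change can be applied with a one-line patch; this is a sketch that assumes the default resource name cluster that is used in the previous step.
oc patch configs.samples.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState":"Managed"}}'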
Check that your cluster is set up properly. If you just created your cluster, wait a while for your cluster components to fully provision.
- Get the details of your cluster.
ibmcloud oc cluster get -c <cluster_name_or_ID>
- Review the output of the previous step to check the Ingress Subdomain.
- If your cluster does not have a subdomain, see No Ingress subdomain exists after cluster creation.
- If your cluster does have a subdomain, continue to the next step.
Verify that your cluster runs the latest Version. If your cluster does not run the latest version, update the cluster and worker nodes.
Update the cluster master to the latest version.
4.4:
ibmcloud oc cluster master update -c <cluster_name_or_ID> --version 4.4_openshift -f
4.3:
ibmcloud oc cluster master update -c <cluster_name_or_ID> --version 4.4_openshift -f
3.11:
ibmcloud oc cluster master update -c <cluster_name_or_ID> --version 3.11_openshift -f
List your worker nodes.
ibmcloud oc worker ls -c <cluster_name_or_ID>
- Update the worker nodes to match the cluster master version.
ibmcloud oc worker update -c <cluster_name_or_ID> -w <worker1_ID> -w <worker2_ID> -w <worker3_ID>
- Check the cluster State. If the state is not normal, see Debugging clusters.
- Check the Master health. If the state is not normal, see Reviewing master health.
- Check the worker nodes that the OpenShift components might run on. If the state is not normal, see Debugging worker nodes.
ibmcloud oc worker ls -c <cluster_name_or_ID>
- Log in to your cluster. Note that if you cannot get the login token from the OpenShift web console, you can access the cluster from the CLI.
Check the health of the OpenShift component pods that do not work.
- Check the status of the pod.
oc get pods -n <project>
If a pod is not in a Running status, describe the pod and check for the events. For example, you might see an error that the pod cannot be scheduled because of a lack of CPU or memory resources, which is common if you have a cluster with less than 3 worker nodes. Resize your worker pool and try again.
oc describe pod -n <project> <pod>
If you do not see any helpful information in the events section, check the pod logs for any error messages or other troubleshooting information.
oc logs -n <project> <pod>
- Restart the pod and check if it reaches a Running status.
oc delete pod -n <project> <pod>
If the pods are healthy, check whether other system pods are experiencing issues. Often, one component depends on another component being healthy in order to function properly. For example, the OperatorHub has a set of images that are stored in external registries such as quay.io. These images are pulled into the internal registry to use across the projects in your OpenShift cluster. If any of the OperatorHub or internal registry components are not set up properly, such as due to lack of permissions or compute resources, the OperatorHub and catalog do not display.
- Check for pending pods.
oc get pods --all-namespaces | grep Pending
Describe the pods and check for the Events.
oc describe pod -n <project_name> <pod_name>
For example, some common messages that you might see from openshift-image-registry pods include:
- A Volume could not be created error message because you created the cluster without the correct storage permission. Red Hat OpenShift on IBM Cloud clusters come with a file storage device by default to store images for the system and other pods. Revise your infrastructure permissions and restart the pod.
- An order will exceed maximum number of storage volumes allowed error message because you have exceeded the combined quota of file and block storage devices that are allowed per account. Remove unused storage devices or increase your storage quota, and restart the pod.
- A message that images cannot be stored because the file storage device is full. Resize the storage device and restart the pod.
- A Pull image still failed due to error: unauthorized: authentication required error message because the internal registry cannot pull images from an external registry. Check that the image pull secrets are set for the project and restart the pod.
- Check the Node that the failing pods run on. If all the pods run on the same worker node, the worker node might have a network connectivity issue. Reload the worker node.
ibmcloud oc worker reload -c <cluster_name_or_ID> -w <worker_node_ID>
- Check that the OpenVPN in the cluster is set up properly.
- Check that the OpenVPN pod is Running.
oc get pods -n kube-system -l app=vpn
- Check the OpenVPN logs for an ERROR message that contains WORKERIP:<port>, such as WORKERIP:10250, which indicates that the VPN tunnel does not work.
oc logs -n kube-system <vpn_pod> --tail 10
- If you see the worker IP error, check whether worker-to-worker communication is broken. Log in to a calico-node pod in the calico-system project, and check for the same WORKERIP:10250 error.
oc exec -n calico-system <calico-node_pod> -- date
- If the worker-to-worker communication is broken, make sure that you enable VRF or VLAN spanning.
- If you see a different error from either the OpenVPN or calico-node pod, restart the OpenVPN pod.
oc delete pod -n kube-system <vpn_pod>
- If the OpenVPN still fails, check the worker node that the pod runs on.
oc describe pod -n kube-system <vpn_pod> | grep "Node:"
- Cordon the worker node so that the OpenVPN pod is rescheduled to a different worker node.
oc cordon <worker_node>
- Check the OpenVPN pod logs again. If the pod no longer has an error, the worker node might have a network connectivity issue. Reload the worker node.
ibmcloud oc worker reload -c <cluster_name_or_ID> -w <worker_node_ID>
- Refresh the cluster master to set up the default OpenShift components. After you refresh the cluster, wait a few minutes to allow the operation to complete.
ibmcloud oc cluster master refresh -c <cluster_name_or_ID>
- Try to use the OpenShift component again. If the error still exists, see Feedback, questions, and support.
Common CLI issues
Review the following common reasons for CLI connection issues or command failures.
Infrastructure provider:
Classic
VPC Generation 2 compute
Firewall prevents running CLI commands
What’s happening
When you run ibmcloud, kubectl, oc, or calicoctl commands from the CLI, they fail.
Why it’s happening
You might have corporate network policies that prevent access from your local system to public endpoints via proxies or firewalls.
How to fix it
Allow TCP access for the CLI commands to work. This task requires the Administrator IBM Cloud IAM platform role for the cluster.
kubectl or oc commands do not work
What’s happening
When you run kubectl or oc commands against your cluster, your commands fail with an error message similar to the following.
No resources found. Error from server (NotAcceptable): unknown (get nodes)
invalid object doesn't have additional properties
error: No Auth Provider found for name "oidc"
Why it’s happening
You have a different version of kubectl than your cluster version. Kubernetes does not support kubectl client versions that are 2 or more versions apart from the server version (n +/- 2). If you use a community Kubernetes cluster, you might also have the OpenShift version of kubectl, which does not work with community Kubernetes clusters.
To check your client kubectl version against the cluster server version, run oc version --short.
How to fix it
Install the version of kubectl that matches the Kubernetes version of your cluster.
If you have multiple clusters at different Kubernetes versions or different container platforms such as OpenShift, download each kubectl version binary file to a separate directory. Then, you can set up an alias in your local terminal profile to point to the kubectl binary file directory that matches the kubectl version of the cluster that you want to work with, or you might be able to use a tool such as brew switch kubernetes-cli <major.minor>.
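For example, a minimal approach is to add aliases to your shell profile that point to each binary; the file paths here are hypothetical and depend on where you downloaded the binaries.
alias oc311='/usr/local/bin/oc-3.11/oc'
alias oc44='/usr/local/bin/oc-4.4/oc'
After you reload your profile, run the alias that matches the version of the cluster that you want to work with.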
Time out when trying to connect to a pod
What’s happening
You try to connect to a pod, such as logging in with oc exec or getting logs with oc logs. The pod is healthy, but you see an error message similar to the following.
Error from server: Get https://<10.xxx.xx.xxx>:<port>/<address>: dial tcp <10.xxx.xx.xxx>:<port>: connect: connection timed out
Why it’s happening
The OpenVPN server is experiencing configuration issues that prevent accessing the pod from its internal address.
How to fix it
Before beginning: Access your OpenShift cluster.
- Check whether cluster and worker node updates are available by viewing your cluster and worker node details in the console or by running the cluster ls and worker ls commands. If updates are available, update your cluster and worker nodes to the latest version.
- Restart the OpenVPN pod by deleting it. Another VPN pod is scheduled. When its STATUS is Running, try to connect to the pod that you previously could not connect to.
oc delete pod -n kube-system -l app=vpn
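To confirm that the replacement pod is running before you retry the connection, you can check the pod by the same label that is used elsewhere in this topic.
oc get pods -n kube-system -l app=vpn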
Missing projects or oc and kubectl commands fail
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
You do not see all the projects that you have access to. When you try to run oc or kubectl commands, you see an error similar to the following.
No resources found. Error from server (Forbidden): <resource> is forbidden: User "IAM#user@email.com" cannot list <resources> at the cluster scope: no RBAC policy matched
Why it’s happening
You need to download the admin configuration files for your cluster in order to run commands that require the cluster-admin cluster role.
How to fix it
Run ibmcloud oc cluster config --cluster <cluster_name_or_ID> --admin and try again.
Unable to create or delete worker nodes or clusters
You cannot perform infrastructure-related commands on your cluster, such as:
- Adding worker nodes in an existing cluster or when creating a new cluster
- Removing worker nodes
- Reloading or rebooting worker nodes
- Resizing worker pools
- Updating your cluster
- Deleting your cluster
Review the error messages in the following sections to troubleshoot infrastructure-related issues that are caused by incorrect cluster permissions, orphaned clusters in other infrastructure accounts, or a time-based one-time passcode (TOTP) on the account.
Unable to create or delete worker nodes due to permission errors
What’s happening
You cannot manage worker nodes for your cluster, and you receive an error message similar to one of the following.
We were unable to connect to your IBM Cloud infrastructure account. Creating a standard cluster requires that you have either a Pay-As-You-Go account that is linked to an IBM Cloud infrastructure account term or that you have used the IBM Cloud Kubernetes Service CLI to set your IBM Cloud Infrastructure API keys.
'Item' must be ordered with permission.
The worker node instance '<ID>' cannot be found. Review '<provider>' infrastructure user permissions.
The worker node instance cannot be found. Review '<provider>' infrastructure user permissions.
The worker node instance cannot be identified. Review '<provider>' infrastructure user permissions.
The IAM token exchange request failed with the message: <message>
IAM token exchange request failed: <message>
The cluster could not be configured with the registry. Make sure that you have the Administrator role for IBM Cloud Container Registry.
Why it’s happening
The infrastructure credentials that are set for the region and resource group are missing the appropriate infrastructure permissions. The user's infrastructure permissions are most commonly stored as an API key for the region and resource group. More rarely, if you use a different IBM Cloud account type, you might have set infrastructure credentials manually.
How to fix it
The account owner must set up the infrastructure account credentials properly. The credentials depend on what type of infrastructure account you are using.
Before you begin: Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
Identify what user credentials are used for the region and resource group's infrastructure permissions.
Check the API key for a region and resource group of the cluster.
ibmcloud oc api-key info --cluster <cluster_name_or_ID>
Example output:
Getting information about the API key owner for cluster <cluster_name>...
OK
Name          Email
<user_name>   <name@email.com>
Check if the classic infrastructure account for the region and resource group is manually set to use a different IBM Cloud infrastructure account.
ibmcloud oc credential get --region <us-south>
Example output if credentials are set to use a different classic account. In this case, the user's infrastructure credentials are used for the region and resource group that you targeted, even if a different user's credentials are stored in the API key that you retrieved in the previous step.
OK
Infrastructure credentials for user name <1234567_name@email.com> set for resource group <resource_group_name>.
Example output if credentials are not set to use a different classic account. In this case, the API key owner that you retrieved in the previous step has the infrastructure credentials that are used for the region and resource group.
FAILED
No credentials set for resource group <resource_group_name>.: The user credentials could not be found. (E0051)
Validate the infrastructure permissions that the user has.
List the suggested and required infrastructure permissions for the region and resource group.
ibmcloud oc infra-permissions get --region <region>
For console and CLI commands to assign these permissions, see Classic infrastructure roles.
- Make sure that the infrastructure credentials owner for the API key or the manually-set account has the correct permissions.
- If necessary, you can change the API key or manually-set infrastructure credentials owner for the region and resource group.
Test that the changed permissions permit authorized users to perform infrastructure operations for the cluster.
- For example, you might try to delete a worker node.
ibmcloud oc worker rm --cluster <cluster_name_or_ID> --worker <worker_node_ID>
Check whether the worker node is removed.
ibmcloud oc worker get --cluster <cluster_name_or_ID> --worker <worker_node_ID>
Example output if the worker node removal is successful. The worker get operation fails because the worker node is deleted. The infrastructure permissions are correctly set up.
FAILED
The specified worker node could not be found. (E0011)
If the worker node is not removed, review the State and Status fields and the common issues with worker nodes to continue debugging.
- If you manually set credentials and still cannot see the cluster's worker nodes in your infrastructure account, you can check whether the cluster is orphaned.
Unable to create or delete worker nodes due to incorrect account error
Infrastructure provider:
Classic
What’s happening
You cannot manage worker nodes for your cluster, or view the cluster worker nodes in your classic IBM Cloud infrastructure account. However, you can update and manage other clusters in the account.
Further, you verified that you have the proper infrastructure credentials.
You might receive an error message in your worker node status similar to the following.
Incorrect account for worker - The 'classic' infrastructure user credentials changed and no longer match the worker node instance infrastructure account.
Why it’s happening
The cluster might be provisioned in a classic IBM Cloud infrastructure account that is no longer linked to your Red Hat OpenShift on IBM Cloud account. The cluster is orphaned. Because the resources are in a different account, you do not have the infrastructure credentials to modify the resources.
Consider the following example scenario to understand how clusters might become orphaned.
- You have an IBM Cloud Pay-As-You-Go account.
- You create a cluster named Cluster1. The worker nodes and other infrastructure resources are provisioned into the infrastructure account that comes with your Pay-As-You-Go account.
- Later, you find out that your team uses a legacy or shared classic IBM Cloud infrastructure account. You use the ibmcloud oc credential set command to change the IBM Cloud infrastructure credentials to use your team account.
- You create another cluster named Cluster2. The worker nodes and other infrastructure resources are provisioned into the team infrastructure account.
- You notice that Cluster1 needs a worker node update, a worker node reload, or you just want to clean it up by deleting it. However, because Cluster1 was provisioned into a different infrastructure account, you cannot modify its infrastructure resources. Cluster1 is orphaned.
- You follow the resolution steps in the following section, but do not set your infrastructure credentials back to your team account. You can delete Cluster1, but now Cluster2 is orphaned.
- You change your infrastructure credentials back to the team account that created Cluster2. Now, you no longer have an orphaned cluster!
How to fix it
- Check which infrastructure account the region that your cluster is in currently uses to provision clusters.
- Log in to the Red Hat OpenShift on IBM Cloud clusters console.
- From the table, select your cluster.
- In the Overview tab, check for an Infrastructure User field. This field helps you determine if your Red Hat OpenShift on IBM Cloud account uses a different infrastructure account than the default.
- If you do not see the Infrastructure User field, you have a linked Pay-As-You-Go account that uses the same credentials for your infrastructure and platform accounts. The cluster that cannot be modified might be provisioned in a different infrastructure account.
- If you see an Infrastructure User field, you use a different infrastructure account than the one that came with your Pay-As-You-Go account. These different credentials apply to all clusters within the region. The cluster that cannot be modified might be provisioned in your Pay-As-You-Go or a different infrastructure account.
- Check which infrastructure account was used to provision the cluster.
- In the Worker Nodes tab, select a worker node and note its ID.
- Open the menu and click Classic Infrastructure.
- From the infrastructure navigation pane, click Devices > Device List.
- Search for the worker node ID that you previously noted.
- If you do not find the worker node ID, the worker node is not provisioned into this infrastructure account. Switch to a different infrastructure account and try again.
- Use the ibmcloud oc credential set command to change your infrastructure credentials to the account that the cluster worker nodes are provisioned in, which you found in the previous step. If you no longer have access to the infrastructure credentials, you can open an IBM Cloud support case to determine an email address for the administrator of the other infrastructure account. However, IBM Cloud Support cannot remove the orphaned cluster for you, and you must contact the administrator of the other account to get the infrastructure credentials.
- Delete the cluster.
- If you want, reset the infrastructure credentials to the previous account. Note that if you created clusters with a different infrastructure account than the account that you switch to, you might orphan those clusters.
- If you did not see the Infrastructure User field in step 1, you can use the ibmcloud oc credential unset --region <region> command to resume using the default credentials that come with your IBM Cloud Pay-As-You-Go account.
- If you did see the Infrastructure User field in step 1, you can use the ibmcloud oc credential set command to set credentials to that infrastructure account.
Unable to create or delete worker nodes due to paid account error
Infrastructure provider:
Classic
What’s happening
You cannot manage worker nodes for your cluster, and you receive an error message similar to one of the following.
Unable to connect to the IBM Cloud account. Ensure that you have a paid account.
Why it’s happening
Your IBM Cloud account uses its own automatically linked infrastructure through a Pay-As-You-Go account. However, the account administrator enabled the time-based one-time passcode (TOTP) option so that users are prompted for a TOTP at login. This type of multifactor authentication (MFA) is account-based and affects all access to the account. TOTP MFA also affects the access that IBM Cloud Kubernetes Service requires to make calls to IBM Cloud infrastructure. If TOTP is enabled for the account, you cannot create and manage clusters and worker nodes in IBM Cloud Kubernetes Service.
How to fix it
The IBM Cloud account owner or an account administrator must either:
- Disable TOTP for the account, and continue to use the automatically linked infrastructure credentials for IBM Cloud Kubernetes Service.
- Continue to use TOTP, but create an infrastructure API key that IBM Cloud Kubernetes Service can use to make direct calls to the IBM Cloud infrastructure API.
To disable TOTP MFA for the account:
- Log in to the IBM Cloud console. From the menu bar, select Manage > Access (IAM).
- In the left navigation, click the Settings page.
- Under Multifactor authentication, click Edit.
- Select None, and click Update.
To use TOTP MFA and create an infrastructure API key for IBM Cloud Kubernetes Service:
- From the IBM Cloud console, select Manage > Access (IAM) > Users and click the name of the account owner. Note: If you do not use the account owner's credentials, first ensure that the user whose credentials you use has the correct permissions.
- In the API Keys section, find or create a classic infrastructure API key.
- Use the infrastructure API key to set the infrastructure API credentials for IBM Cloud Kubernetes Service. Repeat this command for each region where you create clusters.
ibmcloud oc credential set classic --infrastructure-username <infrastructure_API_username> --infrastructure-api-key <infrastructure_API_authentication_key> --region <region>
- Verify that the correct credentials are set.
ibmcloud oc credential get --region <region>
Example output:
Infrastructure credentials for user name user@email.com set for resource group default.
- To ensure that existing clusters use the updated infrastructure API credentials, run ibmcloud oc api-key reset --region <region> in each region where you have clusters.
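For example, if you have clusters in two regions, you might run the reset once per region; the region names here are examples only.
ibmcloud oc api-key reset --region us-south
ibmcloud oc api-key reset --region eu-de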
Unable to create a cluster in the console due to No VPC is available error
Infrastructure provider:
VPC Generation 2 compute
What’s happening
You try to create a VPC cluster by using the Red Hat OpenShift on IBM Cloud console. You have an existing VPC for Generation 1 compute in your account, but when you try to select an existing Virtual Private Cloud to create the cluster in, you see the following error message:
No VPC is available. Create a VPC.
Why it’s happening
During cluster creation, the Red Hat OpenShift on IBM Cloud console uses the API key that is set for the default resource group to list the VPCs that are available in your IBM Cloud account. If no API key is set for the default resource group, no VPCs are listed in the Red Hat OpenShift on IBM Cloud console, even if your VPC exists in a different resource group and an API key is set for that resource group.
How to fix it
To set an API key for the default resource group, use the Red Hat OpenShift on IBM Cloud CLI.
Log in to the terminal as the account owner. If you want a different user than the account owner to set the API key, first ensure that the API key owner has the correct permissions.
ibmcloud login [--sso]
Target the default resource group.
ibmcloud target -g default
Set the API key for the region and resource group.
ibmcloud oc api-key reset --region <region>
In the Red Hat OpenShift on IBM Cloud console, click Refresh VPCs. Your available VPCs are now listed in a drop-down menu.
Cluster create error about cloud object storage bucket
Infrastructure provider:
VPC Generation 2 compute
What’s happening
When you create a cluster, you see an error message similar to the following.
Could not store the cloud object storage bucket and IAM service key.
Could not find the specified cloud object storage instance.
Could not create an IAM service key to access the cloud object storage bucket '{{.Name}}'.
Could not create a bucket in your cloud object storage instance.
Your cluster is created, but the internal registry is not backed up to cloud object storage. For more information, see 'http://ibm.biz/roks_cos_ts'.
Why it’s happening
When you create a Red Hat OpenShift on IBM Cloud version 4 cluster on VPC Generation 2 compute infrastructure, a bucket is automatically created in a standard IBM Cloud Object Storage instance that you select in your account. However, the bucket might not be created for several reasons, such as:
- IBM Cloud Object Storage is temporarily unavailable.
- No standard IBM Cloud Object Storage instance exists in your account.
- The person who created your cluster did not have the Administrator platform role to IBM Cloud Object Storage in IAM.
- The service failed to set up service key access to the object storage instance, such as if the API key lacks permissions or IBM Cloud IAM is unavailable.
- Other conflicts, such as naming conflicts that exhaust the preset number of retries, or failures to save the bucket and service key data in the backend service.
Your cluster is still created, but the internal registry is not backed up to IBM Cloud Object Storage. Instead, data is saved to the emptyDir directory on the local worker nodes, which is not persistent storage.
How to fix it
Manually set up your cluster to back up the internal registry to an IBM Cloud Object Storage bucket.
- Log in to your account. If applicable, target the appropriate resource group. Set the context for your cluster.
- If corporate network policies prevent access from the local system to public endpoints via proxies or firewalls, allow access to the IBM Cloud Object Storage subdomain.
- Create a standard IBM Cloud Object Storage service, at least one bucket, and HMAC service credentials.
Create a Kubernetes secret in the openshift-image-registry namespace that uses your COS access_key_id and secret_access_key.
oc create secret generic image-registry-private-configuration-user --from-literal=REGISTRY_STORAGE_S3_ACCESSKEY=<access_key_id> --from-literal=REGISTRY_STORAGE_S3_SECRETKEY=<secret_access_key> --namespace openshift-image-registry
Edit the OpenShift Registry Operator to use IBM Cloud Object Storage as a backing store.
oc edit configs.imageregistry.operator.openshift.io/cluster
Add the following parameters to the spec section of the configmap, then save and close the file. To pick up the configuration change, the openshift-image-registry pods automatically restart.
storage:
  s3:
    bucket: <bucket_name> # Example: my-bucket
    encrypt: false
    keyID: ""
    region: <region> # Example: us-east
    regionEndpoint: s3.<region>.cloud-object-storage.appdomain.cloud
Verify that the internal registry images are backed up to IBM Cloud Object Storage.
- Build an image for your app and push it to IBM Cloud Container Registry.
- Import the image into your internal OpenShift registry.
- Deploy an app that references your image.
- From the IBM Cloud console resource list, select your Cloud Object Storage instance.
- From the menu, click Buckets, then click the bucket that you used for your Red Hat OpenShift on IBM Cloud cluster.
- Review the recent Objects to see your backed up images from the internal registry of your Red Hat OpenShift on IBM Cloud cluster.
Cluster create error cannot pull images from IBM Cloud Container Registry
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
When you created a cluster, you received an error message similar to the following.
Your cluster cannot pull images from the IBM Cloud Container Registry 'icr.io' domains because an IAM access policy could not be created. Make sure that you have the IAM Administrator platform role to IBM Cloud Container Registry. Then, create an image pull secret with IAM credentials to the registry by running 'ibmcloud ks cluster pull-secret apply'.
Why it’s happening
During cluster creation, a service ID is created for your cluster and assigned the Reader service access policy to IBM Cloud Container Registry. Then, an API key for this service ID is generated and stored in an image pull secret to authorize the cluster to pull images from IBM Cloud Container Registry.
To successfully assign the Reader service access policy to the service ID during cluster creation, you must have the Administrator platform access policy to IBM Cloud Container Registry.
How to fix it
- Make sure that the account owner gives you the Administrator role to IBM Cloud Container Registry.
ibmcloud iam user-policy-create <your_user_email> --service-name container-registry --roles Administrator
- Use the ibmcloud oc cluster pull-secret apply command to re-create an image pull secret with the appropriate registry credentials.
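To verify the result, you can list the image pull secrets in the default project after the command completes; the exact secret names vary by cluster, but they typically contain icr-io.
oc get secrets -n default | grep icr-io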
Cluster cannot update because of broken webhook
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
During a master operation such as updating your cluster version, the cluster had a broken webhook application. Now, master operations cannot complete. You see an error similar to the following:
Cannot complete cluster master operations because the cluster has a broken webhook application. For more information, see the troubleshooting docs: 'https://ibm.biz/master_webhook'
Why it’s happening
Your cluster has configurable Kubernetes webhook resources (validating or mutating admission webhooks) that can intercept and modify requests from various services in the cluster to the API server in the cluster master. Because webhooks can change or reject requests, broken webhooks can impact the functionality of the cluster in various ways, such as preventing you from updating the master version or performing other maintenance operations. For more information, see Dynamic Admission Control in the Kubernetes documentation.
Potential causes for broken webhooks include:
- The underlying resource that issues the request is missing or unhealthy, such as a Kubernetes service, endpoint, or pod.
- The webhook is part of an add-on or other plug-in application that did not install correctly or is unhealthy.
- Your cluster might have a networking connectivity issue that prevents the webhook from communicating with the Kubernetes API server in the cluster master.
How to fix it
Identify and restore the resource that causes the broken webhook.
Create a test pod to get an error that identifies the broken webhook. The error message might have the name of the broken webhook.
oc run webhook-test --generator=run-pod/v1 --image pause:latest
In the following example, the webhook is trust.hooks.securityenforcement.admission.cloud.ibm.com.
Error from server (InternalError): Internal error occurred: failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": Post https://ibmcloud-image-enforcement.ibm-system.svc:443/mutating-pods?timeout=30s: dial tcp 172.21.xxx.xxx:443: connect: connection timed out
Get the name of the broken webhook.
If the error message has a broken webhook, replace trust.hooks.securityenforcement.admission.cloud.ibm.com with the broken webhook that you previously identified.
oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{.items[?(@.webhooks[*].name=="trust.hooks.securityenforcement.admission.cloud.ibm.com")].metadata.name}{"\n"}'
Example output:
image-admission-config
- If the error does not have a broken webhook, list all the webhooks in your cluster and check their configurations in the following steps.
oc get mutatingwebhookconfigurations,validatingwebhookconfigurations
Review the service and location details of the mutating or validating webhook configuration in the clientConfig section in the output of the following command. Replace image-admission-config with the name that you previously identified. If the webhook exists outside the cluster, contact the cluster owner to check the webhook status.
oc get mutatingwebhookconfiguration image-admission-config -o yaml
oc get validatingwebhookconfigurations image-admission-config -o yaml
Example output:
clientConfig:
  caBundle: <redacted>
  service:
    name: <name>
    namespace: <namespace>
    path: /inject
    port: 443
Optional: Back up the webhooks, especially if you do not know how to reinstall the webhook.
oc get mutatingwebhookconfiguration <name> -o yaml > mutatingwebhook-backup.yaml
oc get validatingwebhookconfiguration <name> -o yaml > validatingwebhook-backup.yaml
- Check the status of the related service and pods for the webhook.
- Check the service Type, Selector, and Endpoint fields.
oc describe service -n <namespace> <service_name>
- If the service type is ClusterIP, check that the OpenVPN pod is in a Running status so that the webhook can connect securely to the Kubernetes API in the cluster master. If the pod is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.
oc describe pods -n kube-system -l app=vpn
- If the service does not have an endpoint, check the health of the backing resources, such as a deployment or pod. If the resource is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.
oc get all -n my-service-namespace -l <key=value>
- If the service does not have any backing resources, or if troubleshooting the pods does not resolve the issue, remove the webhook.
oc delete mutatingwebhookconfiguration <name>
- Retry the cluster master operation, such as updating the cluster.
- If you still see the error, you might have worker node or network connectivity issues.
- Worker node troubleshooting.
- Make sure that the webhook can connect to the Kubernetes API server in the cluster master. For example, if you use Calico network policies, security groups, or some other type of firewall, set up your classic or VPC cluster with the appropriate access.
- If the webhook is managed by an add-on that you installed, uninstall the add-on. Common add-ons that cause webhook issues include the following:
- Re-create the webhook or reinstall the add-on.
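If you removed a webhook and later need to restore it, and you created backup files in the earlier optional step, a minimal approach is to reapply the saved YAML file.
oc apply -f mutatingwebhook-backup.yaml
oc apply -f validatingwebhook-backup.yaml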
Cluster remains in a pending state
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
When you deploy your cluster, it remains in a pending state and doesn't start.
Why it’s happening
If you just created the cluster, the worker nodes might still be configuring. If you already waited for a while, you might have an invalid VLAN.
You can try one of the following solutions:
- Check the status of your cluster by running ibmcloud oc cluster ls. Then, check to be sure that your worker nodes are deployed by running ibmcloud oc worker ls --cluster <cluster_name>.
- Check to see whether your VLAN is valid. To be valid, a VLAN must be associated with infrastructure that can host a worker with local disk storage. You can list your VLANs by running ibmcloud oc vlan ls --zone <zone>. If the VLAN does not show in the list, it is not valid. Choose a different VLAN.
Unable to view or work with a cluster
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
- You are not able to find a cluster. When you run ibmcloud oc cluster ls, the cluster is not listed in the output.
- You are not able to work with a cluster. When you run ibmcloud oc cluster config or other cluster-specific commands, the cluster is not found.
Why it’s happening
In IBM Cloud, each resource must be in a resource group. For example, cluster mycluster might exist in the default resource group. When the account owner gives you access to resources by assigning you an IBM Cloud IAM platform role, the access can be to a specific resource or to the resource group. When you are given access to a specific resource, you don't have access to the resource group. In this case, you don't need to target a resource group to work with the clusters that you have access to. If you target a different resource group than the group that the cluster is in, actions against that cluster can fail. Conversely, when you are given access to a resource as part of your access to a resource group, you must target the resource group to work with a cluster in that group. If you don't target your CLI session to the resource group that the cluster is in, actions against that cluster can fail.
If you cannot find or work with a cluster, you might be experiencing one of the following issues:
- You have access to the cluster and the resource group that the cluster is in, but your CLI session is not targeted to the resource group that the cluster is in.
- You have access to the cluster, but not as part of the resource group that the cluster is in. Your CLI session is targeted to this or another resource group.
- You don't have access to the cluster.
How to fix it
To check your user access permissions:
List all of your user permissions.
ibmcloud iam user-policies <your_user_name>
Check whether you have access to the cluster and to the resource group that the cluster is in.
- Look for a policy that has a Resource Group Name value of the cluster's resource group and a Memo value of Policy applies to the resource group. If you have this policy, you have access to the resource group. For example, this policy indicates that a user has access to the test-rg resource group:
Policy ID: 3ec2c069-fc64-4916-af9e-e6f318e2a16c
Roles: Viewer
Resources:
  Resource Group ID     50c9b81c983e438b8e42b2e8eca04065
  Resource Group Name   test-rg
  Memo                  Policy applies to the resource group
- Look for a policy that has a Resource Group Name value of the cluster's resource group, a Service Name value of containers-kubernetes or no value, and a Memo value of Policy applies to the resource(s) within the resource group. If you have this policy, you have access to clusters or to all resources within the resource group. For example, this policy indicates that a user has access to clusters in the test-rg resource group:
Policy ID: e0ad889d-56ba-416c-89ae-a03f3cd8eeea
Roles: Administrator
Resources:
  Resource Group ID     a8a12accd63b437bbd6d58fb6a462ca7
  Resource Group Name   test-rg
  Service Name          containers-kubernetes
  Service Instance
  Region
  Resource Type
  Resource
  Memo                  Policy applies to the resource(s) within the resource group
- If you have both of these policies, skip to Step 4, first bullet. If you don't have the policy from Step 2a, but you do have the policy from Step 2b, skip to Step 4, second bullet. If you do not have either of these policies, continue to Step 3.
Check whether you have access to the cluster, but not as part of access to the resource group that the cluster is in.
- Look for a policy that has no values besides the Policy ID and Roles fields. If you have this policy, you have access to the cluster as part of access to the entire account. For example, this policy indicates that a user has access to all resources in the account:
Policy ID: 8898bdfd-d520-49a7-85f8-c0d382c4934e
Roles: Administrator, Manager
Resources:
  Service Name
  Service Instance
  Region
  Resource Type
  Resource
- Look for a policy that has a Service Name value of containers-kubernetes and a Service Instance value of the cluster's ID. You can find a cluster ID by running ibmcloud oc cluster get --cluster <cluster_name>. For example, this policy indicates that a user has access to a specific cluster:
Policy ID: 140555ce-93ac-4fb2-b15d-6ad726795d90
Roles: Administrator
Resources:
  Service Name       containers-kubernetes
  Service Instance   df253b6025d64944ab99ed63bb4567b6
  Region
  Resource Type
  Resource
- If you have either of these policies, skip to the second bullet point of step 4. If you do not have either of these policies, skip to the third bullet point of step 4.
Depending on your access policies, choose one of the following options.
If you have access to the cluster and to the resource group that the cluster is in:
Target the resource group. Note: You can't work with clusters in other resource groups until you untarget this resource group.
ibmcloud target -g <resource_group>
Target the cluster.
ibmcloud oc cluster config --cluster <cluster_name_or_ID>
If you have access to the cluster but not to the resource group that the cluster is in:
Do not target a resource group. If you already targeted a resource group, untarget it:
ibmcloud target --unset-resource-group
Target the cluster.
ibmcloud oc cluster config --cluster <cluster_name_or_ID>
If you do not have access to the cluster:
- Ask your account owner to assign an IBM Cloud IAM platform role to you for that cluster.
- Do not target a resource group. If you already targeted a resource group, untarget it:
ibmcloud target --unset-resource-group
- Target the cluster.
ibmcloud oc cluster config --cluster <cluster_name_or_ID>
No resources found
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
When you run an oc command such as oc get nodes or oc get secrets, you see an error message similar to the following.
No resources found.
Why it’s happening
Your OpenShift token is expired. OpenShift tokens that are generated by using your IBM Cloud IAM credentials expire after 24 hours.
How to fix it
Re-authenticate with the OpenShift token by copying the oc login command from the web console or creating an API key.
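For example, if you created an API key, you can log back in from the CLI; this sketch assumes that your kubeconfig already targets the cluster, such as after you run ibmcloud oc cluster config.
oc login -u apikey -p <API_key>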
VPN server error due to infrastructure credentials
Infrastructure provider:
Classic
VPC Generation 2 compute
What’s happening
After you create or update a cluster, the master status returns a VPN server configuration error message similar to the following.
VPN server configuration update failed. IBM Cloud support has been notified and is working to resolve this issue.
Why it’s happening
The infrastructure credentials that are associated with the resource group that the cluster is created in are missing (for example, the API key owner is no longer part of the account) or are missing required permissions.
How to fix it
Complete the troubleshooting guide to check and update the infrastructure credentials that are used for the resource group.