
Node Health Monitoring

The node health matrix provides comprehensive monitoring of individual cluster nodes with detailed status cards, system information, and real-time resource usage tracking.

Node Health Matrix Overview

Individual Node Cards

Each cluster node is displayed as a detailed status card containing comprehensive health and system information:

![Figure needed]

Screenshot of node health matrix showing multiple node cards

Card Layout

  • Grid Display: Nodes arranged in responsive grid layout
  • Consistent Design: Uniform card design across all nodes
  • Real-time Updates: All cards update simultaneously every 30 seconds
  • Interactive Elements: Expandable sections for detailed information

Node Card Components

Header Information

![Figure needed]

Screenshot of node card header showing basic node information

Node Identification

  • Node Name: Kubernetes node identifier (e.g., "worker-node-1")
  • Operating System: OS type and architecture (e.g., "linux/amd64")
  • Instance Type: vCloud instance configuration (e.g., "standard-4vcpu-8gb")
  • Uptime: Duration the node has been running (e.g., "3 days, 14 hours")
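
The uptime string can be derived from a start timestamp. This sketch assumes the card measures from a timestamp such as the Node's `metadata.creationTimestamp` (the page does not say which timestamp is used) and, like the example, shows only whole days and hours. `format_uptime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def format_uptime(started_at: datetime, now: datetime) -> str:
    """Render elapsed time as "D days, H hours" (hypothetical helper)."""
    elapsed = now - started_at
    days = elapsed.days
    hours = elapsed.seconds // 3600  # whole hours within the current day
    day_label = "day" if days == 1 else "days"
    hour_label = "hour" if hours == 1 else "hours"
    return f"{days} {day_label}, {hours} {hour_label}"


start = datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)
print(format_uptime(start, start + timedelta(days=3, hours=14, minutes=20)))
# prints "3 days, 14 hours"
```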

Health Status Section

Status Indicators

Visual health indicators with clear iconography:

![Figure needed]

Screenshot showing different node health states

  • ✅ Ready: Node is healthy and accepting pods

    • Green indicator: All systems operational
    • Available for scheduling: Can accept new pods
    • All checks passing: System health checks successful
  • ⚠️ Warning: Node has issues but is functional

    • Yellow indicator: Some issues detected
    • Limited functionality: May have reduced capacity
    • Monitoring required: Needs attention but operational
  • ❌ Not Ready: Node is not available for scheduling

    • Red indicator: Critical issues present
    • No new pods: Cannot accept new workloads
    • Investigation needed: Requires immediate attention
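
One plausible way to map Kubernetes node conditions onto the three card states is sketched below. The rule is an assumption for illustration, not the dashboard's documented logic: a Ready condition that is not "True" means Not Ready; Ready with any pressure-style condition set to "True" means Warning; otherwise the node is Ready. Conditions use the Kubernetes `type`/`status` shape, and `node_health` is a hypothetical name.

```python
# Conditions treated as "functional but degraded" (assumed for this sketch).
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable")


def node_health(conditions: list) -> str:
    """Classify a node as 'Ready', 'Warning', or 'Not Ready' (assumed rule)."""
    status = {c["type"]: c["status"] for c in conditions}
    # A missing, False, or Unknown Ready condition means no new pods.
    if status.get("Ready") != "True":
        return "Not Ready"
    # Ready, but at least one pressure condition is active.
    if any(status.get(t) == "True" for t in PRESSURE_TYPES):
        return "Warning"
    return "Ready"


healthy = [{"type": "Ready", "status": "True"}, {"type": "MemoryPressure", "status": "False"}]
print(node_health(healthy))  # prints "Ready"
```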

Health Status Details

  • Status Tooltips: Hover for detailed status information
  • Condition Explanations: Clear explanation of current state
  • Time Information: When status last changed
  • Impact Assessment: What the status means for workloads

Resource Usage Monitoring

CPU Usage Display

![Figure needed]

Screenshot of CPU usage display with progress bar and details

Information Shown:

  • Current Usage: Cores and percentage (e.g., "2.1 cores (52.5%)")
  • Visual Progress Bar: Color-coded utilization indicator
  • Allocatable Capacity: Total available CPU cores
  • Usage Trend: Indicators for increasing/decreasing usage

Color Coding:

  • Green (0-70%): Healthy CPU utilization
  • Orange (70-85%): High utilization, monitor closely
  • Red (85%+): Critical utilization, action needed
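
The label and color band can be computed as follows. This sketch assumes usage is divided by the node's allocatable CPU and that each boundary belongs to the higher band (exactly 70% is orange, exactly 85% is red); the page does not state which side the boundaries fall on. `cpu_display` and `usage_color` are hypothetical names.

```python
def usage_color(percent: float) -> str:
    """Map utilization to a color band (boundaries assumed to round upward)."""
    if percent >= 85:
        return "red"
    if percent >= 70:
        return "orange"
    return "green"


def cpu_display(used_cores: float, allocatable_cores: float) -> tuple:
    """Return the card label, e.g. "2.1 cores (52.5%)", and its color band."""
    percent = used_cores / allocatable_cores * 100
    label = f"{used_cores:.1f} cores ({percent:.1f}%)"
    return label, usage_color(percent)


# A 4-vCPU node using 2.1 cores, matching the example above.
print(cpu_display(2.1, 4))  # prints ('2.1 cores (52.5%)', 'green')
```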

Memory Usage Display

![Figure needed]

Screenshot of memory usage with detailed breakdown

Information Shown:

  • Current Usage: Memory and percentage (e.g., "6.2 GB (77.5%)")
  • Visual Progress Bar: Same color scheme as CPU
  • Total Memory: Total node memory capacity
  • Memory Breakdown: Used vs. available memory

Memory Categories:

  • Used: Currently allocated memory
  • Available: Free memory for new allocations
  • Reserved: System reserved memory
  • Total: Complete node memory capacity
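
The four memory categories relate arithmetically. A minimal sketch, assuming Available is whatever remains of Total after Reserved and Used (the page does not define the exact formula) and that the displayed percentage is Used divided by Total, which reproduces the "6.2 GB (77.5%)" example on an 8 GB node. `MemoryBreakdown` is a hypothetical type and the 0.5 GB reservation is sample data.

```python
from dataclasses import dataclass


@dataclass
class MemoryBreakdown:
    total_gb: float     # complete node memory capacity
    reserved_gb: float  # system-reserved memory
    used_gb: float      # currently allocated memory

    @property
    def available_gb(self) -> float:
        # Assumed: free memory is what is left after reservations and usage.
        return max(self.total_gb - self.reserved_gb - self.used_gb, 0.0)

    @property
    def used_percent(self) -> float:
        # Assumed: the card's percentage is relative to total capacity.
        return self.used_gb / self.total_gb * 100

    def label(self) -> str:
        return f"{self.used_gb:.1f} GB ({self.used_percent:.1f}%)"


mem = MemoryBreakdown(total_gb=8.0, reserved_gb=0.5, used_gb=6.2)
print(mem.label())                 # prints "6.2 GB (77.5%)"
print(round(mem.available_gb, 1))  # prints 1.3
```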

Pod Allocation Section

Pod Count Information

![Figure needed]

Screenshot of pod allocation display showing current vs capacity

Metrics Displayed:

  • Running Pods: Current number of pods on the node
  • Pod Capacity: Maximum pods the node can handle
  • Utilization Percentage: Pod allocation as percentage of capacity
  • Allocation Efficiency: How effectively pods are distributed

Load Indicators

  • High Load: Node approaching pod capacity
  • Medium Load: Balanced pod allocation
  • Low Load: Node has significant pod capacity available
  • Optimization: Indicators for rebalancing opportunities
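
The utilization percentage is running pods divided by pod capacity. In Kubernetes, a node's pod limit appears in `status.capacity.pods` and `status.allocatable.pods` (the kubelet default is 110). The High/Medium/Low thresholds below (80% and 30%) are illustrative assumptions, since the page does not specify them, and `pod_load` is a hypothetical name.

```python
def pod_load(running: int, capacity: int, high: float = 80.0, low: float = 30.0) -> tuple:
    """Return pod utilization (%) and a load level; thresholds are assumptions."""
    if capacity <= 0:
        raise ValueError("pod capacity must be positive")
    percent = running / capacity * 100
    if percent >= high:
        level = "High"
    elif percent < low:
        level = "Low"
    else:
        level = "Medium"
    return round(percent, 1), level


print(pod_load(42, 110))  # prints (38.2, 'Medium')
```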

System Information (Expandable)

Accessing Detailed Information

Click the section header or its expand button to view detailed system information:

![Figure needed]

Screenshot of expanded system information section

Software Information

  • Kubernetes Version: Node's Kubernetes version (e.g., "v1.32.0")
  • Container Runtime: Runtime version (e.g., "containerd 1.7.2")
  • Kernel Version: Operating system kernel version
  • OS Distribution: Linux distribution and version

Hardware Information

  • Instance Details: vCloud instance type and specifications
  • CPU Architecture: Processor architecture (e.g., amd64/x86_64, arm64)
  • Memory Configuration: Total memory and allocation
  • Storage Configuration: Local storage details

Location Information

  • Region: Geographic region placement
  • Availability Zone: Specific zone within region
  • Resource Group: Associated vCloud resource group
  • Network: Network configuration and connectivity

System Identifiers

  • Hostname: System hostname
  • Machine ID: Unique machine identifier
  • Boot ID: Current boot session identifier
  • System UUID: Hardware system UUID
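
Most of these fields exist on the standard Kubernetes Node object, so a reasonable assumption is that the dashboard reads them from there. The sketch maps a Node (its JSON form as a Python dict) onto card fields using the `status.nodeInfo` fields, `status.addresses`, and the well-known `node.kubernetes.io/instance-type`, `topology.kubernetes.io/region`, and `topology.kubernetes.io/zone` labels, which are only present if the provider sets them. The vCloud resource group and network details are provider-specific and not covered. `system_info` is a hypothetical name and the sample node is made-up data.

```python
def system_info(node: dict) -> dict:
    """Extract card fields from a Kubernetes Node object (JSON as a dict)."""
    status = node.get("status", {})
    info = status.get("nodeInfo", {})
    labels = node.get("metadata", {}).get("labels", {})
    addresses = status.get("addresses", [])
    # The hostname is listed in status.addresses with type "Hostname".
    hostname = next((a["address"] for a in addresses if a.get("type") == "Hostname"), None)
    return {
        "kubernetes_version": info.get("kubeletVersion"),
        "container_runtime": info.get("containerRuntimeVersion"),
        "kernel_version": info.get("kernelVersion"),
        "os_distribution": info.get("osImage"),
        "platform": f'{info.get("operatingSystem")}/{info.get("architecture")}',
        "instance_type": labels.get("node.kubernetes.io/instance-type"),
        "region": labels.get("topology.kubernetes.io/region"),
        "zone": labels.get("topology.kubernetes.io/zone"),
        "hostname": hostname,
        "machine_id": info.get("machineID"),
        "boot_id": info.get("bootID"),
        "system_uuid": info.get("systemUUID"),
    }


node = {
    "metadata": {"labels": {"topology.kubernetes.io/zone": "zone-a"}},
    "status": {
        "nodeInfo": {"kubeletVersion": "v1.32.0", "operatingSystem": "linux", "architecture": "amd64"},
        "addresses": [{"type": "Hostname", "address": "worker-node-1"}],
    },
}
card = system_info(node)
print(card["platform"], card["hostname"])  # prints "linux/amd64 worker-node-1"
```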

Node Conditions Monitoring

Health Conditions

Detailed health condition indicators:

![Figure needed]

Screenshot of node conditions section showing various health checks

Condition Types

  • Memory Pressure: Available memory status

    • False: Sufficient memory available
    • True: Memory pressure detected, may affect scheduling
  • Disk Pressure: Available disk space status

    • False: Adequate disk space
    • True: Disk space low; new pods may not be scheduled and the kubelet may evict pods to reclaim space
  • PID Pressure: Process ID availability status

    • False: Sufficient PIDs available
    • True: Process limit approaching
  • Network Connectivity: Network health status

    • Ready: Network connectivity functional
    • Issues: Network problems detected

Condition Details

  • Status: Current condition state (True/False/Unknown)
  • Last Transition: When condition last changed
  • Reason: Explanation for current state
  • Message: Detailed information about condition
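
Each entry in a Kubernetes Node's `status.conditions` carries `type`, `status`, `lastTransitionTime`, `reason`, and `message`, which line up with the details above. In the Kubernetes API the network condition is named `NetworkUnavailable`, where "True" signals a problem. The sketch below, assuming conditions arrive in that shape, lists problem conditions that are True or Unknown, most recent change first. `active_problems` is a hypothetical name and the sample conditions are illustrative.

```python
from datetime import datetime

PROBLEM_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def active_problems(conditions: list) -> list:
    """Summarize problem conditions that are True or Unknown, newest change first."""
    flagged = [
        c for c in conditions
        if c["type"] in PROBLEM_TYPES and c["status"] in ("True", "Unknown")
    ]
    # lastTransitionTime is RFC 3339, e.g. "2025-01-04T10:15:00Z".
    flagged.sort(
        key=lambda c: datetime.fromisoformat(c["lastTransitionTime"].replace("Z", "+00:00")),
        reverse=True,
    )
    return [
        f'{c["type"]}={c["status"]} since {c["lastTransitionTime"]}: '
        f'{c.get("reason", "")} ({c.get("message", "")})'
        for c in flagged
    ]


conditions = [
    {"type": "MemoryPressure", "status": "False", "lastTransitionTime": "2025-01-01T08:00:00Z",
     "reason": "KubeletHasSufficientMemory", "message": "kubelet has sufficient memory available"},
    {"type": "DiskPressure", "status": "True", "lastTransitionTime": "2025-01-04T10:15:00Z",
     "reason": "KubeletHasDiskPressure", "message": "kubelet has disk pressure"},
]
problems = active_problems(conditions)
for line in problems:
    print(line)
```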

Real-time Monitoring Features

Data Refresh Behavior

  • Automatic Updates: All node cards update every 30 seconds
  • Synchronized Updates: All nodes update simultaneously
  • Background Processing: Updates don't interrupt user interaction
  • Status Consistency: Ensures consistent view across all nodes
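
Synchronized updates can be achieved by fetching every node in one request per cycle and rendering all cards from that single snapshot, so no card shows data from a different moment than its neighbors. A minimal polling sketch; `fetch_nodes` and `render` are hypothetical callbacks, the 30-second interval comes from the text above, and the loop is synchronous for clarity, whereas a browser dashboard would run it on its own timer without blocking the UI.

```python
import time
from typing import Callable, Optional


def poll_nodes(
    fetch_nodes: Callable[[], list],
    render: Callable[[list], None],
    interval_s: float = 30.0,
    cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Refresh every card from one snapshot per cycle (hypothetical loop)."""
    done = 0
    while cycles is None or done < cycles:
        started = time.monotonic()
        snapshot = fetch_nodes()  # one request covering all nodes
        render(snapshot)          # all cards update from the same data
        done += 1
        if cycles is not None and done >= cycles:
            break
        # Keep a steady cadence regardless of how long the fetch took.
        sleep(max(0.0, interval_s - (time.monotonic() - started)))


frames = []
poll_nodes(lambda: [{"name": "worker-node-1"}], frames.append, cycles=3, sleep=lambda s: None)
print(len(frames))  # prints 3
```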

Performance Indicators

  • Resource Trends: Visual indicators for usage trends
  • Capacity Warnings: Alerts for approaching limits
  • Health Notifications: Clear indicators for health changes
  • Load Distribution: Visual representation of cluster load balance

Using Node Health Information

Daily Operations

  1. Health Overview: Quick scan of all node health indicators
  2. Resource Check: Verify no nodes are resource-constrained
  3. Capacity Planning: Identify nodes approaching capacity limits
  4. Issue Detection: Spot problematic nodes requiring attention
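
The daily check amounts to one pass over per-node summaries that flags anything not Ready or in the red band. This sketch assumes each summary already carries a status and CPU/memory percentages under the hypothetical field names shown; the only threshold is the 85% red-band cutoff described earlier. `needs_attention` is a hypothetical name and the nodes are sample data.

```python
def needs_attention(nodes: list, critical: float = 85.0) -> list:
    """Return 'name: reason' entries for nodes that warrant a closer look."""
    findings = []
    for n in nodes:
        if n["status"] != "Ready":
            findings.append(f'{n["name"]}: status {n["status"]}')
        for resource in ("cpu_percent", "memory_percent"):
            if n[resource] >= critical:
                findings.append(f'{n["name"]}: {resource} at {n[resource]:.1f}%')
    return findings


nodes = [
    {"name": "worker-node-1", "status": "Ready", "cpu_percent": 52.5, "memory_percent": 77.5},
    {"name": "worker-node-2", "status": "Warning", "cpu_percent": 91.0, "memory_percent": 60.0},
]
print(needs_attention(nodes))
# prints ['worker-node-2: status Warning', 'worker-node-2: cpu_percent at 91.0%']
```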

Troubleshooting

  1. Problem Identification: Find nodes with health issues
  2. Resource Diagnosis: Identify resource pressure points
  3. Capacity Analysis: Understand resource utilization patterns
  4. Performance Investigation: Deep dive into node performance

Capacity Planning

  1. Usage Patterns: Analyze resource usage across nodes
  2. Load Distribution: Ensure even workload distribution
  3. Scaling Decisions: Identify when to add or remove nodes
  4. Optimization: Find opportunities for better resource utilization
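
For the load-distribution and scaling steps, one common approach is to look at the cluster-wide average utilization and at how far apart the busiest and idlest nodes are. The sketch reports the mean, that spread, the busiest node, and a coarse scaling hint. The 75% scale-out and 30% scale-in thresholds are illustrative assumptions, not recommendations from this documentation, and `distribution_report` is a hypothetical name.

```python
from statistics import mean


def distribution_report(cpu_percents: dict, scale_out_at: float = 75.0,
                        scale_in_at: float = 30.0) -> dict:
    """Summarize per-node CPU utilization across the cluster."""
    if not cpu_percents:
        raise ValueError("no nodes to analyze")
    values = list(cpu_percents.values())
    avg = mean(values)
    spread = max(values) - min(values)  # large spread suggests uneven placement
    if avg >= scale_out_at:
        hint = "consider adding nodes"
    elif avg <= scale_in_at and len(values) > 1:
        hint = "consider removing a node"
    else:
        hint = "capacity looks adequate"
    busiest = max(cpu_percents, key=cpu_percents.get)
    return {"mean": round(avg, 1), "spread": round(spread, 1), "busiest": busiest, "hint": hint}


report = distribution_report({"worker-node-1": 52.5, "worker-node-2": 91.0, "worker-node-3": 20.0})
print(report)
# prints {'mean': 54.5, 'spread': 71.0, 'busiest': 'worker-node-2', 'hint': 'capacity looks adequate'}
```

A wide spread with a moderate mean, as in this example, points toward rebalancing workloads rather than adding capacity.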

Best Practices

Regular Monitoring

  1. Daily Checks: Review node health matrix daily
  2. Pattern Recognition: Notice trends in resource usage
  3. Proactive Management: Address issues before they become critical
  4. Documentation: Record significant observations

Performance Optimization

  1. Load Balancing: Ensure even distribution across nodes
  2. Resource Right-sizing: Match node specs to workload needs
  3. Capacity Management: Maintain appropriate headroom
  4. Health Maintenance: Keep all nodes in healthy state

Issue Response

  1. Quick Assessment: Use health matrix for rapid issue assessment
  2. Prioritization: Focus on critical health issues first
  3. Impact Analysis: Understand how node issues affect workloads
  4. Escalation: Know when to escalate to support

Next: Learn about Resource Analytics for pod distribution and namespace analysis.