TL;DR

  1. When a Kubernetes cluster grows past 1000 nodes, the number of node-exporter Pods deployed as a DaemonSet grows with it.
  2. When Prometheus Operator performs service discovery through a ServiceMonitor, it references the Service’s Endpoints by default.
  3. Kubernetes Endpoints objects hold at most 1000 IPs by default.
  4. As a result, Prometheus keeps at most 1000 scrape targets.
  5. Prometheus should use EndpointSlices instead of Endpoints for service discovery.

As Kubernetes clusters grow and the number of nodes exceeds 1000, various challenges arise. One particularly important issue from a monitoring perspective is Prometheus Service Discovery.

For per-node components such as node-exporter (a DaemonSet), the kubelet, and cAdvisor, the number of scrape targets grows linearly with the node count. For example:

  • node-exporter: 1000 nodes = 1000 targets
  • kubelet metrics: 1000 nodes = 1000 targets
  • cAdvisor: 1000 nodes = 1000 targets

The problem lies in the limitations of Kubernetes’ default Endpoints object. A single Endpoints object can only store up to 1000 IP addresses. When nodes exceed 1000, nodes from 1001 onwards are not included in the Endpoints, resulting in Prometheus being unable to scrape them.

# kubectl get endpoints node-exporter -n monitoring -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: node-exporter
  namespace: monitoring
subsets:
  - addresses:
      - ip: 10.0.1.1
      - ip: 10.0.1.2
      # ... up to 1000 only
      - ip: 10.0.4.232
    ports:
      - port: 9100
        protocol: TCP

In this situation, metrics from some nodes are silently dropped, monitoring blind spots appear, and alerts that depend on those metrics may not fire.
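One way to catch such a blind spot early is an alert that compares the number of healthy node-exporter targets with the node count. A sketch as a PrometheusRule; the job label and the `kube_node_info` metric assume node-exporter and kube-state-metrics are installed with their usual labels:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-exporter-coverage
  namespace: monitoring
spec:
  groups:
    - name: scrape-coverage
      rules:
        - alert: NodeExporterTargetsMissing
          # Fires when fewer node-exporter targets are up than there are nodes
          expr: count(up{job="node-exporter"} == 1) < count(kube_node_info)
          for: 15m
          labels:
            severity: warning
```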

Endpoints and EndpointSlices

Kubernetes Endpoints objects store IP addresses of Pods connected to Services. However, there are several limitations:

  1. Size limitation: A single Endpoints object can store a maximum of 1000 IPs
  2. Performance issues: Large Endpoints objects burden etcd and the API server
  3. Update inefficiency: The entire Endpoints object is updated even when a single Pod changes

EndpointSlices, introduced in Kubernetes 1.16, were designed to solve these problems:

  1. Distributed storage: Stored as multiple small objects (100 endpoints per slice by default)
  2. Efficient updates: Only modified slices are updated
  3. Scalability: Can scale without node count limitations

# kubectl get endpointslices -n monitoring -l kubernetes.io/service-name=node-exporter
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: node-exporter-abc123
  namespace: monitoring
  labels:
    kubernetes.io/service-name: node-exporter
addressType: IPv4
endpoints:
  - addresses:
      - 10.0.1.1
    conditions:
      ready: true
  - addresses:
      - 10.0.1.2
    conditions:
      ready: true
# ... up to 100
ports:
  - port: 9100
    protocol: TCP
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: node-exporter-def456
  namespace: monitoring
  labels:
    kubernetes.io/service-name: node-exporter
# ... next 100
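The way endpoints spread across slices can be sketched in a few lines of Python, assuming the kube-controller-manager default of 100 endpoints per slice (`--max-endpoints-per-slice`); the real controller also rebalances slices on updates, which this toy ignores:

```python
MAX_ENDPOINTS_PER_SLICE = 100  # kube-controller-manager default (--max-endpoints-per-slice)

def slice_endpoints(ips, per_slice=MAX_ENDPOINTS_PER_SLICE):
    """Partition endpoint IPs into EndpointSlice-sized chunks."""
    return [ips[i:i + per_slice] for i in range(0, len(ips), per_slice)]

# 1500 hypothetical node IPs
nodes = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(1500)]

slices = slice_endpoints(nodes)
print(len(slices))      # 15 slices of 100 endpoints each
print(len(slices[-1]))  # 100
```

All 1500 endpoints fit, in 15 slices; a single Endpoints object would have dropped 500 of them.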

Comparison Table

| Feature           | Endpoints               | EndpointSlices                         |
|-------------------|-------------------------|----------------------------------------|
| Max IP count      | 1000                    | Unlimited (distributed across slices)  |
| Object size       | Large, single object    | Small, multiple objects                |
| Update efficiency | Low (full update)       | High (partial update)                  |
| API server load   | High                    | Low                                    |
| Introduced in     | v1.0                    | v1.16 (GA: v1.21)                      |

Endpoints API Deprecation

Starting with Kubernetes 1.33, the Endpoints API is officially deprecated: the API server returns a warning to clients that read or write Endpoints resources, steering them toward EndpointSlices. For more details, see the official Kubernetes blog.

Prometheus Operator’s serviceDiscoveryRole

When using Prometheus Operator, scrape targets are defined through the ServiceMonitor CRD. ServiceMonitor discovers targets by referencing Kubernetes Service Endpoints by default.

See the Prometheus spec in the Prometheus Operator v0.81.0 API documentation.

Starting from Prometheus Operator v0.50.0, the serviceDiscoveryRole field was introduced to allow selecting the Service Discovery method:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  serviceDiscoveryRole: EndpointSlice # default: Endpoints
  serviceMonitorSelector:
    matchLabels:
      team: platform

Possible values:

  • Endpoints: Use traditional Endpoints objects (default)
  • EndpointSlice: Use EndpointSlices

Considerations When Changing Settings

  1. Check Kubernetes version: EndpointSlices are available in k8s 1.16+, GA in 1.21+
  2. Update RBAC permissions: Need to add permissions for Prometheus to read EndpointSlices
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: ["discovery.k8s.io"]
    resources:
      - endpointslices
    verbs: ["get", "list", "watch"]
  3. Gradual migration: Validate in a test environment first before applying to production
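During validation, the number of active scrape targets can be compared with the node count. A sketch, assuming jq is installed, Prometheus is port-forwarded to localhost:9090 (e.g. via `kubectl -n monitoring port-forward svc/prometheus-operated 9090`), and the job label is `node-exporter`:

```shell
# Count active targets for the job via the Prometheus HTTP API
curl -s http://localhost:9090/api/v1/targets \
  | jq '[.data.activeTargets[] | select(.labels.job == "node-exporter")] | length'

# Compare with the actual node count
kubectl get nodes --no-headers | wc -l
```

If the two numbers diverge, some nodes are outside Prometheus’s view.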

Prometheus Operator and Kubelet Endpoints

Kubelet metrics work differently from the typical Service/Endpoints pattern. The kubelet is a system component that runs on each node rather than as a Pod, so no label selector can match it and a regular selector-based Service cannot produce its endpoints.

Prometheus Operator automatically creates and manages Endpoints objects for collecting kubelet metrics by default:

apiVersion: v1
kind: Endpoints
metadata:
  name: kubelet
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: prometheus-operator # Managed by Operator
    app.kubernetes.io/name: kubelet
    k8s-app: kubelet
subsets:
  - addresses:
      - ip: 10.0.1.1 # node1
      - ip: 10.0.1.2 # node2
    # ... can store up to 1000 only
    ports:
      - name: https-metrics
        port: 10250

The app.kubernetes.io/managed-by: prometheus-operator label indicates that these Endpoints are automatically managed by Prometheus Operator.

Issues

When Prometheus’s serviceDiscoveryRole is changed to EndpointSlice, kubelet targets are also discovered through EndpointSlices. However, the EndpointSlices for the operator-managed kubelet Endpoints come from the Kubernetes mirroring controller, which has its own 1000 limit:

  • The EndpointSlice mirroring controller mirrors (by default) a maximum of 1000 endpoints per Endpoints resource (reference)
  • With more than 1000 nodes, the mirrored EndpointSlices are missing some kubelets, so Prometheus cannot scrape all of them
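The arithmetic of the mirroring cap can be illustrated with a trivial Python sketch (1000 is the mirroring controller’s default maximum; this toy ignores multiple subsets):

```python
MIRRORING_MAX = 1000  # default cap of the EndpointSlice mirroring controller

def mirrored_and_missed(node_count, cap=MIRRORING_MAX):
    """Endpoints that get mirrored into slices vs. silently dropped."""
    mirrored = min(node_count, cap)
    return mirrored, node_count - mirrored

print(mirrored_and_missed(1500))  # (1000, 500): 500 kubelets become blind spots
```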

Solutions

1. Gradual Migration

This is the safe migration method suggested in Prometheus Operator GitHub issue #7678:

Step 1: Configure to manage both objects

# Prometheus Operator Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-operator
  namespace: monitoring
spec:
  template:
    spec:
      containers:
        - name: prometheus-operator
          args:
            - --kubelet-endpoints=true # Continue managing Endpoints
            - --kubelet-endpointslice=true # Also manage EndpointSlice

Step 2: Verify EndpointSlice object creation

# Verify that EndpointSlice was created successfully
kubectl get endpointslices -n kube-system -l kubernetes.io/service-name=kubelet

# Check created EndpointSlice
kubectl get endpointslice kubelet-8jkql -n kube-system -o yaml
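To confirm that the slices cover every node, the endpoint counts can be summed across all kubelet EndpointSlices (a sketch assuming jq is installed):

```shell
# Sum the number of endpoints across all kubelet EndpointSlices
kubectl get endpointslices -n kube-system \
  -l kubernetes.io/service-name=kubelet -o json \
  | jq '[.items[].endpoints | length] | add'

# The result should equal the node count
kubectl get nodes --no-headers | wc -l
```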

The created EndpointSlice includes the following labels:

metadata:
  labels:
    app.kubernetes.io/managed-by: prometheus-operator
    endpointslice.kubernetes.io/managed-by: prometheus-operator # Directly managed by Operator
    kubernetes.io/service-name: kubelet

Step 3: Configure Prometheus to use EndpointSlice

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
spec:
  serviceDiscoveryRole: EndpointSlice # Use EndpointSlice

Step 4: Disable Endpoints after confirming normal operation

# Disable Endpoints management when everything works normally
args:
  - --kubelet-endpoints=false # Stop creating Endpoints
  - --kubelet-endpointslice=true # Manage only EndpointSlice

2. Preventing EndpointSlice Mirroring

Endpoints managed by Prometheus Operator have special labels added:

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/managed-by: prometheus-operator
    endpointslice.kubernetes.io/skip-mirror: "true" # Prevent mirroring

The endpointslice.kubernetes.io/skip-mirror: "true" label prevents Kubernetes’s EndpointSlice mirroring controller from automatically mirroring these Endpoints to EndpointSlice. This is because Prometheus Operator directly manages the EndpointSlice.

3. Before and After Migration Comparison

Before migration (managed by Mirroring Controller):

kind: EndpointSlice
metadata:
  labels:
    endpointslice.kubernetes.io/managed-by: endpointslicemirroring-controller.k8s.io

After migration (directly managed by Prometheus Operator):

kind: EndpointSlice
metadata:
  labels:
    app.kubernetes.io/managed-by: prometheus-operator
    endpointslice.kubernetes.io/managed-by: prometheus-operator # Directly managed by Operator

Conclusion

When a Kubernetes cluster grows beyond 1000 nodes, the monitoring stack must scale with it. EndpointSlices enable stable metric collection without the 1000-IP ceiling of Endpoints. The migration itself comes down to setting Prometheus Operator’s serviceDiscoveryRole to EndpointSlice; for kubelet metrics, run with both --kubelet-endpoints and --kubelet-endpointslice during the transition, then disable the former once EndpointSlice discovery is confirmed to work.