
Rolling update problem #175

@teodor-pripoae

Hi,

I'm experiencing some issues when doing a rolling update of my ringpop cluster.

I'm running the cluster on top of Kubernetes with a headless service for peer communication. Every DNS query to this service returns a list of all ringpop IPs in the cluster.

I implemented the Kubernetes host provider like this:

// KubeProvider returns a list of hosts for a Kubernetes headless service by
// resolving the service name via DNS and appending the ringpop gossip port.
// (Imports not shown in this snippet: "fmt", "net",
// "github.com/uber/ringpop-go/discovery", and an errors package that
// provides Trace, e.g. github.com/juju/errors.)
type KubeProvider struct {
    svc  string // DNS name of the headless service
    port int    // gossip port appended to every resolved IP
}

// NewKubeProvider builds a discovery.DiscoverProvider backed by the
// headless service's DNS records.
func NewKubeProvider(svc string, port int) discovery.DiscoverProvider {
    provider := &KubeProvider{
        svc:  svc,
        port: port,
    }
    return provider
}

// Hosts resolves the service name and returns one "ip:port" entry for
// every A record currently published for the headless service.
func (k *KubeProvider) Hosts() ([]string, error) {
    addrs, err := net.LookupHost(k.svc)
    if err != nil {
        return nil, errors.Trace(err)
    }

    for i := range addrs {
        addrs[i] = fmt.Sprintf("%s:%d", addrs[i], k.port)
    }
    return addrs, nil
}
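
For reference, the provider is passed to ringpop roughly like this (a minimal sketch, not verbatim from my service; it assumes ringpop-go's swim.BootstrapOptions, an existing *ringpop.Ringpop instance rp, and a placeholder service name and port):

provider := NewKubeProvider("myservice.default.svc.cluster.local", 18080)
opts := &swim.BootstrapOptions{
    // Seed hosts come from the headless-service DNS lookup above.
    DiscoverProvider: provider,
    // Matches the 2m maxJoinDuration visible in the logs below.
    MaxJoinDuration: 2 * time.Minute,
}
if _, err := rp.Bootstrap(opts); err != nil {
    log.Fatalf("ringpop bootstrap failed: %v", err)
}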

During a rolling update, old ringpop services are stopped one by one and new ringpop services are created with different IPs. When a new ringpop service starts, it may see a mix of old and new IPs in the hosts list.

I'm running two instances in the cluster right now; one simply fails to start:

{"level":"info","msg":"GossipAddr: 10.244.3.5:18080","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"cappedDelay":60000,"initialDelay":100000000,"jitteredDelay":58434,"level":"warning","local":"10.244.3.5:18080","maxDelay":60000000000,"minDelay":51200,"msg":"ringpop join attempt delay reached max","numDelays":10,"time":"2016-08-18T17:45:01Z","uncappedDelay":102400}
{"joinDuration":134374138254,"level":"warning","local":"10.244.3.5:18080","maxJoinDuration":120000000000,"msg":"max join duration exceeded","numFailed":12,"numJoined":0,"startTime":"2016-08-18T17:43:46.377647091Z","time":"2016-08-18T17:46:00Z"}
{"err":"join duration of 2m14.374138254s exceeded max 2m0s","level":"error","local":"10.244.3.5:18080","msg":"bootstrap failed","time":"2016-08-18T17:46:00Z"}
{"error":"join duration of 2m14.374138254s exceeded max 2m0s","level":"info","msg":"bootstrap failed","time":"2016-08-18T17:46:00Z"}
{"level":"fatal","msg":"[ringpop bootstrap failed: join duration of 2m14.374138254s exceeded max 2m0s]","time":"2016-08-18T17:46:00Z"}

The other one attempts to connect to the first and fails all the periodic health checks:

{"level":"info","msg":"GossipAddr: 10.244.1.7:18080","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"joined":["10.244.3.5:18080","10.244.3.5:18080"],"level":"info","msg":"bootstrap complete","time":"2016-08-18T17:43:49Z"}
{"level":"info","msg":"Running on :8080 using 1 processes","time":"2016-08-18T17:43:49Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.0.6:18080","time":"2016-08-18T17:43:49Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"ping request target unreachable","target":"10.244.3.5:18080","time":"2016-08-18T17:43:49Z"}
{"error":"join timed out","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:43:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"ping request target unreachable","target":"10.244.3.5:18080","time":"2016-08-18T17:43:50Z"}
{"level":"info","local":"10.244.1.7:18080","member":"10.244.3.5:18080","msg":"executing scheduled transition for member","state":"suspect","time":"2016-08-18T17:43:54Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:43:54Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:44:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:44:20Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:44:20Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:20Z"}
{"latency":"1.323232ms","level":"info","method":"GET","msg":"","request_id":"e21bcc3f05fa04449ab4b9f0520e0933","time":"2016-08-18T17:44:30Z","url":"/_internal/cluster-info"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:30Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:44:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:44:50Z"}
{"error":"JSON call failed: map[message:node is not ready to handle requests type:error]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:44:50Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:45:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:45:20Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:45:20Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:45:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:45:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:45:50Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:45:50Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:45:50Z"}

Eventually, the first pod times out, is restarted by the cluster manager, and then successfully connects to the second pod.

Is this related to #146?
