
Rolling update problem #175

@teodor-pripoae

Hi,

I'm experiencing some issues when doing a rolling update of my ringpop cluster.

I'm running the cluster on top of Kubernetes with a headless service for peer communication. Every DNS query to this service returns a list of all ringpop IPs in the cluster.

I implemented the Kubernetes host provider like this:

// KubeProvider returns a list of hosts for a Kubernetes headless service by
// resolving the service name via DNS and appending the ringpop gossip port.
// (Imports not shown in this snippet: "fmt", "net",
// "github.com/uber/ringpop-go/discovery", and an errors package that
// provides Trace, e.g. github.com/juju/errors.)
type KubeProvider struct {
    svc  string // DNS name of the headless service
    port int    // gossip port appended to every resolved IP
}

// NewKubeProvider builds a discovery.DiscoverProvider backed by the
// headless service's DNS records.
func NewKubeProvider(svc string, port int) discovery.DiscoverProvider {
    provider := &KubeProvider{
        svc:  svc,
        port: port,
    }
    return provider
}

// Hosts resolves the service name and returns one "ip:port" entry for
// every A record currently published for the headless service.
func (k *KubeProvider) Hosts() ([]string, error) {
    addrs, err := net.LookupHost(k.svc)
    if err != nil {
        return nil, errors.Trace(err)
    }

    for i := range addrs {
        addrs[i] = fmt.Sprintf("%s:%d", addrs[i], k.port)
    }
    return addrs, nil
}
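
For reference, the provider is passed to ringpop roughly like this (a minimal sketch, not verbatim from my service; it assumes ringpop-go's swim.BootstrapOptions, an existing *ringpop.Ringpop instance rp, and a placeholder service name and port):

provider := NewKubeProvider("myservice.default.svc.cluster.local", 18080)
opts := &swim.BootstrapOptions{
    // Seed hosts come from the headless-service DNS lookup above.
    DiscoverProvider: provider,
    // Matches the 2m maxJoinDuration visible in the logs below.
    MaxJoinDuration: 2 * time.Minute,
}
if _, err := rp.Bootstrap(opts); err != nil {
    log.Fatalf("ringpop bootstrap failed: %v", err)
}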

During a rolling update, old ringpop services are stopped one by one and new ringpop services are created with different IPs. When a new ringpop service starts, it may see a mix of old and new IPs in the hosts list.

I'm running two instances in the cluster right now; one simply fails to start:

{"level":"info","msg":"GossipAddr: 10.244.3.5:18080","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"cappedDelay":60000,"initialDelay":100000000,"jitteredDelay":58434,"level":"warning","local":"10.244.3.5:18080","maxDelay":60000000000,"minDelay":51200,"msg":"ringpop join attempt delay reached max","numDelays":10,"time":"2016-08-18T17:45:01Z","uncappedDelay":102400}
{"joinDuration":134374138254,"level":"warning","local":"10.244.3.5:18080","maxJoinDuration":120000000000,"msg":"max join duration exceeded","numFailed":12,"numJoined":0,"startTime":"2016-08-18T17:43:46.377647091Z","time":"2016-08-18T17:46:00Z"}
{"err":"join duration of 2m14.374138254s exceeded max 2m0s","level":"error","local":"10.244.3.5:18080","msg":"bootstrap failed","time":"2016-08-18T17:46:00Z"}
{"error":"join duration of 2m14.374138254s exceeded max 2m0s","level":"info","msg":"bootstrap failed","time":"2016-08-18T17:46:00Z"}
{"level":"fatal","msg":"[ringpop bootstrap failed: join duration of 2m14.374138254s exceeded max 2m0s]","time":"2016-08-18T17:46:00Z"}

The other one attempts to connect to the first and fails all the periodic health checks:

{"level":"info","msg":"GossipAddr: 10.244.1.7:18080","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"level":"error","msg":"unable to count members of the ring for statting: \"ringpop is not bootstrapped\"","time":"2016-08-18T17:43:46Z"}
{"joined":["10.244.3.5:18080","10.244.3.5:18080"],"level":"info","msg":"bootstrap complete","time":"2016-08-18T17:43:49Z"}
{"level":"info","msg":"Running on :8080 using 1 processes","time":"2016-08-18T17:43:49Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.0.6:18080","time":"2016-08-18T17:43:49Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"ping request target unreachable","target":"10.244.3.5:18080","time":"2016-08-18T17:43:49Z"}
{"error":"join timed out","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:43:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"ping request target unreachable","target":"10.244.3.5:18080","time":"2016-08-18T17:43:50Z"}
{"level":"info","local":"10.244.1.7:18080","member":"10.244.3.5:18080","msg":"executing scheduled transition for member","state":"suspect","time":"2016-08-18T17:43:54Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:43:54Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:44:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:44:20Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:44:20Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:20Z"}
{"latency":"1.323232ms","level":"info","method":"GET","msg":"","request_id":"e21bcc3f05fa04449ab4b9f0520e0933","time":"2016-08-18T17:44:30Z","url":"/_internal/cluster-info"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:30Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:44:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:44:50Z"}
{"error":"JSON call failed: map[message:node is not ready to handle requests type:error]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:44:50Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:44:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:45:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:45:20Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:45:20Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:45:20Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"attempt heal","target":"10.244.3.5:18080","time":"2016-08-18T17:45:50Z"}
{"level":"info","local":"10.244.1.7:18080","msg":"reincarnate nodes before we can merge the partitions","target":"10.244.3.5:18080","time":"2016-08-18T17:45:50Z"}
{"error":"JSON call failed: map[type:error message:node is not ready to handle requests]","failure":0,"level":"warning","local":"10.244.1.7:18080","msg":"heal attempt failed (10 in total)","time":"2016-08-18T17:45:50Z"}
{"level":"warning","local":"10.244.1.7:18080","msg":"no pingable members","time":"2016-08-18T17:45:50Z"}

Eventually, the first pod times out, is restarted by the cluster manager, and then successfully connects to the second pod.

Is this related to #146?
