Skip to content

Conversation

@jonathanelbailey
Copy link

@jonathanelbailey jonathanelbailey commented Feb 11, 2025

Description

This PR fixes a known issue when Emissary is configured to connect to an Istio mTLS network. When Emissary's Istio sidecar initiates a cert rotation the istio-certs secret generates a cache entry, but does not generate a delta since the istio-certs cache entry does not map to an actual resource on the Kubernetes cluster. This can eventually resolve itself if a delta is generated that can initiate a change, but other times it results in an unrecoverable error that can cause traffic disruption without custom health checks.

Related Issues

#4744

Testing

Checklist

  • Does my change need to be backported to a previous release?

    • What backport versions were discussed with the Maintainers in the Issue?
  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated CONTRIBUTING.md with any special dev tricks I had to use to work on this code efficiently.

  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. python Pull requests that update Python code t:bug Something isn't working labels Feb 11, 2025
@jonathanelbailey jonathanelbailey force-pushed the fix_istio_cert_rotation_bug_issue_4744 branch from 3d181ae to c2e36f2 Compare February 11, 2025 04:27
@kflynn
Copy link
Member

kflynn commented Feb 14, 2025

Hey @jonathanelbailey, are you on the CNCF Slack by any chance? I'd like to chat about this one... 🙂

Copy link
Member

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning toward taking this but there are some corner cases I'm thinking about -- nice work in any case, many thanks! 🙂 I'm most curious, I think, about the test case, and about what testing you've put this through in general?

Thanks!! and happy to talk on Slack if you'd rather.

# We have a cache. Start by assuming that we'll need to reset it,
# OK. If we don't have a cache and there are no deltas, just skip all this crap.
if (cache and fetcher.deltas) is not None:
# We have a cache and deltas is non null. Start by assuming that we'll need to reset it,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that fetcher.deltas is indeed always non-None, I think that this comment and the default for reset_cache below need changing. Probably we can just default reset_cache to True.

# Yes. We're going to walk over them all and assemble a list
# of things to delete and a count of errors while processing our
# list.
# Yes. We're going to walk over them all and assemble a list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Yes. We're going to walk over them all and assemble a list
# We're going to walk over all the deltas and assemble a list

]
)
def test_check_deltas(self, name, cache_entry, deltas, expected, caplog):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly question: does this test fail without your change? 🙂

@emissary-ingress emissary-ingress deleted a comment from kflynn Mar 25, 2025
Comment on lines +138 to +139
# OK. If we don't have a cache and there are no deltas, just skip all this crap.
if (cache and fetcher.deltas) is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Restoring the comment that @the-wondersmith accidentally deleted 😂):

This logic feels a little weird – overall I think you're going for 'if we have a cache and there are deltas", and I think that'd be better expressed as

if (cache is not None) and fetcher.deltas:

My reasoning is that fetcher.deltas actually cannot ever be None – it's either an empty list or a list with some things in it.

@ppeble
Copy link
Member

ppeble commented Aug 28, 2025

@jonathanelbailey Hello! Would you be willing to take a look at @kflynn's comments?

Additionally, could you please target this MR at the dev/v4 branch? We would like to include your change in the next major release, which we are actively working towards. If you put it into the other branch then we won't have to cherry pick it from master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python Pull requests that update Python code size:L This PR changes 100-499 lines, ignoring generated files. t:bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants