How do you test your plugin?

Hi everyone,

I’m keen to hear how people are testing their plugins - particularly around iterating on OpenAPI spec descriptions and description_for_model.

When you make changes, how do you test them and gain confidence in their impact, particularly given how inconsistent ChatGPT can be?

  • Do you have a script of prompts you test and run multiple times?
  • Do you have any automated testing here?
  • Do you have a staging plugin which some users have access to?

I’d love to build an idea of what the best-in-class approach to this is.

Edit: I’m particularly keen to hear how people are testing plugins that are already in production.

Have you seen this?

Beta Tester Matching: Community Project for Helping one Another Beta Test


Hey! Yes, that looks like a great option for unreleased plugins, but I’m not sure it helps with testing plugins that are already on the store?

Sorry, I wasn’t super clear in my original post, but I’m especially keen to learn how people are testing new versions of production plugins, i.e. before changes are released to real users.

Hi James, we aren’t doing this yet, but our plan has been to use Statsig for feature gating and to A/B test variations of the API spec structure and descriptions. The challenge with variation testing is that either your number of samples needs to be high, or the variance in impact needs to be large. Still, in practice, early variation testing tends to expose larger swings in impact, so it is worth trying even if you don’t have millions of users. (Fair disclosure: the CEO of Statsig and I worked together at Facebook for 11 years, so I am biased towards their approach to this aspect of product building.) I’d be very interested to learn if you end up trying something like this, with or without the backend infra and tooling from a company like theirs.
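To make that concrete, here is a minimal sketch of the serving side only - not our actual setup - using a deterministic hash split as a stand-in for a Statsig gate or experiment check. The x-user-id header and the spec file paths are assumptions for illustration.

```python
# Rough sketch: bucket incoming spec requests into two variants with a
# deterministic hash (a stand-in for an experiment-SDK assignment call) and
# serve a different openapi.yaml per bucket. The "x-user-id" header and the
# spec file paths are illustrative assumptions, not a real plugin's layout.
import hashlib
from pathlib import Path

from fastapi import FastAPI, Request, Response

app = FastAPI()

SPEC_VARIANTS = {
    "control": Path("specs/openapi_control.yaml"),      # current production spec
    "treatment": Path("specs/openapi_treatment.yaml"),  # reworded descriptions
}

def assign_variant(key: str) -> str:
    """Deterministically map an identifier to a 50/50 variant split."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

@app.get("/.well-known/openapi.yaml")
def serve_spec(request: Request) -> Response:
    # Key the split on whatever stable identifier is actually available to you.
    key = request.headers.get("x-user-id") or request.client.host
    variant = assign_variant(key)
    # Log the exposure so downstream metrics can be compared per variant.
    print(f"spec_variant_exposure key={key} variant={variant}")
    return Response(SPEC_VARIANTS[variant].read_text(), media_type="text/yaml")
```

In practice you’d swap assign_variant for the experiment SDK’s own assignment call and send exposures and outcome metrics there rather than printing them.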


Interesting. How would you A/B test the API spec? Serve a different openapi.yaml to different users?

The challenge with variation testing is that either your number of samples needs to be high, or the variance in impact needs to be large.

Yes, I think that I might need a bit more traffic before taking that approach, but I’ll bear it in mind.

Yes – that was the idea I was suggesting. And I agree that numbers need to be higher than a trickle.

I’ll add that the current behavior gives you a lot of flexibility on the API-response side. It’s easier to vary behavior through responses, because in many cases you don’t need to change the API spec at all.
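As a purely illustrative sketch of that idea (the endpoint, header, and field names are made up, not from this thread): keep the published spec identical and vary only what the model sees in the response payload.

```python
# Sketch of varying behavior through the API response rather than the spec:
# both variants return schema-compatible payloads, but one adds an extra
# natural-language hint for the model to read. Names are assumptions.
from fastapi import FastAPI, Request

app = FastAPI()

RESPONSE_HINTS = {
    "control": None,
    "treatment": "Summarize the top three tasks and offer to set reminders.",
}

@app.get("/tasks")
def list_tasks(request: Request) -> dict:
    variant = request.headers.get("x-variant", "control")  # hypothetical assignment
    payload = {"tasks": [{"id": 1, "title": "Write release notes"}]}
    hint = RESPONSE_HINTS.get(variant)
    if hint:
        # The model sees this text in the tool response; the published
        # openapi.yaml stays identical for both variants.
        payload["assistant_hint"] = hint
    return payload
```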

I’m curious what testing approaches you had in mind?


I’m curious what testing approaches you had in mind?

I’m experimenting with API responses, and updating description_for_model less frequently (due to the need to resubmit).

Ideally I’d like to have a staging/beta version of the plugin for select users, so I can get a better idea of what impact changes have before rolling them out more widely - i.e. a TaskML Nightly/Beta.

Failing that, perhaps some kind of automated testing to cover common user flows.
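For what it’s worth, one rough sketch of what that automated layer could look like: run a fixed prompt set against the Chat Completions API with function definitions derived from the plugin spec, several times per prompt, and check whether the model picks the expected operation. This only approximates ChatGPT’s plugin routing, and the model name, prompts, and searchTasks operation below are made-up examples.

```python
# Rough sketch of an automated "did the model pick the right operation" check.
# Each prompt runs several times to account for inconsistency. Model name,
# prompts, and the searchTasks operation are illustrative assumptions; this
# approximates, not reproduces, ChatGPT's plugin behavior.
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "searchTasks",  # hypothetical operationId from the spec
            "description": "Search the user's tasks by keyword.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

TEST_CASES = [
    ("What's on my task list about the quarterly report?", "searchTasks"),
    ("Tell me a joke", None),  # should not trigger the plugin at all
]

def run_suite(runs_per_case: int = 5) -> None:
    for prompt, expected in TEST_CASES:
        hits = 0
        for _ in range(runs_per_case):
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                tools=TOOLS,
                tool_choice="auto",
            )
            calls = resp.choices[0].message.tool_calls or []
            chosen = calls[0].function.name if calls else None
            hits += int(chosen == expected)
        print(f"{prompt!r}: expected={expected}, matched {hits}/{runs_per_case}")

if __name__ == "__main__":
    run_suite()
```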

Ah, I like the beta version for select users. This addresses the risk of being out of the store only to learn the update was not a winner. I built Facebook’s mobile platform back in 2011-2014, and can also speak to patterns like this one emerging and helping inform platform vendors (like OpenAI) about feature needs.
