Fully Autonomous AI Software Engineer Devin

Here’s my video response to Devin. It’s for WebGPT🤖, a Custom GPT available free to ChatGPT Plus subscribers. It actually builds a fully functional top-down shooter in under 20 minutes, outperforming Devin’s Pong demonstration.

You can try the game it built here:

And here’s the video demonstration for anybody doubting its capabilities.

I offer this rebuttal because this company, in claiming to be “the first” to do this, isn’t remotely close to being the first, and it’s not even the best. And I’ve got zero funding.

I’m tired of the Silicon Valley membership club getting all the visibility, funding, and recognition (especially on these boards) while true-blue bootstrapped entrepreneurs building on this platform go unnoticed and unrecognized.


How does yours perform on SWE-bench?

One of the reasons they get attention is that they publish.

Publish something similar to this,

and you’ll get more attention.

Also, having an established track record helps to elevate you above the fray.

To be clear, I’m not saying great things can’t come from outsiders, just that outsiders lack the credibility that comes with being a large outfit with several PhDs on staff.

It’s the same reason a random paper on arXiv claiming some breakthrough on twin primes doesn’t make the news the way it does when Terence Tao writes something about it.

Statistically speaking, you are less likely to find anything of value after reading a million random papers on twin primes from amateur outsiders than you are to find nothing of value in a blog post about twin primes from Tao.

The nice thing is, as an outsider, it’s relatively easy to establish credibility: publish.

Publish your sources, code, and results so anyone can run a Python notebook and replicate your results.

Is it unfair that larger, more established outfits can just report results? Yeah, it is, but that’s because they have already established credibility.

Do that enough and you’ll establish credibility; then people will pay attention when you claim to have something.


One of the reasons I wouldn’t publish something like that is that I reject, on principle, the technique they’re using. I think that report is a gross misrepresentation of their achievements relative to the other benchmarks available, with the obvious intent of “generate as much buzz around us as possible at all costs, damn the ethics.” I vehemently reject this philosophy.



Just had a quick look at SWE-bench, and I like the idea of having these systems tested by solving GitHub issues.

I just want to point out that they scored 13.86%, which is much higher than other systems but still far below what I would call a “fully autonomous AI software engineer.”

Respectfully, I think you’re the one missing the point (well, making really bad points).

To be clear, their writing a blog post with an extremely misleading graph isn’t “publishing” in any broader sense than what I’ve done. I, in contrast, “published” a video demonstrating my product doing the thing theirs implied it could do.

There’s nothing intrinsically more share-able about their blog versus what I’ve shared.

The difference in how much wider their message has spread is directly proportional to the marketing dollars and publicity they were able to drum up (by spending money), thanks to their $12M round of financing from Peter Thiel. That round wasn’t closed because they published something; they hadn’t published anything when they closed it.

All of this is to say, I think your advice stunk, is all. And I didn’t ask for it. That’s why I didn’t agree with you: it didn’t make any clear sense to me.

Yes, it’s also very misleading.

  1. The number of tokens I suspect it cost them to achieve this rate, and the number of throwaway attempts they had to verify before finding a branch that worked, are probably prohibitive. They didn’t talk about that in their “report.”
  2. From what I know from building a system more capable than theirs, they’re misleading in a number of ways in how they characterize its autonomy, again in the interest of rushing to revenue generation instead of committing to the tech.
  3. This isn’t to say there’s nothing to Devin. There’s clearly something there. But it’s not what they’ve implied, and meanwhile I’m delivering exactly what I claim and am met with veiled criticism (not from you, but generally on this board), which I find perplexing and frustrating.

Interesting point. I’m not sure we agree on what it means to be committed to the tech vs. revenue generation, but there are more benchmarks for people to look at over here:


One thing to consider: I actually agree with your point that there is an SF-centric concentration of activity. However, there is a reason for that.

If you are a huge entity (a Microsoft, an Alphabet, etc.), then you can push to change the way things are done and who gets more or less attention. The reality is you need to play the game: publish a paper (even a white paper will get attention), show what your product can do in a commercially acceptable way, and you will generate attention. Throw a couple hundred bucks at EIN Presswire or any of the reputable business-centric press release companies and you are off to the races.


Wait, so you’re saying Devin is not a new, unique LLM-something (possibly more than just an LLM, some breakthrough tech?) but just a wrapper for existing APIs, like AutoGPT/CrewAI are? :exploding_head: I didn’t look into it much, but I thought it was a unique new product. Isn’t it?

Use ChatGPT as a code assistant for a few days, and then come back and tell us if you think it will replace you as a software engineer.

I watched that video by the Nvidia CEO boldly stating that “coding is dead”. Well, I look at it the same way I look at the statement that “gasoline engines are dead”. Yes, one day. For sure. But not anytime soon.

I met a woman a couple of weeks ago who works in the IT department of a large bank, coding in COBOL. Yep. That’s because many of the large banks still use old mainframes for overnight bulk processing. Do you think these corporations are all going to go out and replace their existing “working” and “cost-effective” systems with new, expensive hardware just so they can use Nvidia chips? I’m not talking about FAANG and other high-tech corps that survive on venture capital and stock market speculation. I’m talking about the rest of the world’s companies, which actually have to show PROFIT on their balance sheets to survive. I rather think that in the short term they are going to figure out how to use the new technology to reduce their costs and boost their bottom lines. That will mean, in the vast majority of cases, at least in the near future, optimizing the resources they already have.

And this is the reality we are dealing with before we even begin to broach the question: do we really want to take away all human oversight from future software development?


I wish coding was dead!

I wanted to build this thing but didn’t want to do the coding.

I cannot for the life of me figure out how to get GPT to build it beyond some very trivial scaffolding that I could do myself.

Don’t get me wrong, it’s a great dev tool, but unless you’re a senior engineer, you’re not going to be able to build anything competitive with it.

100%. It’s actually funny to see the public and non-coders excited about those ‘word mixers’, which is what LLMs are. The time when they become at least “somehow useful” in projects larger and more complex than “snake game in python” and “make me a sudoku game in JS” is approaching, BUT it is not here yet. Those of us who do actual coding (quality business software and the like) enjoy the small help of fetching quick snippets of well-known stuff using Copilot, which saves a bit of time vs. looking it up in the docs (for whatever framework), but even that simple ‘assistance’ is flawed by pulling outdated and wrong info (LLMs are “word-mixers by probability score”, nothing else; how in the world would they help in real coding and business logic tasks?).

So at the moment, the real-world situation is that we developers are waiting for better-quality AIs and LLMs (or whatever new tech data science comes up with) so that they will AT LEAST help with skimming the docs for info! So far they can’t even do that simple task. I’m not even thinking about “when it will replace developers”; the stage we are all waiting for is so freakin’ far away. It’s not 2-3 years away, more like 10-12 IMHO. And all the AGI talk on YouTube etc. is obviously for the masses (for people who know no better than to binge YouTube bs all day :laughing: well, let them enjoy it, why not, it’s a form of entertainment after all; the AGI topic is no less fun than ‘illuminati and pyramids’ and sh%t, and everyone is bored watching the same bs about the pyramids, so we have new entertainment about AGI coming soon to kill us all :joy_cat: ).


You can try ‘[link removed by Moderator]’ and ‘langchain’ agents working together to weave the final result if it’s a very simple app or website boilerplate (nothing unique: the moment you introduce a unique concept or approach that the LLMs are not familiar with from their training data, it all falls apart; they can only build something they’ve seen, or seen parts of, in the training set). But the skill required to prompt the chains of ‘agents’ correctly for the desired output is higher than the skill to simply write it yourself in whatever language it has to be built with. So yeah, unfortunately no ‘zero code’ building with LLMs exists yet. And ‘Devin’ is just a fun twist on top of the ‘agents’ concept: link agent outputs into the inputs of the next agent, let them chew it over and over, give them tools to call like Google Search, Perplexity, etc., and display the data in the process so it looks cool for the masses to subscribe and pay $$ believing in AI magic. Actually pretty smart, all of that buzz; good money is being made on subscriptions to various AI services.
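For anyone who hasn’t seen the pattern, here is a minimal sketch of the “link agent outputs into the inputs of the next agent” idea. It is purely illustrative: the agents are plain Python functions standing in for LLM calls, and `planner`, `coder`, `reviewer`, `researcher`, and `fake_search` are hypothetical names. No real LLM API, LangChain, or Devin internals are used or implied.

```python
from typing import Callable, List

# Hypothetical: each "agent" is a function from text to text.
# In a real system each step would call an LLM instead.
Agent = Callable[[str], str]


def fake_search(query: str) -> str:
    # Hypothetical tool stub; a real agent might call a web search API here.
    return f"(search results for '{query}')"


def researcher(task: str) -> str:
    # An agent that "calls a tool" and appends the result to its input.
    return f"{task} + {fake_search(task)}"


def planner(task: str) -> str:
    return f"PLAN for: {task}"


def coder(plan: str) -> str:
    return f"CODE implementing [{plan}]"


def reviewer(code: str) -> str:
    return f"REVIEWED {code}"


def run_chain(agents: List[Agent], task: str, rounds: int = 1) -> str:
    """Feed each agent's output into the next agent's input, and optionally
    repeat the whole chain several times ("chew it over and over")."""
    text = task
    for _ in range(rounds):
        for agent in agents:
            text = agent(text)
    return text


if __name__ == "__main__":
    print(run_chain([researcher, planner, coder, reviewer], "snake game in python"))
```

The point of the sketch is that the orchestration itself is simple plumbing; whatever capability such a system has comes from the model behind each step, not from the chaining.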


Well said @HappyQuokka !!!

As a developer, I’ve been trying to get Gemini Pro, with its vaunted 1M-token context, to actually do something useful with large amounts of text, like render a hierarchical breakdown (you know, “large language” stuff). Yet it consistently fails. And when I try to get help on the Gemini API Discord, outside of the couple of people there who try to help everyone, there are crickets, because everybody else is using it to create snake games and word puzzles and find the cat in the picture, etc. I am amazed how few people use large language models to solve actual language issues.

It’s amazing. All this hype everywhere about how it’s going to take over the world, and AI can’t even accurately tell you the character position or line number in a document where a specific phrase is written. That may seem meaningless to most people reading this, but if you intend to use this AI to solve actual business problems, having it tell you the answer is somewhere in a 1,000-page document really isn’t much help, since you’ve still got to do a keyword search to corroborate. Giving you the exact page and paragraph (preferably with a link) is far more useful in the real world.
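For comparison, the “keyword search to corroborate” step is trivial to do deterministically. A minimal sketch, assuming plain text with newline-separated lines and exact, case-sensitive matching (`find_phrase` is a hypothetical helper name, not part of any library), that returns the 1-based line number and column of every occurrence plus its character offset:

```python
from typing import List, Tuple


def find_phrase(document: str, phrase: str) -> List[Tuple[int, int, int]]:
    """Return (line_number, column, char_offset) for every occurrence of phrase.

    line_number and column are 1-based; char_offset is the 0-based index
    into the whole document. Matching is exact and case-sensitive, and
    overlapping occurrences are reported.
    """
    if not phrase:
        raise ValueError("phrase must be non-empty")
    hits: List[Tuple[int, int, int]] = []
    start = 0
    while True:
        offset = document.find(phrase, start)
        if offset == -1:
            break
        # Line number = newlines before the match, plus one.
        line_number = document.count("\n", 0, offset) + 1
        # Start of the current line (rfind returns -1 on the first line).
        line_start = document.rfind("\n", 0, offset) + 1
        hits.append((line_number, offset - line_start + 1, offset))
        start = offset + 1  # step by one so overlapping matches are found
    return hits


if __name__ == "__main__":
    doc = "alpha\nbeta gamma\ngamma"
    print(find_phrase(doc, "gamma"))  # [(2, 6, 11), (3, 1, 17)]
```

Mapping these positions to pages and paragraphs would additionally need the document’s layout (page breaks, paragraph boundaries), which plain text does not carry.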

When GPT-4 first came out, I spent a lot of time trying to make this happen.

I think the problem is that GPT-4 simply can’t hold a large idea in its head in an effective way.

There might be a way: fine-tuning GPT-4 to build some specific thing, i.e., you fine-tune the idea into the model. Something to try.

As for how far away that stage is, I don’t know. Breakthroughs are possible, and certainly a lot of people are focused on this. With what we have now, however, it doesn’t seem doable.

What do you think of SWE-bench? Swe-bench: Very exciting eval, looking for SOTA - #5 by N2U


What’s up, my friends. Small word of advice: complain less and do more. You all sound like the type of people who are wasting their time complaining and beating yourselves up because you’re not changing your sharts and getting out there. Dudes, you can be whatever you want to be and go wherever you want to go. Don’t be an ignorant, scared wobble knee. Nut up and get some. Do some good. A lot of people are, and you could be one of them. First step: stop wasting your time looking for fellow glum bums on the OpenAI forum. Second step: recognize that every era throughout time has had doomsdayers thinking like you. When the railroad came out, a medical doctor said pregnant women’s uteruses would fly out of their box when the train took that 40 mph corner (the humanity of those evil technocrat robber baron swineses, how could they🙀). Tons of people got burned at the stake because they invented Satan’s heaven portaler (a telescope). Ah, then there was the thing that was super f’d up beyond Satan’s magic, and that was of course squiggly lines on animal hide, also known as the great evil: writing and reading. Start reading; don’t take my word for it. All you guys complaining have got this. Just do it.