No, nothing to do with eating Brussels sprouts. I was provoked this morning by reading about upcoming developments in Copilot AI inside GitHub (Microsoft), which would seem a natural match. It made me think how far we have come in my memory, and speculate about where we will go next. A.I. has moved so quickly that, just as I start to grok it, it morphs into something else, like the hydra, but maybe in a good way.

My first program was written in secondary school in Mr. Copley’s class. Mr. Copley was famous for his R.E. classes, so it was a little strange that he should pop up in the maths department. I had already learned some things from Open University TV about Boolean logic, analogue vs digital, and simple programming languages like BASIC and Algol, but I had yet to get the baptism of running my first program.

The teacher gave us the basic outline and demonstrated how Fortran did its thing with some examples. Then we were told to get out our notebooks and write a simple program. Next lesson, we got some punch cards and a magnetic pencil and converted each symbol into three marks in every column. These were sent away somewhere and returned together with a printout of the code. I don’t remember a hand punch entering the loop anywhere. Long story short, I eventually got the output from my program. For the record, it was calculating various scenarios of the taps and baths problem. It worked!
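For anyone who never met it at school, the taps and baths problem is the old puzzle of taps that each fill a bath in a given time, sometimes with the plug out. A minimal sketch of the sort of calculation that first Fortran program would have done (the exact scenarios are long forgotten, so these numbers are made up):

```python
def fill_time(fill_minutes, drain_minutes=None):
    """Minutes to fill the bath, given taps that each fill it alone
    in the listed times, optionally with the plug out (draining)."""
    rate = sum(1.0 / m for m in fill_minutes)  # baths per minute, filling
    if drain_minutes is not None:
        rate -= 1.0 / drain_minutes            # plug out: subtract the drain rate
    if rate <= 0:
        raise ValueError("bath never fills")
    return 1.0 / rate

# Hot tap fills in 20 min, cold in 30 min, plug out empties it in 60 min:
print(fill_time([20, 30], drain_minutes=60))  # ≈ 15.0 minutes
```

The whole trick, then as now, is working in rates (baths per minute) rather than times, so the contributions simply add and subtract.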

There I left it: through sixth form, nothing. University, nothing. Postgrad, back to Fortran, a bit of PDP assembler, and some interactive BASIC. I never saw a graphics terminal until someone in a lab acquired a Commodore PET. It played poker.

Eventually I got a job and became a Junior Programmer. With 25 or so years in the industry, partly as a contractor, I was able to see from close up how it evolved. The PC was a new thing, which I didn’t acquire for many years, but I saw technologies, languages, and paradigms come and go. Languages included COBOL, DIBOL, Fortran, C, C++, BASIC, and a few oddballs. I loved it. I would make sure to get my copy of Byte every month, and got an AOL account. Those were the days of buying a book, or a bit of software, then emailing the author and expecting a reply.

Now for A.I. It seemed at first to be something I didn’t understand, which people used to make funny photos and present them to you. It was quite a buzz, and then came the rise of the chatbots. As for whether the proverbial Mother or Grandmother is using ChatGPT on a regular basis: I think not.

The news today, enthusing about what the beta versions from the GitHub Copilot team could do, left me a bit stunned.

Over to Grok.

GitHub is home to over 420 million repositories as of late 2024, hosting code written by more than 150 million users. These repositories contain an immense volume of code across countless programming languages, frameworks, and projects. While GitHub doesn’t publicly disclose the exact number of lines of code stored on its platform, we can make an educated guess based on available data and trends.

A single moderately sized repository might contain anywhere from a few thousand to tens of thousands of lines of code, while large projects—like open-source frameworks or enterprise software—can easily exceed millions of lines. For instance, the Linux kernel alone has over 30 million lines of code. If we conservatively assume an average of 10,000 lines per repository (accounting for both small scripts and larger projects), multiplying that by 420 million repositories gives us a rough estimate of 4.2 trillion lines of code on GitHub. This is likely an underestimate, as many repositories are far larger, and GitHub’s growth continues to accelerate.

Copilot, however, wasn’t trained on every single line of code in every GitHub repository. It was developed using a subset of publicly available code, filtered and processed by OpenAI to create its underlying model, Codex. This training data includes billions of lines of code from public GitHub repositories, though the exact figure isn’t disclosed. Research and industry reports suggest that large language models like Codex are trained on datasets containing tens to hundreds of billions of lines of code, drawn not just from GitHub but also from other public sources like Stack Overflow and open-source archives. Given GitHub’s dominance as the largest code hosting platform, it’s reasonable to estimate that Copilot’s training data includes at least 50 to 100 billion lines of code from GitHub alone.
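Grok’s back-of-envelope arithmetic checks out, for what it’s worth. Here it is in a few lines of Python, with all inputs being the assumptions quoted above, not measured figures:

```python
# Inputs are the estimate's stated assumptions, not real measurements.
repos = 420_000_000          # public GitHub repositories, late 2024
avg_lines_per_repo = 10_000  # assumed average, small scripts to huge projects

total_lines = repos * avg_lines_per_repo
print(f"{total_lines:,} lines, i.e. {total_lines / 1e12:.1f} trillion")
# 4,200,000,000,000 lines, i.e. 4.2 trillion
```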

Back to me. It seems that Copilot and GitHub could be used to develop improved AI models, or at least to improve the GitHub codebase in general. This is synergistic, so it feels like a hockey-stick moment. The singularity is near, or maybe it’s already here.

I’ll leave you with a song.
