I read a post on HN about someone who used ChatGPT to invent the rules for a new human language. I thought it would be interesting to see whether ChatGPT could also be used to create a new programming language. I won’t bore you with too much of the dialog, since we’ve all seen impressive exchanges by now. This post summarizes what I learned about working with ChatGPT on a task like this.
Prompt Engineering To Build A Programming Language
In my first attempt, I tried to build a language from the ground up. ChatGPT initially asserted that it was unable to invent languages, so I got around this by telling it that we were writing a story in which it had the ability to invent languages, and by introducing each new concept with “suppose we needed to find a way to …; what are some alternatives that you might suggest?”. This construct allowed it to be generative. However, building a language from the ground up in this way was quite tedious (go figure), and after about an hour ChatGPT appeared to have forgotten some of the rules that we’d decided on.
The next day I made a second attempt. In this session, I gave ChatGPT the same story construct and told it that I wanted to use Kotlin as a starting point. I asked it to write a program illustrating the major features of Kotlin, which we could then modify to define our new language, L++. It happily dumped out a program. I first got rid of a few things from Kotlin (var, for example) and replaced various keywords with symbols. Then I asked it which major features Scala had that Kotlin did not, and we adopted a few into our language. I did the same for Haskell, and borrowed one or two features.
As I proceeded, I set checkpoints by asking it to update the example program with all of the features we’d decided on so far, and told it to (for example) “remember this as L++ 1.1”; I was then able to later tell it things like “let’s try rewriting L++ 1.1 where instead of the return keyword we use the arrow symbol.”
Eventually either the modifications we had made got complex enough or the server was particularly busy, and it kept printing incomplete responses. So, I asked it to summarize all of the language rules we had decided on so far (we’d reached a milestone of 1.6 at this point) in as concise a manner as it could, so that I could feed the summary back into a future session. It did this, and produced the following (which wasn’t 100% complete, but was close enough):
Here is a concise summary of the changes we made to get from Kotlin to L++ 1.6:
We removed the var keyword and made all assignments final
We replaced println with the @ symbol
We combined the for and while loop syntax into a single loop construct
We added dependent types
We replaced the if and else keywords with a ? and _ syntax
We replaced the val and fun keywords with the $ and # symbols
We added algebraic data types and type classes
We replaced the return keyword with the arrow symbol
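A summary like this only helps if it is pasted back in at the start of the next session, together with the story construct. The sketch below shows one way to assemble such an opening prompt in Python. It is only an illustration: the helper name `build_resume_prompt`, the constants, and the exact wording of the prompt are my assumptions, not what I actually typed, and nothing here calls ChatGPT.

```python
# Hypothetical helper for assembling a session-opening prompt.
# The names and the prompt wording are illustrative assumptions.

STORY_FRAME = (
    "We are writing a story in which you are able to invent programming "
    "languages."
)

# The rule summary ChatGPT produced at the end of the second session.
LPP_1_6_RULES = [
    "We removed the var keyword and made all assignments final",
    "We replaced println with the @ symbol",
    "We combined the for and while loop syntax into a single loop construct",
    "We added dependent types",
    "We replaced the if and else keywords with a ? and _ syntax",
    "We replaced the val and fun keywords with the $ and # symbols",
    "We added algebraic data types and type classes",
    "We replaced the return keyword with the arrow symbol",
]


def build_resume_prompt(version: str, rules: list[str]) -> str:
    """Return the text to paste at the start of a fresh session."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        f"{STORY_FRAME}\n\n"
        f"In a previous session you summarized the changes from Kotlin to "
        f"L++ {version} as follows:\n{numbered}\n\n"
        f"Please write an example program that illustrates every rule, and "
        f"remember the result as L++ {version}."
    )


if __name__ == "__main__":
    print(build_resume_prompt("1.6", LPP_1_6_RULES))
```

Numbering the rules makes it easier to refer to a specific one later, in the same spirit as the line-number trick further down.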
I then ended the session. In the third session, I again introduced the story construct, and told it that in a previous session it had produced a summary that I could use to remind it of what we had decided on so far. I asked it to produce an example program from the summary. It did; a few features were missing because the summary wasn’t complete, so the first thing I did was reintroduce those.
For example:
me: let’s remove the return keyword in those examples. the return value is implicit as the last expression in a block
(Granted, this contradicts the arrow symbol rule, but it had interpreted that rule in a different way in any case.)
It also needed prompting to make the example more complete:
me: ok. this is good. let’s remember this as L++ 1.6.1. can you extend the example to illustrate type classes and algebraic data types?
I then moved on to “improving” the syntax:
me: ok, this is good. i don’t like the data, trait, and instance keywords. can we find a symbolic representation of these?
It provided several options with examples, and I responded:
me: %, @ and # are good symbols to use for these. let’s rewrite the most recent example of L++ 1.7.0 alpha to include these changes
It updated the example code with these modifications, although it had a mistake in one section, which I corrected:
me: ok great. one modification to this - in the Printable type class, use ? and _ syntax instead of when
I wanted the comments that it was including in the example program to be better, so I asked it:
me: brilliant. can you rewrite this with comments that explain each symbol that we use?
It then started cutting off the example code, perhaps because it was getting too complex. I hit Try Again a few times, and made a few more attempts to prompt it. Finally, I hit upon this, which worked better:
me: let’s give it one more shot, because your server must be busy. let’s try splitting this into 6 short code blocks
How Far I Got
Here’s what it produced:
Sure, here is an updated version of the program that includes comments
explaining the meaning of each symbol that is used. This version is split
into six code blocks to make it easier to read:
// Declare and initialize variables
// The $ symbol is used to declare variables and assign values to them
$ x = 5
$ y = 7
$ z = 0
$ result = 0

// Define a type class for a "printable" type
// The @ symbol is used to define type classes
@ Printable {
    // Define a method that converts a value to a string
    // The $ symbol is used to declare methods in a type class
    $ toString() -> String
}

// Define an algebraic data type for a binary tree
// The % symbol is used to define algebraic data types
% Tree = Leaf | Node(value: Int, left: Tree, right: Tree)

// Implement the Printable type class for the Tree type
// The # symbol is used to define type class instances
# Printable for Tree {
    // The $ symbol is used to declare methods in a type class instance
    $ toString() -> String {
        // The ? symbol is used to check a condition
        ? this is Leaf {
            // If the condition is true, the expression after the ? symbol is evaluated
            "Leaf"
        }
        // The _ symbol is used to specify the expression to evaluate if the condition is false
        _ {
            is Node(value, left, right) -> "Node(" + value + ", " + left.toString() + ", " + right.toString() + ")"
        }
    }
}

// Define a function to calculate the sum of two numbers
// The $ symbol is used to declare functions
$ sum(a: Int, b: Int) -> Int {
    $ result = a + b
    result
}

// Use the ? operator to check a condition
? x > y {
    // Use the $ symbol to define an anonymous function
    // The $ symbol is used to declare anonymous functions
    $ () -> Unit {
        // The @ symbol is used to print a value
        @ "x is greater than y"
    }
} _ {
    // Use the $ symbol to define an anonymous function
    // The $ symbol is used to declare anonymous functions
    $ () -> Unit {
        // The @ symbol is used to print a value
        @ "x is less than or equal to y"
    }
}
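Since L++ has no compiler, it can help to see roughly what this program corresponds to in a language that runs. The Python below is my own approximate reading, not something ChatGPT produced. It makes some loud assumptions: frozen dataclasses stand in for the algebraic data type, `functools.singledispatch` stands in for the Printable type class and its instance, and `total` and `describe` are names I made up for the `sum` function and the final conditional.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """The Leaf case of: % Tree = Leaf | Node(...)"""


@dataclass(frozen=True)
class Node:
    """The Node case of: % Tree = Leaf | Node(value, left, right)"""
    value: int
    left: "Tree"
    right: "Tree"


# Frozen dataclasses cannot be modified after construction, which loosely
# mirrors the "all assignments are final" rule.
Tree = Union[Leaf, Node]


@singledispatch
def to_string(value: object) -> str:
    """Rough stand-in for: @ Printable { $ toString() -> String }"""
    raise TypeError(f"no Printable instance for {type(value).__name__}")


@to_string.register(Leaf)
def _(value: Leaf) -> str:
    # Corresponds to the "? this is Leaf" branch.
    return "Leaf"


@to_string.register(Node)
def _(value: Node) -> str:
    # Corresponds to the "_ { is Node(value, left, right) -> ... }" branch.
    return f"Node({value.value}, {to_string(value.left)}, {to_string(value.right)})"


def total(a: int, b: int) -> int:
    # The last expression is the result, as in the L++ sum function.
    return a + b


def describe(x: int, y: int) -> str:
    # The ? ... _ ... construct as a single conditional expression.
    return "x is greater than y" if x > y else "x is less than or equal to y"


if __name__ == "__main__":
    print(to_string(Node(5, Leaf(), Leaf())))
    print(total(5, 7))
    print(describe(5, 7))
```

Writing it out this way also shows that the L++ example never uses `z` or the top-level `result`, and that the two anonymous functions in the conditional are declared but never called.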
The Limit
I then asked it to implement two simple programs using the language:
me: ok this is very good. let’s remember this as L++ 1.7.0 final. can you write two example programs using this language? first, show an efficient way to calculate the fibonacci sequence up to 1000. second, find the factorial of each prime number under one million in an efficient way.
It produced results that were reasonable first attempts, but limited; for example, instead of calculating all primes under one million, it hard-coded the primes below 100:
// Declare and initialize variables
$ primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
            61, 67, 71, 73, 79, 83, 89, 97]
I corrected this by:
me: try again without declaring literal values for the primes
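For comparison, one conventional way to derive the primes instead of listing them is a sieve of Eratosthenes. The sketch below is plain Python rather than L++, and the function name `primes_below` is my own. It mutates a working buffer internally, so it makes no attempt to follow the L++ rule that assignments are final.

```python
from math import isqrt


def primes_below(limit: int) -> list[int]:
    """Return every prime p with p < limit, using a sieve of Eratosthenes."""
    if limit < 2:
        return []
    # One flag per integer in [0, limit); 1 means "still possibly prime".
    flags = bytearray([1]) * limit
    flags[0] = flags[1] = 0
    for n in range(2, isqrt(limit - 1) + 1):
        if flags[n]:
            # Cross out n*n, n*n + n, ... in a single slice assignment.
            flags[n * n :: n] = bytearray(len(range(n * n, limit, n)))
    return [n for n, flag in enumerate(flags) if flag]


if __name__ == "__main__":
    print(primes_below(100))
```

With a limit of 100 this reproduces the 25 values that ChatGPT hard-coded; passing one million gives the full list the prompt asked for.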
However, there was another issue which I wasn’t able to fix; it “forgot” that assignments were supposed to be final, and so did things like:
$ () -> Unit {
    $ isPrime = true
    $ j = 2
    loop(j < i) {
        $ () -> Unit {
            ? i % j == 0 {
                $ isPrime = false
            }
        }
        $ j += 1
    }
    ? isPrime {
        primes.add(i)
    }
}
I attempted to correct this with:
me: this is almost right. don’t forget that variables are immutable, so you can’t reassign them
It responded:
You're right, I forgot that variables in L++ 1.7.0 final are immutable and cannot
be reassigned. Here is an updated version of the program that uses the L++ 1.7.0
final syntax to find the factorial of each prime number under one million without
declaring literal values for the primes and without reassigning variables:
...
[but then did the exact same thing]
So then I asked:
me: can you rewrite with line numbers to make it easier for me to ask you about specific lines?
…which it did, and I then followed up with:
me: ok. it looks like on line 18 you are reassigning the value of isPrime, which was declared on line 11. if i’m right, please rewrite this so that variables are immutable references once assigned
It replied:
You're right, I accidentally used the same variable name isPrime on line 11 and
line 18, which would cause a reassignment error since variables in L++ 1.7.0 final
are immutable and cannot be reassigned. Here is an updated version of the program
that uses the L++ 1.7.0 final syntax to find the factorial of each prime number
under one million without declaring literal values for the primes and without
reassigning variables, and with line numbers added to make it easier to reference
specific lines:
...
[and then the exact same thing]
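For reference, the kind of rewrite I was asking for does exist: the mutable isPrime flag and the j counter can be replaced by a single expression, and running values can be passed as arguments instead of being updated in place. Here is a minimal sketch in Python rather than L++ (which has no implementation to run it on). The function names are my own, and the factorial step uses a small limit as an assumption, because factorials grow far too quickly to print for every prime under one million.

```python
from math import factorial, isqrt


def is_prime(i: int) -> bool:
    # Replaces the isPrime flag and the j counter from the L++ attempt:
    # the answer is one expression, so no name is ever assigned twice.
    return i >= 2 and all(i % j != 0 for j in range(2, isqrt(i) + 1))


def primes_under(limit: int) -> tuple[int, ...]:
    # Builds an immutable tuple instead of calling primes.add(i) on a list.
    return tuple(i for i in range(2, limit) if is_prime(i))


def fib_up_to(limit: int, a: int = 0, b: int = 1) -> tuple[int, ...]:
    # Recursion carries the two running values as arguments rather than
    # updating them in place.
    return () if a > limit else (a,) + fib_up_to(limit, b, a + b)


if __name__ == "__main__":
    print(fib_up_to(1000))
    small_limit = 20  # assumption: kept small so the factorials stay readable
    print({p: factorial(p) for p in primes_under(small_limit)})
```

Trial division like this is much slower than a sieve for a limit of one million; the point here is only that each name is bound exactly once.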
So, ChatGPT and I are at an impasse. It’s interesting that it appears to understand the problem but is incapable of rewriting the code without the bug. If you’ve read this far, I highly recommend trying a similar exercise with it; I found it a useful way to learn. My language modifications were fairly ad hoc; if you approach this in a more deliberate way, you might get better results out of it. And if you run into a similar impasse and find your way around it with ChatGPT, please post about it on HN.