
A Sudoku-Solving Neural Network
1. Introduction
When I set out to start my research and development assignment, I wanted to find something that would blend my world of interests with the game development scene. It has been a long time since I did anything related to game development at all, so diving headfirst into research about shaders did not seem like the greatest idea. My interest lies in data and the cool things you can do with it, such as machine learning. From there, I figured I could train a neural network to solve sudokus.
Of course, there is absolutely no real-world need for this, but it seemed like a fun challenge regardless. Sudokus are a fun, unique mathematical puzzle in the sense that every proper sudoku has exactly one solution, while there are almost infinitely many sudokus to solve. Or well, infinite to our little brains… there are actually 6,670,903,752,021,072,936,960 possible solvable sudoku grids according to Britannica [1] (that's 6 sextillion, 670 quintillion, 903 quadrillion, 752 trillion, 21 billion, 72 million, 936 thousand, 960, in case you were wondering).
Originally I intended to research the differences between neural networks built with different Python libraries, but over time this research direction changed somewhat. I first changed course (with approval of the lecturers) to using computer vision to 'read' a sudoku and then solve that puzzle with my neural network. Due to unforeseen circumstances this was not possible, so I ended up comparing my neural network to the original backtracking solution to the problem. What came out of this research is displayed in this blog post, which does not include my steps towards other libraries or computer vision.
1.1 Classification or regression
The research question brought me to an interesting debate quite early on in my research. In machine learning (especially supervised learning), you tend to think in two types of problem spaces. Either your problem is a regression problem, where you try to find the relationship between different variables to predict a continuous value, or it is a classification problem, where you try to predict whether data belongs to class A, B, C et cetera. I had some discussions about this with Daniel and my guild members, but we kept going back and forth. Eventually the distinction mattered less than I expected: with a neural network the choice mostly comes down to the output layer and the loss function, and as you will see in chapter 3, my model ends up classifying every cell as one of the digits 1 through 9.
1.2 Python libraries
Having picked my research subject, I looked into multiple different neural network libraries for Python. The three main ones you come across are Keras, TensorFlow and PyTorch.

According to Edureka [2], Keras is a high-level API capable of running on top of TensorFlow. It has gained favor for its ease of use and syntactic simplicity, facilitating fast development. TensorFlow is a framework that provides both high- and low-level APIs, while PyTorch is a lower-level API. In layman's terms, this means you need a lot more knowledge to build a neural network in PyTorch than you need for a Keras neural network. As a downside, Keras tends to be slower than TensorFlow and PyTorch, and it struggles with large datasets more than a PyTorch neural network does. This makes Keras less suitable for production environments, but it is a perfect fit for this research project.
So, with all this information, I got started on the actual development part of Research & Development.
2. Data
2.1 Dataset
I first started looking around on Kaggle to find a dataset of sudoku puzzles that I could use to train my neural network. Thankfully, there are lots of them to choose from, so I picked the biggest one I could find, which ended up being a dataset of 9 million sudoku puzzles and their solutions, uploaded by Vopani on Kaggle [3]. Each puzzle and its solution are stored as a string of 81 digits, with zeros marking the empty squares.
2.2 Data transformation
Then I started looking around to see if other people had attempted this before, since I knew I would have to somehow transform this string of numbers into something that could actually be read as a big grid of numbers instead. I found a lot of articles, but the one that helped me get started was an article and GitHub repository by S. Verma [4] on solving sudokus. His repository contains a file called data_preprocess.py in which he transforms the original data into a 9×9 grid. He also performs zero-mean normalisation, which according to Scikit-learn [5] is a common requirement for machine learning algorithms, to prevent them from straying towards outliers too much. This will come up in the next chapter as well, since we need to normalise the data between the model layers too. Once the data has gone through the preprocessing code I used from S. Verma [4], it has been split into a train and a test set to get started with.
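Condensed into a few lines, that preprocessing looks roughly like the sketch below (the column names 'puzzle' and 'solution' and the exact normalisation constants are my assumptions here; Verma's data_preprocess.py is the authoritative version):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(csv_path, n_rows=1_000_000):
    # Each row holds two 81-character digit strings: the puzzle (zeros for open squares) and its solution
    df = pd.read_csv(csv_path, nrows=n_rows)
    x = np.array([list(map(int, s)) for s in df['puzzle']]).reshape(-1, 9, 9, 1)
    y = np.array([list(map(int, s)) for s in df['solution']]).reshape(-1, 81) - 1
    # Zero-mean normalisation of the inputs: the digits 0-9 end up roughly in [-0.5, 0.5]
    x = x / 9 - 0.5
    # The solutions stay as per-cell class labels 0-8, matching the loss function used in chapter 3
    return train_test_split(x, y, test_size=0.2)

x_train, x_test, y_train, y_test = load_data('sudoku.csv')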

3. Convolutional Neural Network
3.1 Model layers
Now that the dataset has been processed and split into a train and test set, I could get started on the neural network model. A Keras neural network model is built up out of layers. The model I created is a convolutional neural network (CNN), a type of artificial neural network (ANN) that is usually used for visual imagery but works great in other fields too. The model consists of multiple convolutional layers, with normalisation layers in between, as mentioned in the previous chapter. After the convolutional layers, the output needs to be reshaped again so that every one of the 81 cells gets its own prediction, matching how the labels look after the data preprocessing step. This is because we want to test our model, and our test set has not been changed since it does not go through the CNN.
import keras
from keras.layers import Activation
from keras.layers import Conv2D, BatchNormalization, Dense, Flatten, Reshape


def get_model():
    """
    Build the layers of the Keras model

    Parameters:
        None

    Returns:
        model (Sequential): Sequential CNN Keras model
    """
    model = keras.models.Sequential()
    # Convolutional layers with batch normalisation in between
    model.add(Conv2D(64, kernel_size=(3,3), activation='relu', padding='same', input_shape=(9,9,1)))
    model.add(BatchNormalization())
    model.add(Conv2D(64, kernel_size=(3,3), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(Conv2D(128, kernel_size=(1,1), activation='relu', padding='same'))
    # One output per (cell, digit) pair: 81 cells x 9 possible digits
    model.add(Flatten())
    model.add(Dense(81*9))
    # Reshape to (81, 9) so the softmax gives a probability distribution per cell
    model.add(Reshape((-1, 9)))
    model.add(Activation('softmax'))
    return model
3.2 Data troubles
While figuring out the layers of the model is an important step, it doesn't accomplish much when the dataset isn't what you want it to be. That's where I ran into quite a bit of trouble. When I used the original dataset to train the model (code seen below), all seemed well to begin with. It had a decent accuracy of around 90%, which isn't bad since I had not gotten into any sort of tuning yet. However, I wanted to test the model with some real-world examples, so I bought some different sudoku puzzle books at the grocery store. They ranged from difficulty level 1 to level 10, and it turned out that the now trained model could not accurately solve a single sudoku of level 3 or higher.
from keras.optimizers import Nadam

import keras_model  # the module containing get_model() from above


# Function to set up the Keras model
def setup_new_model(x_train, y_train, x_test, y_test, learning_rate, epochs, batch_size, path):
    """
    Setup, compile and train model for further use (if a previously trained model is unavailable).

    Parameters:
        x_train (list): list of train puzzles
        y_train (list): list of train solutions
        x_test (list): list of test puzzles
        y_test (list): list of test solutions
        learning_rate (float): learning rate for the Nadam optimizer
        epochs (int): number of epochs to be used in training
        batch_size (int): batch size to be used in training
        path (str): path to save the trained model to

    Returns:
        model (Machine Learning model): The trained model to be used further
    """
    # Get the model
    model = keras_model.get_model()
    # Define the optimizer
    optimizer = Nadam(learning_rate=learning_rate)
    # Compile and fit the model
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=epochs, batch_size=batch_size)
    # Save the model
    model.save(path)
    return model
So, I started testing what was wrong with the dataset and how I could change it. Like I said before, there are 9 million sudokus in the dataset I used, so there's plenty of room for errors or changes if I wanted to. The dataset itself seemed fine, although highly randomised. With the help of one of my guild members (and particularly his laptop), I was able to determine that the average number of 'open' tiles, represented by zeros in the dataset, was around 45, with a minimum of 4 and a maximum of 58. I figured: maybe if I created a more consistent dataset, I could improve the accuracy.
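For reference, that count comes down to something very simple (the column name is an assumption on my part):

import pandas as pd

df = pd.read_csv('sudoku.csv')
open_tiles = df['puzzle'].str.count('0')  # zeros mark the 'open' squares
print(open_tiles.mean(), open_tiles.min(), open_tiles.max())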
3.3 Data restructuring
I wrote some code that takes a section of the dataset, grabs the solutions to the puzzles, and then removes n numbers in random locations by changing them to zeros. That way, I could rewrite the full dataset to consistently have only 20 'open' squares, for example. I did this for 20, 25, 30 … all the way up to 60 open squares. My logic: if I trained with a dataset that has 50 'open' squares, then surely the model would be able to solve the higher difficulty booklets I bought? Wrong.
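Roughly, that generation code looked like the sketch below (a simplified version, not my exact script): take each solution string, pick n random positions and set them to zero.

import numpy as np

def make_puzzles(solutions, n_open, seed=42):
    """Blank out n_open random cells in each 81-character solution string."""
    rng = np.random.default_rng(seed)
    puzzles = []
    for solution in solutions:
        cells = np.array(list(solution))
        blanks = rng.choice(81, size=n_open, replace=False)  # positions to 'open' up
        cells[blanks] = '0'
        puzzles.append(''.join(cells))
    return puzzles

# One dataset per difficulty step: 20, 25, ... all the way up to 60 open squares
# (solution_strings are the 81-character solutions from the original dataset)
datasets = {n: make_puzzles(solution_strings, n) for n in range(20, 61, 5)}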
I tried training with all the separate datasets; I even constructed a dataset containing an equal share of each new dataset of increasing difficulty. Nothing worked. The model kept solving easier sudokus quite well and quite fast, but never above a certain difficulty level. I ended up writing code just so I could check the failures more easily, since verifying every newly (and incorrectly) solved sudoku by hand got tiresome. Eventually I was convinced I was doing something wrong, so I downloaded the trained model from the same person I got the data preprocessing code from [4]. Even their model, which performed pretty well when tested randomly, was not able to overcome that 'level 3 hump' from my sudoku booklets.
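The failure check I mentioned boils down to something like this (a sketch, assuming the solved puzzle comes in as a 9×9 grid of digits 1 to 9):

import numpy as np

def is_valid_solution(grid):
    """Return True if every row, column and 3x3 box contains the digits 1-9 exactly once."""
    grid = np.asarray(grid)
    target = set(range(1, 10))
    for i in range(9):
        if set(grid[i, :]) != target or set(grid[:, i]) != target:
            return False
    for r in range(0, 9, 3):
        for c in range(0, 9, 3):
            if set(grid[r:r+3, c:c+3].flatten()) != target:
                return False
    return True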
Dataset | Accuracy |
---|---|
Sudokus with 20 open squares | 98.4% |
Sudokus with 25 open squares | 96.5% |
Sudokus with 30 open squares | 93.5% |
Sudokus with 35 open squares | 89.7% |
Sudokus with 40 open squares | 83.6% |
Sudokus with 45 open squares | 76.8% |
Sudokus with 50 open squares | 69.1% |
Sudokus with 55 open squares | 60.7% |
Sudokus with 60 open squares | 52.0% |
Eventually I gave up on my mission to solve the higher difficulty booklets and datasets. As you can see, the accuracy pretty much plummets as the number of open squares grows. I think some sort of heuristics/neural network combination is necessary to solve these, although I still find it hard to grasp how neural networks can be trained to differentiate cancer cells, yet are somehow unable to solve sudokus.
3.4 Model optimisation
I then turned to the model I did have, which had great accuracy on the lower difficulty sudokus, and looked into optimising it. During this I was able to conclude that a batch size of 128 was optimal, since it significantly increases training speed while not giving up too much accuracy. This is also backed by research from Weights and Biases [6].
Lastly I started looking into the different available optimizers for my Keras CNN. Up until now, I just used an Adam optimizer since that tends to be the go-to when looking at any Keras neural network. There are others however, as listed on Keras [7]: SGD, RMSprop, Adadelta, Adagrad, Adamax, Ftrl, Nadam, and Adam. According to Analytics India Magazine [8], optimizers are classes or methods used to change the attributes of your machine/deep learning model such as weights and learning rate in order to reduce the losses. Optimizers help to get results faster. Most of these optimizers are newer iterations of each other, so most performed pretty similarly.
Eventually I ended up settling on the Nadam optimizer, since it performed slightly better than the others. I used a learning rate of 0.001 and trained for 2 epochs with a batch size of 128.
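Putting it all together, training the final model came down to a single call along these lines (the save path is just an example):

model = setup_new_model(x_train, y_train, x_test, y_test,
                        learning_rate=0.001, epochs=2, batch_size=128,
                        path='models/sudoku_cnn.h5')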

4. The original solution
Using a backtracking algorithm is the original solution for solving sudokus with computers, so now that my plan for an amazing neural network had turned a little sour, I wanted to see how the network compares to this original solution. For this I used a backtracking algorithm from a Medium post by A. Jhawar [9]; backtracking has been around for ages, so there was no need to reinvent the wheel and code it myself.
4.1 Backtracking
According to Programiz [10], a backtracking algorithm is a problem-solving algorithm that uses a brute-force approach to find the desired output. The brute-force approach tries out all the possible solutions and chooses the desired/best ones. The term backtracking suggests that if the current solution is not suitable, you backtrack and try other solutions. Thus, recursion is used in this approach, as the generic pseudocode below illustrates.
def backtrack(x):
    if x is not a solution:
        return False
    if x is a new solution:
        add x to the list of solutions
    backtrack(expand x)
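Applied to sudoku, the solver from [9] boils down to something like the sketch below (my own simplified version, not his exact code): find the next empty cell, try every digit that does not clash with its row, column or 3×3 box, and backtrack when you get stuck.

def is_safe(grid, row, col, digit):
    # The digit may not already appear in the row, the column or the 3x3 box
    if digit in grid[row] or digit in (grid[r][col] for r in range(9)):
        return False
    br, bc = 3 * (row // 3), 3 * (col // 3)
    return all(grid[r][c] != digit for r in range(br, br + 3) for c in range(bc, bc + 3))

def solve(grid):
    # Find the next open cell (marked with 0); if none are left, the grid is solved
    for row in range(9):
        for col in range(9):
            if grid[row][col] == 0:
                for digit in range(1, 10):
                    if is_safe(grid, row, col, digit):
                        grid[row][col] = digit
                        if solve(grid):
                            return True
                        grid[row][col] = 0  # dead end: undo and try the next digit
                return False
    return True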
Using this algorithm you can solve any sudoku, as long as it has a solution, so the accuracy is always 100%. However, the speed suffers severely compared to the CNN: when solving more sudokus and/or more difficult sudokus, the backtracking algorithm is significantly slower than the neural network. You can clearly see that the more open squares there are in a sudoku, the more time the backtracking algorithm takes to solve it.
Dataset | 10,000 sudokus | 100,000 sudokus | 1,000,000 sudokus |
---|---|---|---|
Sudokus with 20 open squares | 0.7s | 6s | 60s |
Sudokus with 25 open squares | 3s | 30s | 305s |
Sudokus with 30 open squares | 4s | 39s | 399s |
Sudokus with 35 open squares | 5s | 53s | 521s |
Sudokus with 40 open squares | 8s | 85s | 857s |
Sudokus with 45 open squares | 10s | 98s | 1002s |
Sudokus with 50 open squares | 11s | 119s | 1186s |
Sudokus with 55 open squares | 13s | 134s | 1348s |
Sudokus with 60 open squares | 15s | 147s | 1501s |
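The numbers above are simple wall-clock measurements over each dataset, along the lines of this sketch (loading of the puzzles omitted):

import time

start = time.perf_counter()
for puzzle in puzzles:   # puzzles: list of 9x9 grids, zeros for the open squares
    solve(puzzle)        # the backtracking solver sketched above
elapsed = time.perf_counter() - start
print(f"Solved {len(puzzles)} sudokus in {elapsed:.1f}s")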
5. Conclusion
I think the main conclusion I can draw for myself is that while neural networks might be an interesting 'hot item' in the world of IT, as a programmer you should still look towards known solutions to problems: sometimes the hot new thing isn't actually the better solution. This is especially important to remember when dealing with non-IT personnel who are hyped up by buzzwords like neural networks, artificial intelligence, cloud solutions, et cetera. While I know this from experience, it is still good to be reminded of it.
Keras is able to perform fast and accurately on lower difficulty sudokus, and setting up the code for it isn't all too difficult. However, the accuracy plummets on more difficult sudokus, to the point where it is essentially a coin flip whether the solution is correct or not. While the speed remains the same, you're obviously not looking for coin-flip odds in a neural network's accuracy.
The original solution for solving sudokus with a computer using a backtracking algorithm is highly accurate, since it literally keeps trying until it is correct. However, this takes significantly more time when solving more difficult sudokus or solving more of them. While this isn’t optimal, I would assume accuracy is more important than speed in this problem space.
So while my grand plans didn't fully come to fruition, specifically using computer vision to solve sudokus, I was reminded of a valuable lesson: don't immediately fall into the trap of buzzwords when considering an IT problem.
Sources
[1] “Will we ever run out of Sudoku puzzles?,” Encyclopædia Britannica. [Online]. Available: https://www.britannica.com/story/will-we-ever-run-out-of-sudoku-puzzles.
[2] “Keras vs tensorflow VS pytorch: Deep Learning Frameworks,” Edureka, 16-Dec-2021. [Online]. Available: https://www.edureka.co/blog/keras-vs-tensorflow-vs-pytorch/.
[3] Vopani, “9 million Sudoku puzzles and solutions,” Kaggle, 14-Nov-2019. [Online]. Available: https://www.kaggle.com/datasets/rohanrao/sudoku.
[4] S. Verma, “Solving sudoku with convolution neural network | keras,” Towards Data Science, 17-Oct-2019. [Online]. Available: https://towardsdatascience.com/solving-sudoku-with-convolution-neural-network-keras-655ba4be3b11.
[5] “6.3. Preprocessing Data,” Scikit-learn. [Online]. Available: https://scikit-learn.org/stable/modules/preprocessing.html.
[6] A. Thakur, “What’s the optimal batch size to train a neural network?,” W&B, 19-Aug-2020. [Online]. Available: https://wandb.ai/ayush-thakur/dl-question-bank/reports/What-s-the-Optimal-Batch-Size-to-Train-a-Neural-Network--VmlldzoyMDkyNDU.
[7] “Keras documentation: Optimizers,” Keras. [Online]. Available: https://keras.io/api/optimizers/.
[8] M. Maithani, “Guide to tensorflow keras optimizers,” Analytics India Magazine, 27-Jan-2021. [Online]. Available: https://analyticsindiamag.com/guide-to-tensorflow-keras-optimizers/.
[9] A. Jhawar, “Sudoku solver using opencv and DL - part 2,” Medium, 27-Sep-2020. [Online]. Available: https://aakashjhawar.medium.com/sudoku-solver-using-opencv-and-dl-part-2-bbe0e6ac87c5.
[10] “Backtracking algorithm,” Programiz. [Online]. Available: https://www.programiz.com/dsa/backtracking-algorithm.