Handwritten digit classification using Raspberry Pi Pico and Machine Learning


Video of the project in action.

💡Link to project’s repository : https://github.com/code2k13/rpipico_digit_classification

Table of Contents


Our Goal

Required Hardware

Required Software


Understanding OV7670’s image format

Postprocessing camera images

Training ML model

Exporting ML model to Pico friendly format

Guidance on setting up the project




In this article, I’ll share how I used a Raspberry Pi Pico, an OV7670 camera module, a 120x160 TFT LCD display and machine learning to build a handwritten digit classification system. This code is highly experimental. Even after you follow all the recommended steps in this article, some tinkering will still be necessary to get it to work. So buckle up and let’s start exploring!

Our Goal

We intend to run a machine learning model on our Raspberry Pi Pico that analyzes photos received from a camera and tries to infer what digit was present in the image. An image of the project in action is shown below:

 Predicted value displayed on LCD

It is important to note that our machine learning model runs entirely on the Raspberry Pi Pico; neither a connected computer nor the cloud is required. Creating a compact machine learning model that fits in the Pi Pico’s RAM and is still accurate enough for our work is therefore the major challenge we are aiming to overcome. This is no easy task, and I will discuss it in more detail later in this article. We also want to show our results to the user, so we use a 120x160 TFT LCD display for the output.

Lastly, everything needs to be done using CircuitPython. In my personal opinion, CircuitPython is easy and fun to work with. Its biggest advantage is that your code can work on the wide variety of boards (300+) that support CircuitPython.

Required Hardware

  • Raspberry Pi Pico - This project has only been tested on the Raspberry Pi Pico. Hopefully, it won’t require any code modifications to run on the Raspberry Pi Pico W. As the code is written in CircuitPython, it is expected to work on most boards that support CircuitPython. Almost 80% of the GPIOs on the Pi Pico are used by the project, so make sure you have enough GPIOs on your board. You will need to change the code for the project to work with any other boards.

  • 120x160 TFT LCD - similar to this one. Again, other options might be workable but may call for code changes. Theoretically, an LCD might not even be necessary; data could be written to a serial console instead. But practically speaking, this project might be difficult to implement without an LCD. The placement of the camera and the subject has a significant impact on the output. With an LCD, it is easy to align your handwritten numbers with the camera, as you can constantly see what your camera is seeing.

  • OV7670 Camera Module - I purchased this for Rs. 155 (approx. USD 2), can you believe it! Much cheaper than some color sensors!

  • Full sized breadboard (highly recommended)

  • Jumper Cables - Maybe 20 each of M-F and M-M. There are lots of connections to be made!

Required Software

Any text editor for editing the code, if you plan to make any changes. A full Python distribution and pip for training and exporting the machine learning model. And of course, lots of patience.


There is a LOT of wiring that is required. Using breadboards and jumper cables is highly recommended.
The table below shows the connections for the LCD:

Display Pin Number    Display Pin Name    Pi Pico Pin
2                     VCC                 5V
10                    CS                  GP18
7                     A0                  GP16
8                     SDA                 GP11
9                     SCK                 GP10
15                    LED                 3.3V

OV7670 module
The table below shows the connections between the OV7670 and the Pi Pico:

OV7670 Pin Name    Pi Pico Pin Name
D0                 GP0
D1                 GP1
D2                 GP2
D3                 GP3
D4                 GP4
D5                 GP5
D6                 GP6
D7                 GP7
SCL                GP21 (via 4.7k external pull-up resistor)
SDA                GP20 (via 4.7k external pull-up resistor)

Understanding OV7670’s image format

Our camera can take images at various resolutions and in different formats. For the sake of this example, we use the 60x80 resolution and the RGB565_SWAPPED format.

cam.size =  cam_size

It is important to understand how every pixel is encoded. Each pixel is a 16-bit integer stored in a format called RGB565_SWAPPED. This means that:

  • Red component is captured using 5 bits
  • Green component is captured using 6 bits
  • Blue component is captured using 5 bits

However, the two bytes of each pixel arrive swapped, so first we need to ‘unswap’ the pixel by doing:

pixel_val = ((pixel_val & 0x00FF)<<8) | ((pixel_val & 0xFF00) >> 8)

Then we find individual components (red, blue and green) using the below code

r = (pixel_val & 0xF800)>>11
g = (pixel_val & 0x7E0)>>5
b = pixel_val & 0x1F

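Once that is done, we have to convert the pixel value to grayscale, because that is what our model expects. There are multiple ways to do this, but simply adding the three components works for us. Their largest possible sum is 31 + 63 + 31 = 125, so dividing by 128 scales the result to a value between 0 and 1.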

return (r+g+b)/128
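
To see the whole conversion in one place, here is a minimal sketch that combines the unswap, the bit masks and the grayscale scaling described above. The function name rgb565_swapped_to_gray is made up for this illustration and is not taken from the project’s code.

```python
def rgb565_swapped_to_gray(pixel_val):
    """Convert one RGB565_SWAPPED pixel to a grayscale value in [0, 1).

    Hypothetical helper for illustration only.
    """
    # Undo the byte swap so the bits are laid out as RRRRRGGG GGGBBBBB.
    pixel_val = ((pixel_val & 0x00FF) << 8) | ((pixel_val & 0xFF00) >> 8)
    r = (pixel_val & 0xF800) >> 11  # 5 bits, 0..31
    g = (pixel_val & 0x07E0) >> 5   # 6 bits, 0..63
    b = pixel_val & 0x001F          # 5 bits, 0..31
    # The largest possible sum is 31 + 63 + 31 = 125, so dividing by 128
    # keeps the result just below 1.0.
    return (r + g + b) / 128


print(rgb565_swapped_to_gray(0xFFFF))  # brightest possible pixel
print(rgb565_swapped_to_gray(0x0000))  # darkest possible pixel
```

Note that green contributes up to 63 while red and blue contribute up to 31 each, so this is a green-weighted sum and not a strict average.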

Postprocessing camera images

Let’s spend a moment figuring out what processing we could require. The camera captures 60x80 resolution RGB photos. This image can be made into a 60x80 grayscale using the code above. However, our model only accepts grayscale images that are 12x12. Therefore, we must first trim the image to 60x60 before resizing it to 12x12.

temp_bmp = displayio.Bitmap(cam_height, cam_height, 65536)
for i in range(0,cam_height):
    for j in range(0,cam_height):
        temp_bmp[i,j] = camera_image[i,j]
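
The crop-then-resize idea can be sketched in plain Python as follows. This is only an illustration under my own assumptions: it works on a nested list of grayscale values (not a displayio.Bitmap), it shrinks by averaging 5x5 blocks, and the name crop_and_shrink is hypothetical. The project’s code may resize the image differently.

```python
def crop_and_shrink(image, crop=60, out=12):
    """Crop the top-left crop x crop square and box-average it down to out x out.

    image is a list of rows, each row a list of grayscale values.
    """
    block = crop // out  # 60 // 12 = 5 source pixels per output pixel, per axis
    small = []
    for oy in range(out):
        row = []
        for ox in range(out):
            total = 0
            for dy in range(block):
                for dx in range(block):
                    total += image[oy * block + dy][ox * block + dx]
            row.append(total / (block * block))
        small.append(row)
    return small


# Usage: a synthetic frame with 60 rows and 80 columns of constant gray
frame = [[0.5] * 80 for _ in range(60)]
tiny = crop_and_shrink(frame)
print(len(tiny), len(tiny[0]))  # 12 12
```

Averaging blocks smooths out single-pixel noise, at the cost of a few more additions per output pixel than simply picking one pixel per block.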

After extensive testing, I also discovered that photos can contain considerable noise for a number of reasons. This little thresholding function does a great job of improving our prediction:

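input_data = []
for i in range(0,12):
    for j in range(0,12):
        x = 1 - rgb565_to_1bit(inference_image[i,j])
        # Zero out weak values so that only strong strokes remain
        if x < 0.5:
            x = 0
        # Collect the pixel as one input value for the model
        input_data.append(x)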

The images below show the effect: the first one is without the thresholding code and the second one is with it. This is what the post-processed image sent to the ML model looks like:

Post processed image of digit 5

Training ML model

You will most likely encounter the terms “neural networks” and “convolutional neural networks” if you read research on image classification using machine learning, and rightly so. When it comes to using machine learning to process images, convolutional neural networks are the gold standard. They do an excellent job with pictures. However, the issue with neural networks in general is that they are complex, need serious number-crunching hardware, and require a respectable amount of memory to operate. It can be quite difficult to adapt a neural network to operate on a tiny microcontroller like the RP2040. For some leading edge work in this area, you should probably check out TFLite Micro.

There are other machine learning (ML) methods that, while giving less accuracy and reliability, are simple enough to run on microcontrollers. One of them is the support vector machine (SVM). It is a supervised learning method that performs well with high dimensional data. We will use the “LinearSVC” implementation of this technique available in the scikit-learn Python package.

Given that this is a supervised learning problem, training data is required. Here, we’ll just train our model using the MNIST handwritten digit dataset that comes with scikit-learn. The model is trained using the code below (which needs to be run on a computer with a normal Python distribution):

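import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# image_ds (the input images) and data (the dataset with labels) are prepared earlier in the notebook
img_data = np.asarray(image_ds)
flattened_data = img_data.reshape((len(image_ds), -1))
clf = svm.LinearSVC()
X_train, X_test, y_train, y_test = train_test_split(
    flattened_data, data.target, test_size=0.4, shuffle=False
)
clf.fit(X_train, y_train)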

I would strongly advise training and exporting the model using the Python notebook from this project’s repository (explained in next section). Here is the link to the notebook on Kaggle: https://www.kaggle.com/finalepoch/handwritten-digit-classification-on-pi-pico

Exporting ML model to Pico friendly format

The model we developed in the previous section can only be used with the scikit-learn package and a standard Python installation; it cannot be run on our Pi Pico using CircuitPython. Therefore, we must export the model in a format that the Pi Pico can understand. We also want to make our model as compact as possible, because the Pi Pico has very little RAM. For this reason I decided to use an input image size of 12x12 pixels, which I found worked best with my setup.

Fortunately, there is an open source tool called m2cgen that can translate our trained scikit-learn model into “pure Python” code which CircuitPython can run. Depending on the model architecture and hyper-parameters, this tool can produce rather large files. To reduce the size of the Python code, we use another tool called python-minimizer.

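We can generate a minimized Python file containing our model using the following steps:

# In a shell:
pip install m2cgen
pip install python-minimizer
# In Python, with the trained classifier clf:
import m2cgen as m2c
code = m2c.export_to_python(clf)
with open('svm.py', 'w') as f:
    f.write(code)
# Back in the shell:
python-minimizer svm.py -o svm_min.py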
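
To give an idea of how such an exported model can be used on the board, here is a small sketch. It assumes the generated file exposes a function named score that takes the flattened 144-value input and returns one score per digit; please check the generated svm.py to confirm this. The score function below is only a stand-in with made-up weights, and predict is a hypothetical helper.

```python
def score(input):
    # Stand-in for the generated model: returns one score per class.
    # A real exported file would contain the learned weights here.
    return [sum(input) * w for w in (0.1, -0.2, 0.3)]


def predict(input_data, score_fn=score):
    """Return the index of the class with the highest score (argmax)."""
    scores = score_fn(input_data)
    best = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best]:
            best = i
    return best


print(predict([1.0, 0.5]))  # index of the highest-scoring class
```

With a real digit model there would be ten scores, and the returned index would be the predicted digit.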

Guidance on setting up the project

For guidance on running the code and software dependencies refer to project’s repository.

Link to project’s repository : https://github.com/code2k13/rpipico_digit_classification

An image of the setup I use for my experiments is shown below. Make sure the subject and the camera are at the same height. The LCD display aids in lining up the camera and the handwritten digit. For the approach to function, it is crucial to have good lighting, perfect alignment, and stability.


  • Check if your LCD is displaying output from camera properly. The camera module has a plastic lens cover, be sure to remove it.

  • Check if your camera is focussed properly. The OV7670 needs to be focussed manually by gently rotating the lens assembly.

  • Uncomment the line below ‘#Uncomment these lines for debugging’ in code.py; this makes the board output the image matrix being sent to the ML model. You can simply copy the array and use the ‘Testing our model’ section of the Kaggle notebook to visualize the image and check the prediction for it using a standard Python distribution. Play around with the thresholding code if necessary.
     Copy output from serial console to the python notebook as shown and check the prediction

  • The model is very sensitive to the size of the digits, so they need to be the right size. Make sure that your handwritten digits fill the whole display frame on the LCD. On the LCD you should see the full digit in black on a pure white background.

  • My experiments suggest that using the digit ‘8’ to calibrate settings and alignment gives great results.

  • Ensure you have proper lighting.


There are several experiments one could do even with a simple project like this. Some of the ideas I plan to work on in the future:

  • Training the model using data obtained from the camera.
  • Better memory management in the code, thus enabling larger image sizes.
  • Training the model on some other fun dataset (shapes, signs, patterns).
  • Converting the whole thing to ‘C/C++’ (using the Pi Pico SDK and TFLite Micro).

Text Generation on Raspi Pico using Markov Chains

Random dinosaur names generated by Raspi Pico displayed on OLED screen

Here is the link to project’s Github repo: https://github.com/code2k13/text_gen_circuitpython

My experiments with TFlite for microcontrollers

I enjoy tinkering with microcontrollers and other programmable gadgets. I’ve been paying great attention to TinyML and TFlite lately. I’ve always wanted to port some of my machine learning models to microcontroller devices. I recently got a Raspi Pico and began experimenting with it. I wanted to play around with text generation models on it. I intended to write a lightweight character-based RNN and run it on my Raspberry Pi Pico to show some randomly generated text on a 0.96-inch OLED display. When it came to putting this strategy into action, I ran into a slew of issues.

To begin, TFLite for microcontrollers is written in C++. To run it on a Pi Pico, you’ll need a full-fledged C++ programming environment as well as a properly configured Pi Pico SDK. When building several of the TFLite examples for Pi Pico, I ran into a number of problems using CMake. When I believed I had properly built the code, I copied it to my Pi Pico, but nothing happened. Debugging is a difficult task that I have not done and do not presently have the tools to perform.

CircuitPython, my friend !

After having a lot of trouble with C++, I switched to CircuitPython. The issue, however, is that CircuitPython currently lacks TFLite bindings. There is an open source project in this direction, https://github.com/mocleiri/tensorflow-micropython-examples, but it does not support the Pi Pico. I decided to look for alternative text generation methods while sticking with CircuitPython. This is when I thought of ‘Markov Chains’. Simply described, a Markov Chain is a state transition map: for each current state, it lists the possible next states along with their probabilities.

In fact, there is a Python module called markovify that can generate a Markov Chain from text and use it to generate new text. Given below are some sample sentences generated using ‘markovify’ from the text of the book ‘Alice in Wonderland’:

Alice said nothing; she had found the fan and gloves.
She was walking by the time when she caught it, and on both sides at once.
She went in search of her sister, who was gently brushing away some dead leaves that had a consultation about this, and after a few things indeed were really impossible.
And so it was too slippery; and when Alice had begun to think that very few things indeed were really impossible.
Alas! it was as much as she spoke, but no result seemed to have lessons to learn!
They are waiting on the bank, with her arms round it as a cushion, resting their elbows on it, and talking over its head.

After a while of experimenting with markovify I grew to appreciate this basic but effective method (although not as effective as neural networks). So I made a ‘pure’ CircuitPython implementation for generating Markov Chains and inferring from them.

The src/generate_chain.py file in the repository must be executed from a standard Python interpreter (it will not work with CircuitPython). From any text file, this script builds a character-based Markov Chain model. By default, it looks for a CSV file containing dinosaur names (which must be downloaded separately) and creates dino_chain.json.

import re
from collections import Counter
import json

# You will need to supply a text file containing one name per line.
# A sample one can be downloaded from
with open("dinosaurs.csv") as f:
    txt = f.read()
unique_letters = set(txt)

chain = {}
for l in unique_letters:
    next_chars = []
    # re.escape keeps characters such as '.' from being treated as regex syntax
    for m in re.finditer(re.escape(l), txt):
        start_idx = m.start()
        if start_idx+1 < len(txt):
            # Record the character that follows this occurrence of l
            next_chars.append(txt[start_idx+1])
    chain[l] = dict(Counter(next_chars))

with open("dino_chain.json", "w") as i :
    json.dump(chain, i)

You may use any text file as an input to this script, but a file containing a short piece of text per line produces the best results (like names of places, people, animals, compounds etc). A character-based chain will not work well with a text file containing long sentences; for those we must use a word or n-gram-based chain. The resulting Markov Chain JSON file (dino_chain.json) should look like this:

"x": {
    "\n": 39,
    "a": 9,
    "i": 34,
    "l": 1,
    "o": 2,
    "y": 1,
    "e": 4,
    "u": 3
},
"z": {
    "s": 1,
    "o": 4,
    "i": 5,
    "u": 6,
    "e": 3,
    "a": 13,
    "h": 21,
    "k": 3,
    "y": 1,
    "r": 1,
    "b": 1
}

It is, in essence, a dictionary of dictionaries. The first level keys are the text file’s unique set of characters. Every character is mapped to a second level dictionary that contains the characters that appear immediately after the first level key character, along with the number of times each was encountered. In the preceding JSON, for example, we can see that ‘x’ is followed by ‘\n’ (newline) 39 times, by ‘a’ 9 times, and so on. It’s as straightforward as that.

To produce inferences (random text), we just take this JSON structure and convert the counts to probabilities to pick the next character for a given character. In standard Python we could use random.choices for this, but it is not available in CircuitPython, so I wrote my own function (_custom_random_choices). It lives in the src/markov_chain_parser.py file, which is basically a utility script for generating inferences. To generate inferences, use the src/generate_text.py file, which contains the following code:

import json
import time
from markov_chain_parser import generate_text

with open('dino_chain.json', 'r') as f:
    data = f.read()
dino_chain = json.loads(data)

while True:
    a = generate_text(200, dino_chain)
    names = a.split("\n")
    for n in names:
        # Keep only names of a plausible length
        if len(n) > 4 and len(n) < 15:
            print(n)
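
For readers curious about how counts can be turned into a weighted random pick without random.choices, here is a minimal sketch. It is not the repository’s _custom_random_choices; the names weighted_choice and generate are made up for this illustration, and it relies only on random.random().

```python
import random


def weighted_choice(counts):
    """Pick a key from a {char: count} dict with probability proportional to its count."""
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts.values())
    threshold = random.random() * total  # a point somewhere in [0, total)
    running = 0
    for char, count in counts.items():
        running += count
        if threshold < running:
            return char
    return char  # only reached in rare floating point edge cases


def generate(chain, length, start="\n"):
    """Walk the chain for up to length steps, starting after the start character."""
    current = start
    out = []
    for _ in range(length):
        followers = chain.get(current)
        if not followers:
            break
        current = weighted_choice(followers)
        out.append(current)
    return "".join(out)


# Usage with a tiny hand-made chain in which every character has one follower
toy_chain = {"a": {"b": 1}, "b": {"a": 1}}
print(generate(toy_chain, 4, start="a"))  # baba
```

The idea is to lay the counts end to end on a line of length total and see in which segment a random point lands; characters with larger counts own longer segments.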

We solely rely on CircuitPython’s random, json, and time modules to create inferences; we have no additional dependencies. That’s it; in order for text generation to work, you must copy the following files to Pi Pico:

  • dino_chain.json
  • markov_chain_parser.py
  • generate_text.py

I hope you found the article interesting. You may also use the src/generate_text_oled.py file to show the generated text on an OLED display.

Sample implementation of CQRS architectural pattern

Block diagram of the sample implementation.

I was recently rereading some software architecture literature. One of the frequent patterns whose names appear in the majority of publications on architectural design patterns is the CQRS pattern. The majority of these sources contain source code and information about this pattern. I therefore made the decision to produce a working prototype. The CQRS (Command Query Responsibility Segregation) architectural pattern is implemented in this opinionated sample implementation on Kubernetes utilising NodeJS, Redis, and Helm. This should not be used in production and is solely intended for learning and experimenting. Basically, it consists of three parts:

  • A web app written in NodeJS (which can be thought of as a microservice)
  • A ‘write’ datastore (implemented using Redis master instance)
  • A ‘read’ datastore (implemented using Redis replica instances)

The goal was to develop a simple sandbox implementation that is both useful for learners and hobbyists and as realistic as possible. To give you a sense of how the example would work in the real world, I used Kubernetes and Helm for the implementation.
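
The split between the three parts can be illustrated with a tiny in-memory sketch. This is not the NodeJS/Redis implementation from the project; the class names are hypothetical, and the replication here is synchronous, whereas Redis replication is asynchronous, so a real read store may briefly lag behind the write store.

```python
class WriteStore:
    """Accepts commands and forwards every change to its replicas."""

    def __init__(self):
        self.data = {}
        self.replicas = []

    def set(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.apply(key, value)


class ReadStore:
    """Read-only copy that serves queries."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class Service:
    """Routes commands to the write store and queries to the read store."""

    def __init__(self, write_store, read_store):
        self.write_store = write_store
        self.read_store = read_store

    def handle_command(self, key, value):
        self.write_store.set(key, value)

    def handle_query(self, key):
        return self.read_store.get(key)


writer = WriteStore()
reader = ReadStore()
writer.replicas.append(reader)
service = Service(writer, reader)
service.handle_command("user:1", "Ada")
print(service.handle_query("user:1"))  # Ada
```

The point of the pattern is that the query path never touches the write store, so the read side can be scaled out by adding more replicas.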

Visualizing and analyzing website contents using AI

image of visualization

Sometime ago I was wondering if it was possible to create a bird’s eye view of a website. A view, that will showcase contents of a website visually and group similar pages/posts. Also I thought it would be cool if I could navigate the website using this view. I could discover more things faster.

Recently I ended up creating ‘Feed Visualizer’. Feed Visualizer is a tool that can cluster RSS/Atom feed items based on semantic similarity and generate an interactive visualization. This tool can be used to generate a ‘semantic summary’ of any website by reading its RSS/Atom feed. Here is the link to the project’s homepage: https://github.com/code2k13/feed-visualizer

Here are links to a couple of cool interactive visualizations I created using this tool:

Check out the tool and generate some cool visualizations yourself. If you like this tool, please consider giving it a ⭐ on GitHub!

Simpler and safer method of diatom collection and observation using a Foldscope (or any optical microscope)

Diatoms from a single slide, collected and observed using the method described in this article. The sample was not known to contain diatoms; it was collected for studying algae.

🔬This article was originally written as a guide for Foldscope users, but it should work well for all types of optical microscopes.

👏This work was possible due to a lot of support and guidance from Karthick Balasubramanian of the D3 LaB. He is an expert on diatoms and has provided valuable advice for carrying out these experiments.

📌 Safety Considerations

Even though the experiments mentioned in the below article use household chemicals like laundry bleach and adhesives, these can be harmful if used inappropriately. Please read all safety considerations marked by ‘⚠️’ at important places in the article. Proceed with the experiments ONLY IF YOU UNDERSTAND ALL THE SAFETY CONSIDERATIONS. Adult supervision is recommended.

Diatoms are some of the most interesting things you can see under a foldscope/microscope. They generate about 20-50% of the oxygen produced on Earth every year.
They are also known as “jewels of the sea” and “living opals”. It is believed that there are around 200,000 species of diatoms. Identification of diatoms can be done by observing the shape and marks on their ‘frustules’. A frustule is the siliceous part of a diatom cell wall. Under a microscope they look like differently shaped glass beads with various engravings on them.

To observe these frustules, we have to ‘clean’ the diatoms. We need to remove the pigments and matter inside the diatom cell, so that we can observe the ‘frustules’ under a microscope and identify the species of diatoms. There are well established laboratory methods for doing this. These involve the use of hazardous chemicals, acids and heat, and are not suitable for use in a school or at home. This article explains a simpler and much safer method for cleaning and mounting diatoms.

What you will need

You will need the following items:

  • Centrifuge Tubes (easily available online)
  • Pipettes (easily available online)
  • Fevi kwik (transparent general purpose instant adhesive)
  • Vanish (or similar laundry bleach/detergent additive)
  • Piece of strong thread
  • Glass slides and coverslips
  • Water sample containing diatoms
  • Patience

Collecting Diatoms

Diatoms are present in abundance in most saltwater and freshwater bodies. For the purpose of this experiment I collected a small sample of water from an artificial pond. The pond was filled with algae, and I wasn’t aware if it had any diatoms in it. You can try collecting water from different places and check it for diatoms. You can also follow the instructions in this YouTube video: Diatoms. Part 1: Introduction and collection of diatoms – YouTube

Cleaning Diatoms

This is the most important and time consuming part of the process. Cleaning helps remove matter inside the diatom cell and separate diatoms from other debris. Labs use acids and other chemicals for this process. We will use a laundry bleach (Vanish brand available in India) to do the same. This particular item is sold as a detergent additive and claims to have 10x more oxidizing power compared to a normal detergent (just what we want !).

⚠️ Adult supervision is recommended. Be very careful when working with bleach of any kind. Wear gloves if you are not sure how strong the bleach is. Avoid spilling bleach on your skin and other body parts. Read information on the packet/box before using. Consult an expert if you are unsure.

Here are the steps for cleaning the diatoms:


1. Take about 30 ml of water and mix 1/4 teaspoon of the detergent additive (Vanish) into it, in an OPEN container. Make sure it is fully dissolved. The quantity of water and amount of bleach will depend on the strength of your bleach.

⚠️ DO NOT keep this solution in a closed or airtight container, because gases can build up pressure inside the container.

Oxidation of the sample causes generation of bubbles and release of gas. Keep the tube open.

2. Using a pipette, take a few drops of water from the sample you have collected and put them in a centrifuge tube. Try to take the drops from the bottom of the sample. It is fine if there is some algae and debris in it.

3. Add 10 to 15 drops of the bleach solution we created in step 1 and let the solution settle for some time, say 60 minutes.

⚠️ DO NOT close the cap on this centrifuge tube. You should see air bubbles inside the tube as the sample gets oxidised.

4. Once there are no more air bubbles or their generation has slowed down considerably, we need to centrifuge the mixture. Before that we should remove all the excess liquid very carefully from the tube using a pipette. Just leave behind the settled debris at the bottom and only a 0.5 cm deep layer of water on top of it. Once we have gotten rid of most of the bleach, we should fill half of the tube with clean water.

DIY centrifuge !!

⚠️ Excess gas buildup in a closed tube can be dangerous. We are getting rid of most of the bleach before centrifuging the mixture. This should ensure that gas buildup during the centrifugation process is minimal and mitigates any risk.

5. Screw the cap onto the centrifuge tube. Tie a 30-40 cm long thread tightly just below the cap as shown in the figure below.

6. Hold one end of the thread in your hand and move the tube in circles in a horizontal plane for a minute, as shown in the figure below.

Operating the DIY centrifuge !!

⚠️ Make sure the thread is strong and tightly tied around the base of the centrifuge cap. Also ensure that you do this in an open space to avoid hurting anyone or hitting anything.

7. Remove the cap of the centrifuge tube carefully and put it back on after a few seconds to let any gases escape. Keep the tube standing still on its pointed end for about 10 minutes by resting it vertically against a wall. You should see matter settling down at the bottom of the tube.

8. At this stage remove the cap of the centrifuge tube again and very carefully remove excess water using a pipette while leaving the matter at the base of the tube intact. I generally keep a thin layer of water (0.5 cm) above the settled debris. Now fill the tube carefully with clean water so that it is half full. Repeat steps 6, 7 and 8 two more times to wash out any bleach or soap remaining inside the tube.

9. Remove all excess water, keeping a thin layer of water (say 0.5 cm) above the settled matter. Keep the tube in a vertical position all the time; do not invert the tube.

Congratulations !! You have cleaned your first batch of diatoms. It is now time to mount them on a slide.

Creating the slide

Once the mixture in the tube has settled down, it is time to create a slide. You can create a slide from this sample without using any mounting medium. But I would highly recommend using a mounting medium for two reasons:

In the Foldscope (as opposed to conventional microscopes), slides are held vertically when viewing. The coverslip and slide are pressed hard together by two magnets, one on each side. This can put a lot of pressure on the coverslip, especially during focusing. Diatom frustules are fragile and may break.

There are many high refractive index mounting media used for diatoms, like Naphrax, Zrax, Pleurax and Hyrax, but they may be hard to find (depending on where you live) and expensive. Using them also involves heating, which releases toxic fumes. I tried to use Pleurax and heated the slide using a candle. The medium did not harden, and my slide was sticky and a total mess to work with. This is why I recommend using ‘Fevi kwik’ or any of those general purpose transparent liquid instant adhesives (these go by different brand names). These are cheap and readily available. But there is one problem: they harden very fast, in a matter of a couple of seconds. You have to be very precise with your slide placement on the first attempt itself. You won’t be able to adjust the slide after it makes contact with the adhesive.

⚠️ Please read information on the adhesive packet about safety and usage. Wear gloves if required. These things are super sticky and are not easily removable if they come in contact with your skin or fingers.

To mount the diatom specimen follow these steps:


1. Using a pipette, take only a couple of drops of water from the very bottom of the centrifuge tube containing the cleaned diatoms. This can be tricky. If you take in more water, hold the pipette still in a vertical position for a couple of minutes.

2. Now carefully place a drop of this water from the pipette onto the centre of an empty glass slide. Spread the drop slightly using the same pipette so that it covers an area equal to that of a coverslip.

3. Let the slide dry. This can take a few minutes. You should see a white powdery residue on the slide. This residue contains our diatoms.

✔️ If you don’t want to work with adhesives, you can skip steps 4, 5 and 6. Instead just place a coverslip on the white powdery portion, secure the sides with transparent tape and move to step 7.

4. Now very carefully place a drop of adhesive on top of the powdery spot on the slide.

5. Hold the coverslip using tongs (recommended) or your fingernails (wearing gloves recommended) and place it on the adhesive drop. Immediately tap the center of the coverslip gently using a blunt object.

6. The coverslip will stick to the slide within a couple of seconds. If you made a mistake, you will have to start over with a new slide and coverslip. (Always save some sample for such scenarios.)


Congratulations !! You have created a slide out of your diatom sample.

Permanent slide using transparent instant adhesive.

Viewing the Slide and Taking Pictures

If you plan to take pictures of these diatoms, make sure you have the LED Magnifier attachment for Foldscope.
LED Magnifier Kit – (Contains 20 LED/Magnifier Lights. This does not i – Foldscope Instruments, Inc. (teamfoldscope.myshopify.com)

Proper lighting and focus are very important for capturing the details of diatoms. From my personal experience, some mobile phone cameras are better than others at focusing when used with a Foldscope. Take lots of pictures and upload the best ones here on microcosmos. Below are some more images taken from a single slide created using the technique explained in this article.

Hope this information was useful to you all. Please share your comments and suggestions. Happy diatom hunting !!


Diatoms. Part 1: Introduction and collection of diatoms – YouTube

Diatoms. Part 2: Preparation of permanent slide) – YouTube

Microjewels. – Microcosmos (foldscope.com)

Creating scalable NLP pipelines using PySpark and Nlphose

In this article we will see how we can use Nlphose along with Pyspark to execute an NLP pipeline and gather information about the famous journey from Jules Verne’s book ‘Around the World in 80 days’. Here is the link to the ⬇️ Pyspark notebook used in this article.

From my personal experience, I have found that data mining from unstructured data requires the use of multiple techniques. There is no single model or library that typically offers everything you need. Often you may need to use components written in different programming languages/frameworks. This is where my open source project Nlphose comes into the picture. Nlphose enables creation of complex NLP pipelines in seconds, for processing static files or streaming text, using a set of simple command line tools. You can perform multiple operations on text like NER, Sentiment Analysis, Chunking, Language Identification, Q&A, 0-shot Classification and more by executing a single command in the terminal. Spark is a widely used big data processing tool, which can be used to parallelize workloads.

Nlphose is based on the ‘Unix tools philosophy’. The idea is to create simple tools which can work together to accomplish multiple tasks. Nlphose scripts rely on standard ‘Filestreams’ to read and write data and can be piped together to create complex natural language processing pipelines. Every script reads JSON from ‘Standard Input’ and writes to ‘Standard Output’. All the scripts expect JSON to be encoded in an Nlphose compliant format, and the output also needs to be Nlphose compliant. You can read more details about the architecture of Nlphose here.

How is Nlphose different ?

Nlphose by design does not support Spark/Pyspark natively like SparkML. Nlphose in Pyspark relies on ‘Docker’ being installed on all nodes of the Spark cluster. This is easy to do when creating a new cluster in Google Dataproc (and should be same for any other Spark distribution). Apart from Docker Nlphose does not require any other dependency to be installed on worker nodes of the Spark cluster. We use the ‘PipeLineExecutor’ class as described below to execute Nlphose pipeline from within Pyspark.

Under the hood, this class uses the ‘subprocess’ module to spawn a new docker container for a Spark task and execute a Nlphose pipeline in it. I/O is performed using stdin and stdout. This sounds very unsophisticated, but that’s how I built Nlphose: it was never envisioned to support any specific computing framework or library. You can run Nlphose in many different ways, one of which is described here.

Let’s begin

First we install a package which we will use later to construct a visualization.

!pip install wordcloud

The command below downloads the ebook ‘Around the World in 80 Days’ from gutenberg.org and uses a utility provided by Nlphose to segment the single text file into line-delimited JSON. Each JSON object has an ‘id’ and a ‘text’ attribute.

!docker run code2k13/nlphose:latest \
/bin/bash -c "wget https://www.gutenberg.org/files/103/103-0.txt && ./file2json.py 103-0.txt -n 2" > ebook.json
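For a feel of the output format, the sketch below produces similar line-delimited JSON in plain Python. It is not the file2json.py implementation: the function name and the idea of grouping a fixed number of non-empty lines into each document are assumptions made for illustration.

```python
import json
import uuid


def text_to_json_lines(text, lines_per_doc=2):
    """Group non-empty lines into documents and return one JSON string per document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    docs = []
    for start in range(0, len(lines), lines_per_doc):
        chunk = " ".join(lines[start:start + lines_per_doc])
        # Each document gets a unique 'id' and a 'text' attribute.
        docs.append(json.dumps({"id": str(uuid.uuid4()), "text": chunk}))
    return docs


sample = (
    "Mr. Phileas Fogg lived at No. 7, Saville Row.\n"
    "\n"
    "He was a member of the Reform Club.\n"
    "He was rarely seen."
)
json_lines = text_to_json_lines(sample, lines_per_doc=2)
for line in json_lines:
    print(line)
```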

We delete all docker containers which we no longer need.

!docker system prune -f

Let’s import all the required libraries, read the JSON file which we created earlier and convert it into a pandas dataframe. Then we append a ‘group_id’ column which assigns the group ids 0, 1 and 2 to the rows in rotation. Once that is done we create a new PySpark dataframe and display some results.

import pandas as pd
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import desc
from pyspark.sql.functions import udf

df_pd = pd.read_json("ebook.json",lines=True)
df_pd['group_id'] = [i for i in range(0,3)]*347
df= spark.createDataFrame(df_pd)
df.show()
|file_name| id| text|group_id|
|103-0.txt|2bbbfe64-7c1e-11e...|The Project Gute...| 0|
|103-0.txt|2bbea7ea-7c1e-11e...| Title: Around th...| 1|
|103-0.txt|2bbf2eb8-7c1e-11e...|IN WHICH PHILEAS ...| 2|
|103-0.txt|2bbfdbd8-7c1e-11e...| Certainly an Eng...| 0|
|103-0.txt|2bbff29e-7c1e-11e...| Phileas Fogg was...| 1|
|103-0.txt|2bc00734-7c1e-11e...| The way in which...| 2|
|103-0.txt|2bc02570-7c1e-11e...| He was recommend...| 0|
|103-0.txt|2bc095f0-7c1e-11e...| Was Phileas Fogg...| 1|
|103-0.txt|2bc0ed20-7c1e-11e...| Had he travelled...| 2|
|103-0.txt|2bc159d6-7c1e-11e...| It was at least ...| 0|
|103-0.txt|2bc1a3be-7c1e-11e...| Phileas Fogg was...| 1|
|103-0.txt|2bc2a2aa-7c1e-11e...|He breakfasted an...| 2|
|103-0.txt|2bc2c280-7c1e-11e...| If to live in th...| 0|
|103-0.txt|2bc30b3c-7c1e-11e...| The mansion in S...| 1|
|103-0.txt|2bc34dd6-7c1e-11e...| Phileas Fogg was...| 2|
|103-0.txt|2bc35f88-7c1e-11e...|Fogg would, accor...| 0|
|103-0.txt|2bc3772a-7c1e-11e...| A rap at this mo...| 1|
|103-0.txt|2bc3818e-7c1e-11e...| “The new servant...| 2|
|103-0.txt|2bc38e0e-7c1e-11e...| A young man of t...| 0|
|103-0.txt|2bc45c6c-7c1e-11e...| “You are a Frenc...| 1|
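The multiplier 347 in the code above fits only a dataframe of exactly 1041 rows (3 × 347). A sketch of a length-independent alternative is to derive the group from the row position; the helper name and the ‘n_groups’ parameter are illustrative, not part of Nlphose or PySpark.

```python
def assign_group_ids(n_rows, n_groups=3):
    """Return a cyclic list of group ids: 0, 1, ..., n_groups - 1, 0, 1, ..."""
    return [i % n_groups for i in range(n_rows)]


group_ids = assign_group_ids(7)
print(group_ids)  # [0, 1, 2, 0, 1, 2, 0]
```

With pandas this could be used as df_pd['group_id'] = assign_group_ids(len(df_pd)), and ‘n_groups’ can then be tuned to the size of the cluster.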

Running the nlphose pipeline using Pyspark

As discussed earlier Nlphose does not have native integration with Pyspark/Spark. So we create a class called ‘PipeLineExecutor’ that starts a docker container and executes a Nlphose command. This class communicates with the docker container using ‘stdin’ and ‘stdout’. Finally when the docker container completes execution, we execute ‘docker system prune -f’ to clear any unused containers. The ‘execute_pipeline’ method writes data from dataframe to stdin (line by line), reads output from stdout and returns a dataframe created from the output.

import subprocess
import pandas as pd
import json

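class PipeLineExecutor:
    def __init__(self, nlphose_command, data, id_column='id', text_column='text'):
        self.nlphose_command = nlphose_command
        self.id_column = id_column
        self.text_column = text_column
        self.data = data

    def execute_pipeline(self):
        # Remove unused containers before starting a new one.
        prune_proc = subprocess.Popen(["docker system prune -f"], shell=True)

        proc = subprocess.Popen([self.nlphose_command], shell=True, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
        lines = []
        for idx, row in self.data.iterrows():
            # Assumed input format: one JSON object per line with 'id' and 'text' attributes.
            lines.append(json.dumps({"id": str(row[self.id_column]), "text": str(row[self.text_column])}))

        # communicate() writes the lines to stdin, closes it and reads stdout until the container exits.
        output, error = proc.communicate(("\n".join(lines) + "\n").encode('utf-8'))
        output_str = str(output, 'utf-8')
        data = output_str.split("\n")
        data = [d for d in data if len(d) > 2]
        prune_proc = subprocess.Popen(["docker system prune -f"], shell=True)
        return pd.DataFrame(data)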
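The stdin/stdout round trip can be tried without Docker. The sketch below swaps the Nlphose container for a trivial child process (a Python one-liner that echoes its input), purely as a stand-in. It hands the whole input to communicate(), which feeds stdin and drains stdout at the same time and so does not stall when a pipe buffer fills up. The names ECHO_COMMAND and pipe_documents are made up for this example.

```python
import json
import subprocess
import sys

# Stand-in for the Nlphose docker command: a child process that copies stdin to stdout.
ECHO_COMMAND = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


def pipe_documents(command, documents):
    """Send documents as line-delimited JSON to a child process and collect its output lines."""
    payload = "".join(json.dumps(doc) + "\n" for doc in documents)
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    output, _error = proc.communicate(payload.encode("utf-8"))
    # Same filtering as in the class above: drop empty or near-empty lines.
    return [line for line in output.decode("utf-8").splitlines() if len(line) > 2]


docs = [{"id": "1", "text": "Phileas Fogg"}, {"id": "2", "text": "Passepartout"}]
returned = pipe_documents(ECHO_COMMAND, docs)
print(returned)
```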

The Nlphose command

The below command does multiple things:

  • It starts a docker container using the code2k13/nlphose:latest image from Docker Hub.
  • It attaches stdin, stdout and stderr of the host to the docker container.
  • Then it runs a nlphose command inside the docker container which performs named entity recognition (entity.py) and extractive question answering (xformer.py) on JSON coming from ‘stdin’ and writes output to ‘stdout’.

command =  '''
docker run -a stdin -a stdout -a stderr -i code2k13/nlphose:latest /bin/bash -c "./entity.py |\
./xformer.py --pipeline question-answering --param 'what did they carry?'"
'''

The function below formats the data returned by the ‘PipeLineExecutor’ task. The dataframe returned by ‘PipeLineExecutor.execute_pipeline’ has a single string column containing output from the Nlphose command. Each row in the dataframe represents a line/document output by the Nlphose command.

def get_answer(row):
    try:
        x = json.loads(row[0],strict=False)
        row['json_obj'] = json.dumps(x)
        # Keep the answer only when the model is confident enough.
        if x['xfrmr_question_answering']['score'] > 0.80:
            row['id'] = str(x['id'])
            row['answer'] = x['xfrmr_question_answering']['answer']
        else:
            row['id'] = str(x['id'])
            row['answer'] = None

    except Exception as e:
        row['id'] = None
        row['answer'] = "ERROR " + str(e) #.message
        row['json_obj'] = None

    return row

The function below creates a ‘PipeLineExecutor’ object, passes the data to it and then calls the ‘execute_pipeline’ method on the object. Then it uses the ‘get_answer’ function to format the output of the ‘execute_pipeline’ method.

def run_pipeline(data):
    nlphose_executor = PipeLineExecutor(command,data,"id","text")
    result = nlphose_executor.execute_pipeline()
    result = result.apply(get_answer,axis=1)
    return result[["id","answer","json_obj"]]

Scaling the pipeline using PySpark

We use ‘applyInPandas’ from PySpark to parallelize and process text at scale. PySpark automatically handles scaling of the Nlphose pipeline on the Spark cluster. The ‘run_pipeline’ function is invoked for every ‘group’ of the input data. It is important to choose a number of groups appropriate to the number of nodes, so that data is processed efficiently on the Spark cluster.

output = df.groupby("group_id").applyInPandas(run_pipeline, schema="id string,answer string,json_obj string")

Visualizing our findings

Once we are done executing the nlphose pipeline, we set out to visualize our findings. I have created two visualizations:

  • A map showing places mentioned in the book.
  • A word cloud of all the important items the characters carried for their journey.

Plotting most common locations from the book on world map

The below code extracts latitude and longitude information from Nlphose pipeline and creates a list of most common locations.

💡 Note: Nlphose entity extraction will automatically guess coordinates for well known locations using dictionary based approach

def get_latlon2(data):
    json_obj = json.loads(data)
    if 'entities' in json_obj.keys():
        for e in json_obj['entities']:
            if e['label'] == 'GPE' and 'cords' in e.keys():
                return json.dumps({'data':[e['entity'],e['cords']['lat'],e['cords']['lon']]})
    return None

get_latlon_udf2 = udf(get_latlon2)
df_locations = output.withColumn("locations",get_latlon_udf2(output["json_obj"]))
top_locations = df_locations.filter("`locations` != 'null'").groupby("locations").count().sort(desc("count")).filter("`count` >= 1")
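The same extraction and counting logic can be checked locally without Spark. The sketch below applies the rule used in get_latlon2 (the first GPE entity that has coordinates) to a few hand-written records and counts the results with collections.Counter. The sample records and their coordinates are invented for the example, and the function name is hypothetical.

```python
import json
from collections import Counter


def first_location(json_str):
    """Return (name, lat, lon) of the first GPE entity with coordinates, or None."""
    doc = json.loads(json_str)
    for e in doc.get("entities", []):
        if e["label"] == "GPE" and "cords" in e:
            return (e["entity"], e["cords"]["lat"], e["cords"]["lon"])
    return None


london = {"label": "GPE", "entity": "London", "cords": {"lat": 51.5, "lon": -0.12}}
bombay = {"label": "GPE", "entity": "Bombay", "cords": {"lat": 19.07, "lon": 72.87}}
records = [
    json.dumps({"entities": [london]}),
    json.dumps({"entities": [{"label": "PERSON", "entity": "Fogg"}]}),
    json.dumps({"entities": [london]}),
    json.dumps({"entities": [bombay]}),
]

# Drop records without a usable location, then count the rest.
locations = [loc for loc in map(first_location, records) if loc is not None]
location_counts = Counter(locations).most_common()
print(location_counts)
```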

Then we use the ‘geopandas’ package to plot these locations on a world map. Before we do that, we have to transform our dataframe into the format understood by ‘geopandas’. This is done by applying the function ‘add_lat_long’.

def add_lat_long(row):
    obj = json.loads(row[0])["data"]
    row["lat"] = obj[1]
    row["lon"] = obj[2]
    return row

import geopandas

df_locations = top_locations.toPandas()
df_locations = df_locations.apply(add_lat_long,axis=1)

gdf = geopandas.GeoDataFrame(df_locations, geometry=geopandas.points_from_xy(df_locations.lon, df_locations.lat))
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color=(25/255,211/255,243/255) ,edgecolor=(25/255,211/255,243/255),
linewidth=0.4,edgecolors='none',figsize=(15, 15))

If you are familiar with this book, you will realize we have almost plotted the actual route taken by Fogg in his famous journey.

For reference, shown below is the image of actual route taken by him from wikipedia

Around the World in Eighty Days map

Roke, CC BY-SA 3.0, via Wikimedia Commons

Creating a word cloud of items carried by Fogg on his journey

The below code finds the most common items carried by the travellers using ‘extractive question answering’ and creates a word cloud.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

figure(figsize=(12, 6), dpi=120)
wordcloud = WordCloud(background_color='white',width=1024,height=500).generate(' '.join(output.filter("`answer` != 'null'").toPandas()['answer'].tolist()))


One of the reasons why PySpark is a favorite tool of data scientists and ML practitioners is that working with dataframes is very convenient. This article shows how we can run Nlphose on a Spark cluster using PySpark. Using the approach described in this article we can embed a Nlphose pipeline in our data processing pipelines very easily. Hope you liked this article. Feedback and comments are always welcome, thank you!

Creating scalable NLP pipelines with Nlphose, Kafka and Kubernetes

A typical Natural Language Processing (NLP) pipeline contains many inference tasks, employing models created in different programming languages and frameworks. Today there are many established ways to deploy and scale inference mechanisms based on a single ML model, but scaling a pipeline is not that simple. In this article we will see how we can create a scalable NLP pipeline that can theoretically process thousands of text messages in parallel using my open source project Nlphose, Kafka and Kubernetes.

Introduction to nlphose

Nlphose enables creation of complex NLP pipelines in seconds, for processing static files or streaming text, using a set of simple command line tools. No programming needed! You can perform multiple operations on text like NER, Sentiment Analysis, Chunking, Language Identification, Q&A, 0-shot Classification and more by executing a single command in the terminal. You can read more about this project here: https://github.com/code2k13/nlphose. Nlphose also has a GUI query builder tool that allows you to create complex NLP pipelines via drag and drop, right inside your browser. You can check out the tool here: https://ashishware.com/static/nlphose.html

Create a nlphose pipeline

Recently I added two components to my project:

  • kafka2json.py - This tool is similar to kafkacat. It can listen to a Kafka topic and stream text data to the rest of the pipeline.
  • kafkasink.py - This tool acts like a ‘sink’ in the NLP pipeline and writes the output of the pipeline to a Kafka topic.

In this article we will use the graphical query builder tool to create a complex NLP pipeline which looks like this :

The pipeline does the following things:

  • Listens to a topic called ‘nlphose’ on Kafka
  • Performs named entity recognition (with automatic geo-tagging of locations)
  • Identifies the language of the text
  • Searches for the answer to the question “How many planets are there?”, using extractive question answering
  • Writes the output of the pipeline to the Kafka topic ‘nlpoutput’

This is what the Nlphose command for the above pipeline looks like :

./kafka2json.py -topic nlphose -endpoint localhost:9092 -groupId grp1 |\
./entity.py |\
./lang.py |\
./xformer.py --pipeline question-answering --param 'How many planets are there?' |\
./kafkasink.py -topic nlpoutput -endpoint localhost:9092

Configuring Kafka

This article assumes you already have a Kafka instance and have created two topics, ‘nlphose’ and ‘nlpoutput’, on it. If you do not have access to a Kafka instance, please refer to this quick start guide to set up Kafka on your development machine. The Kafka instance should be accessible to all the machines or the Kubernetes cluster running the Nlphose pipeline. In this example, the ‘nlphose’ topic had 16 partitions. We will use multiple consumers in a single group to subscribe to this topic and process messages concurrently.
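Within a Kafka consumer group, each partition is read by at most one consumer at a time, so the number of partitions caps the number of pipeline instances that can do useful work. The sketch below illustrates this with a simple round-robin split. It is only an illustration of the idea: Kafka's actual assignment strategy is configurable and may distribute partitions differently, and the function name is made up.

```python
def assign_partitions(n_partitions, n_consumers):
    """Distribute partition numbers across consumers in round-robin order."""
    assignment = {c: [] for c in range(n_consumers)}
    for p in range(n_partitions):
        assignment[p % n_consumers].append(p)
    return assignment


# 16 partitions shared by, say, 6 consumers: every consumer gets 2 or 3 partitions.
six = assign_partitions(16, 6)
print({c: len(parts) for c, parts in six.items()})

# With more consumers than partitions, the extra consumers receive nothing.
twenty = assign_partitions(16, 20)
idle = [c for c, parts in twenty.items() if not parts]
print(len(idle))
```

Under this reading, running more pipeline instances than the topic has partitions would not be expected to increase throughput.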

Deploying the Nlphose pipeline on Kubernetes

Nlphose comes with a docker image containing all the code and models required for it to function. You will simply have to create a deployment in Kubernetes to run this pipeline. Given below is a sample deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nlphose
  labels:
    app: nlphose
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nlphose
  template:
    metadata:
      labels:
        app: nlphose
    spec:
      containers:
      - name: nlphose
        image: code2k13/nlphose:latest
        command: ["/bin/sh"]
        #args: ["-c", "while true; do echo Done Deploying sv-premier; sleep 3600;done"]
        #command: [" "]
        #args: ["pwd "]
        args: ["-c","cd /usr/src/app/nlphose/scripts && ./kafka2json.py -topic nlphose -endpoint -groupId grp1 | ./entity.py | ./lang.py | ./xformer.py --pipeline question-answering --param 'How many planets are there ?' | ./kafkasink.py -topic nlpoutput -endpoint"]

Please ensure that you supply the correct IP and port of your Kafka instance after each ‘-endpoint’ flag.

Deploying the Nlphose pipeline without Kubernetes

Kubernetes is not mandatory to create scalable pipelines with Nlphose. As a Nlphose pipeline is just a set of shell commands, you can run it on multiple bare metal machines, VMs or PaaS solutions. When used with the Kafka plugins, Nlphose is able to process Kafka messages concurrently, so long as the same groupId is used on all computers running the pipeline. Simply copy the output from the GUI query builder and paste it in the terminal of the target machines, and you have a cluster of computers executing the pipeline concurrently!

Benchmarking and results

I created a small program that continuously posted the message below to the Kafka topic ‘nlphose’:

message = {
    "id": "uuid",
    "text": "The solar system has nine planets. Earth is the only plant with life.Jupiter is the largest planet. Pluto is the smallest planet."
}

The corresponding processed message generated by the pipeline looks like this:

{
  "id": "79bde3c1-3e43-46c0-be00-d730cd240d5a",
  "text": "The solar system has nine planets. Earth is the only plant with life.Jupiter is the largest planet. Pluto is the smallest planet.",
  "entities": [
    {"label": "CARDINAL", "entity": "nine"},
    {"label": "LOC", "entity": "Earth"},
    {"label": "LOC", "entity": "Jupiter"}
  ],
  "lang": "en",
  "xfrmr_question_answering": {
    "score": 0.89333176612854,
    "start": 21,
    "end": 25,
    "answer": "nine"
  }
}

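The ‘start’ and ‘end’ values in ‘xfrmr_question_answering’ appear to be character offsets into ‘text’. That reading is an assumption rather than something the Nlphose documentation is quoted on here, but it is consistent with the sample message, as this short check shows.

```python
import json

message = json.loads("""
{
  "text": "The solar system has nine planets. Earth is the only plant with life.Jupiter is the largest planet. Pluto is the smallest planet.",
  "xfrmr_question_answering": {"score": 0.89333176612854, "start": 21, "end": 25, "answer": "nine"}
}
""")

qa = message["xfrmr_question_answering"]
# Slice the original text with the reported offsets.
span = message["text"][qa["start"]:qa["end"]]
print(span)
```

If the offsets are indeed character positions, consumers of the ‘nlpoutput’ topic can highlight the answer in the original text without searching for it again.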
The graph below shows results from experiments I performed on GKE with a self-hosted Kafka instance. It can be clearly seen from the chart that the time required to process requests goes down as more instances are added.


With the addition of new Kafka plugins Nlphose has become a true big data tool. I am continuously working on making it more useful and easy to operate for end users. Nlphose can work with Kubernetes, Docker or bare metal. Also, you can customize or add new tools to these pipelines easily if you stick to the guiding principles of the project. Hope you found this article useful. For any suggestions or comments please reach me on Twitter or Linkedin.

Free and open source tool for star removal in astronomy images

When you look at images of galaxies and nebulae on the internet, you may not realize that the stars in them are generally absent or very dim. In reality that is not the case: there are tons of stars surrounding these objects. Special tools are used to remove these stars and make the images look pretty. Starnet is one of the popular tools for star removal available online. Being familiar with AI/ML, I decided to write my own neural network based tool for star removal.

You can download the tool here : https://github.com/code2k13/starreduction

Here are some before-after images from my collection:

Here are some more samples generated by my tool:
Original and starless image of  Rim Nebula
Original and starless image of Veil Nebula

These days I have found a new hobby for myself: post processing astronomical images. I have always been fond of astronomy. In fact I even own a small telescope, but I don’t have access to open areas or the time to do anything with an actual telescope. So I subscribed to https://telescope.live, where I can choose which object I want to observe and get high quality raw data from telescopes. Post processing refers to the activity of turning ‘almost black’ looking raw telescope images into colorful images like the ones above.

There are many tools available for post processing, my favourites are GIMP and G’MIC (GREYC’s Magic for Image Computing), because they are open source and have lots of great features.

One big challenge I faced when writing my star removal tool was the lack of training data. I simply used two freely available images, one for the background and one for the star mask, to create hundreds of fake training images. I am still working on improving my training data generation logic and training process, but the results look promising even with what I have today. Here is an image showing a couple of training data images:

Samples from training data

The ‘Expected’ images were generated by superimposing random crops from a starless image and star mask (with transparency) on each other. The training code is available as a python notebook on Kaggle : https://www.kaggle.com/finalepoch/star-removal-from-astronomical-images-with-pix2pix
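As a rough illustration of the superimposing step (this is not the code from the Kaggle notebook), the sketch below takes random crops from a starless image and a star mask and alpha-blends them. Images are plain nested lists of grayscale values to keep the example free of dependencies. The function names, the crop size and the use of the mask's brightness as its transparency are all assumptions.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size square from a 2D list at a random position."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def superimpose(background, stars):
    """Blend star pixels (0-255) over the background, using star brightness as alpha."""
    blended = []
    for bg_row, star_row in zip(background, stars):
        row = []
        for bg, star in zip(bg_row, star_row):
            alpha = star / 255.0
            row.append(round(alpha * star + (1.0 - alpha) * bg))
        blended.append(row)
    return blended


rng = random.Random(0)
starless = [[40] * 8 for _ in range(8)]   # flat grey stand-in for a starless image
star_mask = [[0] * 8 for _ in range(8)]   # black mask ...
star_mask[3][4] = 255                     # ... with a single bright star

starless_crop = random_crop(starless, 4, rng)
composite = superimpose(starless_crop, random_crop(star_mask, 4, rng))
for row in composite:
    print(row)
```

Repeating this with fresh random crops yields as many pairs of a starless crop and its star-covered counterpart as needed.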

I have also dockerized my tool, so you should be able to run it on any platform supporting Docker. You can run the tool using the command below:

docker run   -v $PWD:/usr/src/app/starreduction/data  \
-it code2k13/starreduction \
/bin/bash -c "./removestars.py ./data/example.jpg ./data/example_starless.jpg"

$PWD refers to your current working directory. In the above example it is assumed that the file example.jpg resides in your current working directory. This directory is mounted as a volume with the path /usr/src/app/starreduction/data inside the docker container. The output image example_starless.jpg will also be written to same directory.

If you are into astrophotography, do check out my tool and share your feedback. The source code for the application, the training script and the model weights are available for free on GitHub.

Visualising languages of tweets in real time using nlphose and C3.js

image of visualization
In this article I will show you how to perform language identification on tweets and how to stream the results to a webpage and display a real time visualization. We will use my open source project nlphose and C3.js to create this visualization in minutes and without writing any Python code !

To run this example you need the following software:

  • Docker
  • ngrok (optional, not required if your OS has GUI)
  • Internet Browser

Starting the nlphose docker container

Run the below code at shell/command prompt. It pulls the latest nlphose docker image from Docker hub. After running the command, it should start ‘bash’ inside the container

docker run --rm -it -p 3000:3000 code2k13/nlphose:latest

Running the nlphose pipeline inside the container

Copy paste the below command inside the container’s shell prompt. It will start the nlphose pipeline

twint -s "netflix" |\
./twint2json.py |\
./lang.py |\
jq -c '[.id,.lang]' |\

The above code collects tweets containing the term “netflix” from Twitter using the ‘twint’ command and performs language identification on them. Then it streams the output using a socket.io server. For more details, please refer to the wiki of my project. You can also create these commands graphically, as shown below, using the NlpHose Pipeline Builder tool.

Exposing local port 3000 on the internet (optional)

If you are running this pipeline on a headless server (no browser), you can expose port 3000 of your host machine over the internet using ngrok.

./ngrok http 3000

Running the demo

Download this html file from my GitHub repo. Edit the file and update the following line in the file with ngrok url

var endpointUrl = "https://your_ngrok_url"

If you are not using ngrok and have a browser installed on the system that is running the docker container, simply change the line to:

var endpointUrl = "http://localhost:3000"

You will need to run this file from a local webserver (for example http-server or python -m http.server 8080). Once you run it, you should see a webpage like the one shown below:

image of visualization

That’s it !! Hope you enjoyed this article !

Create NLP pipelines using drag and drop

✨Checkout the live demo here !

Recently I completed work on a tool called nlphoseGUI that allows creation of complex NLP pipelines visually, without writing a single line of code ! It uses Blockly to enable creation of NLP pipelines using drag and drop.

Currently the following operations are supported:

  • Sentiment Analysis (AFINN)
  • NER (Spacy)
  • Language Identification (FastText)
  • Chunking (NLTK)
  • Sentiment Analysis (Transformers)
  • Question Answering (Transformers)
  • Zero shot Classification (Transformers)

The tool generates a nlphose command that can be executed in a docker container to run the pipeline. These pipelines can process streaming text like tweets or static data like files. They can be executed just like normal shell command using nlphose. Let me show you what I mean !

Below is a pipeline that searches Twitter for tweets containing ‘netflix’ and performs named entity recognition on them.

It generates a nlphose command which looks like this

twint -s netflix |\
./twint2json.py |\
./entity.py |\

When the above pipeline is run using nlphose, you can expect to see a stream of JSON output similar to the one shown below:

{
  "id": "6a5fe972-e2e6-11eb-9efa-42b45ace4426",
  "text": "Wickham were returned, and to lament over his absence from the Netherfield ball. He joined them on their entering the town, and attended them to their aunt’s where his regret and vexation, and the concern of everybody, was well talked over. To Elizabeth, however, he voluntarily acknowledged that the necessity of his absence _had_ been self-imposed.",
  "afinn_score": -1.0,
  "entities": [
    {"label": "PERSON", "entity": "Wickham"},
    {"label": "ORG", "entity": "Netherfield"},
    {"label": "PERSON", "entity": "Elizabeth"}
  ]
}

Let’s try something more. The pipeline below searches for tweets containing the word ‘rainfall’ and then finds the location where it rained using ‘extractive question answering’. It also filters out answers with lower scores.

Here is the nlphose command it generates:

twint -s rainfall |\
./twint2json.py |\
./xformer.py --pipeline question-answering --param 'where did it rain' |\
jq 'if (.xfrmr_question_answering.score) > 0.80 then . else empty end'
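For readers less familiar with jq, the last stage above keeps a document only when the answer's score exceeds 0.80. A rough Python equivalent might look like the sketch below. The function name and the sample documents are made up; only the ‘xfrmr_question_answering.score’ field comes from the pipeline output.

```python
import json


def keep_confident(json_lines, threshold=0.80):
    """Yield parsed documents whose question-answering score is above the threshold."""
    for line in json_lines:
        doc = json.loads(line)
        if doc["xfrmr_question_answering"]["score"] > threshold:
            yield doc


sample_lines = [
    '{"id": "a", "xfrmr_question_answering": {"score": 0.93, "answer": "Mumbai"}}',
    '{"id": "b", "xfrmr_question_answering": {"score": 0.41, "answer": "yesterday"}}',
]
kept = list(keep_confident(sample_lines))
print([doc["id"] for doc in kept])
```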

It is also possible to create a pipeline that processes multiple files from a folder :

The above pipeline generates this command:

./files2json.py -n 3 data/*.txt |\
./xformer.py --pipeline question-answering --param 'who gave the speech ?' |\
jq 'if (.xfrmr_question_answering.score) > 0.80 then . else empty end'

Play with the tool here: https://ashishware.com/static/nlphose.html

Here is the link to the project’s Git repository: https://github.com/code2k13/nlphoseGUI

Here is a YouTube link of the tool in action:

Don’t forget to check out the repository of the companion project nlphose: https://github.com/code2k13/nlphose