Visualizing and analyzing website contents using AI

image of visualization

Some time ago I was wondering whether it was possible to create a bird's eye view of a website: a view that would showcase the contents of a website visually and group similar pages and posts. I also thought it would be cool if I could navigate the website using this view and discover more things faster.

Recently I ended up creating ‘Feed Visualizer‘. Feed Visualizer is a tool that can cluster RSS/Atom feed items based on semantic similarity and generate an interactive visualization. It can be used to generate a ‘semantic summary’ of any website by reading its RSS/Atom feed. Here is the link to the project’s homepage: https://github.com/code2k13/feed-visualizer
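To give a rough idea of the underlying approach, here is a minimal sketch of clustering feed items by semantic similarity. It is an illustration only, not Feed Visualizer's actual code; the feed URL, the sentence-transformers model name and the cluster count are all assumptions.

import feedparser
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Fetch feed items (the URL is a placeholder)
feed = feedparser.parse("https://example.com/feed.xml")
texts = [entry.title + " " + entry.get("summary", "") for entry in feed.entries]

# Embed each item and group semantically similar items together
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)
labels = KMeans(n_clusters=min(5, len(texts))).fit_predict(embeddings)

for entry, label in zip(feed.entries, labels):
    print(label, entry.title)

Feed Visualizer additionally renders the resulting clusters as an interactive visualization.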

Here are links to a couple of cool interactive visualizations I created using this tool:

Check out the tool and generate some cool visualizations yourself. If you like this tool, please consider giving it a ⭐ on GitHub!

Simpler and safer method of diatom collection and observation using a Foldscope (or any optical microscope)


Diatoms from a single slide, collected and observed using the method described in this article. The sample was not known to contain diatoms; it was collected for studying algae.

🔬This article was originally written as a guide for Foldscope users, but it should work well for all types of optical microscopes.

👏This work was possible due to a lot of support and guidance from Karthick Balasubramanian of the D3 LaB. He is an expert on diatoms and has provided valuable advice for carrying out these experiments.

📌 Safety Considerations

Even though the experiments in the article below use household chemicals like laundry bleach and adhesives, these can be harmful if used inappropriately. Please read all safety considerations marked by ‘⚠️’ at important places in the article. Proceed with the experiments ONLY IF YOU UNDERSTAND ALL THE SAFETY CONSIDERATIONS. Adult supervision is recommended.

Diatoms are some of the most interesting things you can see under a Foldscope/microscope. They generate about 20-50% of the oxygen produced on Earth every year.
They are also known as “jewels of the sea” and “living opals”. It is believed that there are around 200,000 species of diatoms. Diatoms can be identified by observing the shape of, and markings on, their ‘frustules’. A frustule is the siliceous part of a diatom cell wall. Under a microscope, frustules look like glass beads of different shapes with various engravings on them.

To observe these frustules, we have to ‘clean’ the diatoms. We need to remove the pigments and matter inside the diatom cell, so that we can observe the frustules under a microscope and identify the species. There are well established laboratory methods for doing this, but they involve the use of hazardous chemicals, acids and heat, which makes them unsuitable for use in a school or at home. This article explains a simpler and much safer method for cleaning and mounting diatoms.

What you will need

You will need the following items:

  • Centrifuge Tubes (easily available online)
  • Pipettes (easily available online)
  • Fevi kwik (transparent general purpose instant adhesive)
  • Vanish (or similar laundry bleach/detergent additive)
  • Piece of strong thread
  • Glass slides and coverslips
  • Water sample containing diatoms
  • Patience

Collecting Diatoms

Diatoms are present in abundance in most saltwater and freshwater bodies. For the purpose of this experiment I collected a small sample of water from an artificial pond. The pond was filled with algae, and I wasn’t aware whether it had any diatoms in it. You can try collecting water from different places and checking it for diatoms. You can also follow the instructions in this YouTube video: Diatoms. Part 1: Introduction and collection of diatoms – YouTube

Cleaning Diatoms

This is the most important and time consuming part of the process. Cleaning helps remove matter inside the diatom cell and separate diatoms from other debris. Labs use acids and other chemicals for this process. We will use a laundry bleach (the Vanish brand available in India) to do the same. This particular item is sold as a detergent additive and claims to have 10x more oxidizing power compared to a normal detergent (just what we want!).

⚠️ Adult supervision is recommended. Be very careful when working with bleach of any kind. Wear gloves if you are not sure how strong the bleach is. Avoid spilling bleach on your skin and other body parts. Read information on the packet/box before using. Consult an expert if you are unsure.

Here are the steps for cleaning the diatoms:

STEP 1

Take about 30ml of water and mix 1/4 teaspoon of the detergent additive (Vanish) in it, in an OPEN container. Make sure it is fully dissolved. The quantity of water and amount of bleach would depend on the strength of your bleach.

⚠️ DO NOT keep this solution in a closed or airtight container, because gases can build up pressure inside the container.


Oxidation of the sample causes generation of bubbles and release of gas. Keep the tube open.

STEP 2

Using a pipette, transfer a few drops of water from the sample you have collected into a centrifuge tube. Try to take the drops from the bottom of the sample. It is fine if there is some algae and debris in it.

STEP 3

Add 10-15 drops of the bleach solution we created in STEP 1 and let the solution settle for some time, say 60 minutes.

⚠️ DO NOT close the cap on this centrifuge tube. You should see air bubbles inside the tube as the sample gets oxidised.

STEP 4

Once there are no more air bubbles, or their generation has slowed down considerably, we need to centrifuge the mixture. Before that, carefully remove all the excess liquid from the tube using a pipette, leaving behind only the settled debris at the bottom and a 0.5 cm deep layer of water on top of it. Once we have gotten rid of most of the bleach, fill half of the tube with clean water.


DIY centrifuge !!

⚠️ Excess gas buildup in a closed tube can be dangerous. We are getting rid of most of the bleach before centrifuging the mixture. This should ensure that gas buildup during the centrifugation process is minimal, and mitigates the risk.

STEP 5

Screw the cap onto the centrifuge tube. Tie a 30-40 cm long thread tightly just below the cap, as shown in the figure below.

STEP 6

Hold one end of the thread in your hand and start moving the tube in circles in a horizontal plane, as shown in the figure below, for a minute.


Operating the DIY centrifuge !!

⚠️ Make sure the thread is strong and tightly tied around the base of the centrifuge cap. Also ensure that you do this in an open space to avoid hurting anyone or hitting anything.

STEP 7

Remove the cap of the centrifuge tube carefully and put it back on after a few seconds to let any gases escape. Keep it standing still on its pointed end for about 10 minutes by resting it vertically against a wall. You should see matter settling down at the bottom of the tube.

STEP 8

At this stage, remove the cap of the centrifuge tube again and very carefully remove the excess water using a pipette, leaving the matter at the base of the tube intact. I generally keep a thin layer of water (0.5 cm) above the settled debris. Now carefully fill the tube with clean water so that it is half full. Repeat steps 6, 7 and 8 two more times to wash out any bleach or soap remaining inside the tube.

STEP 9

Remove all excess water, keeping a thin layer of water (say 0.5 cm) above the settled matter. Keep the tube in a vertical position at all times; do not invert the tube.

Congratulations!! You have cleaned your first batch of diatoms. It is now time to mount them on a slide.

Creating the slide

Once the mixture in the tube has settled down, it is time to create a slide. You can create a slide from this sample without using any mounting medium, but I would highly recommend using a mounting medium for two reasons:

In the Foldscope (as opposed to conventional microscopes), slides are held vertically when viewing. The coverslip and slide are pressed hard together by two magnets on both sides. This can put a lot of pressure on the coverslip, especially during focusing. Diatom frustules are fragile and may break.

There are many high refractive index mounting media used for diatoms, like Naphrax, Zrax, Pleurax and Hyrax, but they may be hard to find (depending on where you live) and expensive. They also require heating, which releases toxic fumes. I tried to use Pleurax and heated the slide using a candle; the medium did not harden and my slide was sticky and a total mess to work with. This is why I recommend using ‘Fevi kwik’ or any general purpose transparent liquid instant adhesive (these go by different brand names). These are cheap and readily available. But there is one problem: they harden very fast, in a matter of a couple of seconds. You have to be very precise with your slide placement in the first attempt itself. You won’t be able to adjust the slide after it makes contact with the adhesive.

⚠️ Please read information on the adhesive packet about safety and usage. Wear gloves if required. These things are super sticky and are not easily removable if they come in contact with your skin or fingers.

To mount the diatom specimen follow these steps:

STEP 1

Using a pipette, take only a couple of drops of water from the very bottom of the centrifuge tube containing the cleaned diatoms. This can be tricky. If you take in more water, hold the pipette still in a vertical position for a couple of minutes so that the sediment settles toward its tip.

STEP 2

Now carefully place a drop of this water from the pipette onto the centre of an empty glass slide. Spread the drop slightly using the same pipette so as to cover an area equal to that of a coverslip.

STEP 3

Let the slide dry. This can take a few minutes. You should see white powdery residue on the slide. This residue contains our diatoms.

✔️ If you don’t want to work with adhesives, you can skip steps 4, 5 and 6. Instead, just place a coverslip on the white powdery portion, secure the sides with transparent tape and move to step 7.

STEP 4

Now very carefully place a drop of adhesive on top of the powdery spot on the slide.

STEP 5

Hold the coverslip using tongs (recommended) or your fingernails (wearing gloves is recommended) and place it on the adhesive drop. Immediately tap the center of the coverslip gently using a blunt object.

STEP 6

The coverslip will stick to the slide within a couple of seconds. If you made a mistake, you will have to start over with a new slide and coverslip. (Always save some sample for such scenarios.)

STEP 7

Congratulations!! You have created a slide out of your diatom sample.


Permanent slide using transparent instant adhesive.

Viewing the Slide and Taking Pictures

If you plan to take pictures of these diatoms, make sure you have the LED Magnifier attachment for the Foldscope:
LED Magnifier Kit – Foldscope Instruments, Inc. (teamfoldscope.myshopify.com)

Proper lighting and focus are very important for capturing the details of diatoms. From my personal experience, I have found that some mobile phone cameras focus better with the Foldscope than others. Take lots of pictures and upload the best ones here on Microcosmos. Below are some more images taken from a single slide created using the technique explained in this article.




Hope this information was useful to you all. Please share your comments and suggestions. Happy diatom hunting!!

References

Diatoms. Part 1: Introduction and collection of diatoms – YouTube

Diatoms. Part 2: Preparation of permanent slide – YouTube

Microjewels. – Microcosmos (foldscope.com)

Creating scalable NLP pipelines using PySpark and Nlphose

In this article we will see how we can use Nlphose along with PySpark to execute an NLP pipeline and gather information about the famous journey from Jules Verne’s book ‘Around the World in 80 Days‘. Here is the link to the ⬇️ PySpark notebook used in this article.


From my personal experience, I have found that data mining from unstructured data requires the use of multiple techniques. There is no single model or library that offers everything you need, and you may often need components written in different programming languages or frameworks. This is where my open source project Nlphose comes into the picture. Nlphose enables the creation of complex NLP pipelines in seconds, for processing static files or streaming text, using a set of simple command line tools. You can perform multiple operations on text, like NER, sentiment analysis, chunking, language identification, Q&A, 0-shot classification and more, by executing a single command in the terminal. Spark is a widely used big data processing tool which can be used to parallelize workloads.

Nlphose is based on the ‘Unix tools philosophy‘. The idea is to create simple tools which can work together to accomplish multiple tasks. Nlphose scripts rely on standard file streams to read and write data and can be piped together to create complex natural language processing pipelines. Every script reads JSON from standard input and writes to standard output. All the scripts expect JSON encoded in the nlphose-compliant format, and their output is also nlphose compliant. You can read more details about the architecture of Nlphose here.
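As a concrete illustration, a nlphose-compliant stream is just line-delimited JSON, one document per line, with at least an ‘id’ and a ‘text’ attribute (the same shape produced by the file2json.py utility used later in this article). The snippet below is only a hypothetical producer of such a stream, not part of Nlphose itself.

import json
import uuid

# Emit two nlphose-compliant documents, one JSON object per line
for text in ["Phileas Fogg lived in Saville Row.",
             "He wagered he could go round the world in eighty days."]:
    print(json.dumps({"id": str(uuid.uuid4()), "text": text}))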

How is Nlphose different?

By design, Nlphose does not support Spark/PySpark natively the way SparkML does. Running Nlphose in PySpark relies on Docker being installed on all nodes of the Spark cluster. This is easy to do when creating a new cluster in Google Dataproc (and should be similar for any other Spark distribution). Apart from Docker, Nlphose does not require any other dependency to be installed on the worker nodes of the Spark cluster. We use the ‘PipeLineExecutor’ class described below to execute a Nlphose pipeline from within PySpark.

Under the hood, this class uses the ‘subprocess‘ module to spawn a new docker container for a Spark task and executes a Nlphose pipeline inside it. I/O is performed using stdout and stdin. This sounds very unsophisticated, but that’s how I built Nlphose. It was never envisioned to support any specific computing framework or library. You can run Nlphose in many different ways, one of which is described here.

Let’s begin

First we install a package which we will use later to construct a visualization.

!pip install wordcloud

The below command downloads the ebook ‘Around the World in 80 Days’ from gutenberg.org and uses a utility provided by Nlphose to segment the single text file into line-delimited JSON. Each JSON object has an ‘id’ and a ‘text’ attribute.

!docker run code2k13/nlphose:latest \
/bin/bash -c "wget https://www.gutenberg.org/files/103/103-0.txt && ./file2json.py 103-0.txt -n 2" > ebook.json

We delete all docker containers which we no longer need.

!docker system prune -f

Let’s import all the required libraries, read the JSON file which we created earlier and convert it into a pandas data frame. Then we append a ‘group_id’ column that assigns a group id from 0 to 2 to the rows in a repeating pattern. Once that is done, we create a new PySpark dataframe and display some results.

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import desc
from pyspark.sql.functions import udf

df_pd = pd.read_json("ebook.json", lines=True)
df_pd['group_id'] = [i for i in range(0, 3)] * 347  # assign group ids 0,1,2 in a repeating pattern
df = spark.createDataFrame(df_pd)
df.show()
+---------+--------------------+--------------------+--------+
|file_name| id| text|group_id|
+---------+--------------------+--------------------+--------+
|103-0.txt|2bbbfe64-7c1e-11e...|The Project Gute...| 0|
|103-0.txt|2bbea7ea-7c1e-11e...| Title: Around th...| 1|
|103-0.txt|2bbf2eb8-7c1e-11e...|IN WHICH PHILEAS ...| 2|
|103-0.txt|2bbfdbd8-7c1e-11e...| Certainly an Eng...| 0|
|103-0.txt|2bbff29e-7c1e-11e...| Phileas Fogg was...| 1|
|103-0.txt|2bc00734-7c1e-11e...| The way in which...| 2|
|103-0.txt|2bc02570-7c1e-11e...| He was recommend...| 0|
|103-0.txt|2bc095f0-7c1e-11e...| Was Phileas Fogg...| 1|
|103-0.txt|2bc0ed20-7c1e-11e...| Had he travelled...| 2|
|103-0.txt|2bc159d6-7c1e-11e...| It was at least ...| 0|
|103-0.txt|2bc1a3be-7c1e-11e...| Phileas Fogg was...| 1|
|103-0.txt|2bc2a2aa-7c1e-11e...|He breakfasted an...| 2|
|103-0.txt|2bc2c280-7c1e-11e...| If to live in th...| 0|
|103-0.txt|2bc30b3c-7c1e-11e...| The mansion in S...| 1|
|103-0.txt|2bc34dd6-7c1e-11e...| Phileas Fogg was...| 2|
|103-0.txt|2bc35f88-7c1e-11e...|Fogg would, accor...| 0|
|103-0.txt|2bc3772a-7c1e-11e...| A rap at this mo...| 1|
|103-0.txt|2bc3818e-7c1e-11e...| “The new servant...| 2|
|103-0.txt|2bc38e0e-7c1e-11e...| A young man of t...| 0|
|103-0.txt|2bc45c6c-7c1e-11e...| “You are a Frenc...| 1|
+---------+--------------------+--------------------+--------+

Running the nlphose pipeline using Pyspark

As discussed earlier, Nlphose does not have native integration with PySpark/Spark. So we create a class called ‘PipeLineExecutor’ that starts a docker container and executes a Nlphose command. This class communicates with the docker container using stdin and stdout. Finally, when the docker container completes execution, we execute ‘docker system prune -f’ to clear any unused containers. The ‘execute_pipeline’ method writes data from the dataframe to stdin (line by line), reads the output from stdout and returns a dataframe created from that output.

import subprocess
import pandas as pd
import json

class PipeLineExecutor:
    def __init__(self, nlphose_command, data, id_column='id', text_column='text'):
        self.nlphose_command = nlphose_command
        self.id_column = id_column
        self.text_column = text_column
        self.data = data

    def execute_pipeline(self):
        try:
            # Clean up any unused containers before starting
            prune_proc = subprocess.Popen(["docker system prune -f"], shell=True)
            prune_proc.communicate()

            # Spawn the docker container running the Nlphose pipeline
            proc = subprocess.Popen([self.nlphose_command], shell=True,
                                    stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
            # Write one JSON document per line to the pipeline's stdin
            for idx, row in self.data.iterrows():
                proc.stdin.write(bytes(json.dumps({"id": row[self.id_column], "text": row[self.text_column]}), "utf8"))
                proc.stdin.write(b"\n")
                proc.stdin.flush()

            # Read the pipeline output and split it into lines
            output, error = proc.communicate()
            output_str = str(output, 'utf-8')
            data = output_str.split("\n")
            data = [d for d in data if len(d) > 2]
        finally:
            prune_proc = subprocess.Popen(["docker system prune -f"], shell=True)
            prune_proc.communicate()
        return pd.DataFrame(data)

The Nlphose command

The below command does multiple things:

  • It starts a docker container using the code2k13/nlphose:latest image from Docker Hub.
  • It redirects the stdin, stdout and stderr of the host into the docker container.
  • Then it runs a nlphose command inside the docker container, which performs the below operations on JSON coming from stdin and writes the output to stdout:
command =  '''
docker run -a stdin -a stdout -a stderr -i code2k13/nlphose:latest /bin/bash -c "./entity.py |\
./xformer.py --pipeline question-answering --param 'what did they carry?'
"
'''

The below function formats the data returned by the PipeLineExecutor task. The dataframe returned by ‘PipeLineExecutor.execute_pipeline’ has a single string column containing the output from the Nlphose command. Each row in the dataframe represents one line/document output by the Nlphose command.

def get_answer(row):
    try:
        x = json.loads(row[0], strict=False)
        row['json_obj'] = json.dumps(x)
        # Keep only answers above a confidence threshold of 0.80
        if x['xfrmr_question_answering']['score'] > 0.80:
            row['id'] = str(x['id'])
            row['answer'] = x['xfrmr_question_answering']['answer']
        else:
            row['id'] = str(x['id'])
            row['answer'] = None
    except Exception as e:
        row['id'] = None
        row['answer'] = "ERROR " + str(e)
        row['json_obj'] = None

    return row

The below function creates a ‘PipeLineExecutor’ object, passes data to it and then calls the ‘execute_pipeline’ method on the object. It then uses the ‘get_answer’ method to format the output of ‘execute_pipeline’.

def run_pipeline(data):
    nlphose_executor = PipeLineExecutor(command, data, "id", "text")
    result = nlphose_executor.execute_pipeline()
    result = result.apply(get_answer, axis=1)
    return result[["id", "answer", "json_obj"]]

Scaling the pipeline using PySpark

We use the ‘applyInPandas‘ method from PySpark to parallelize and process text at scale. PySpark automatically handles scaling of the Nlphose pipeline on the Spark cluster. The ‘run_pipeline’ method is invoked for every group of the input data. It is important to choose an appropriate number of groups, based on the number of nodes, so that data is processed efficiently on the Spark cluster.

output = df.groupby("group_id").applyInPandas(run_pipeline, schema="id string,answer string,json_obj string")
output.cache()

Visualizing our findings

Once we are done executing the nlphose pipeline, we set out to visualize our findings. I have created two visualizations:

  • A map showing places mentioned in the book.
  • A word cloud of all the important items the characters carried for their journey.

Plotting the most common locations from the book on a world map

The below code extracts latitude and longitude information from the Nlphose pipeline output and creates a list of the most common locations.

💡 Note: Nlphose entity extraction will automatically guess coordinates for well known locations using a dictionary-based approach.

def get_latlon2(data):
    json_obj = json.loads(data)
    if 'entities' in json_obj.keys():
        # Return the first geo-political entity that has coordinates attached
        for e in json_obj['entities']:
            if e['label'] == 'GPE' and 'cords' in e.keys():
                return json.dumps({'data': [e['entity'], e['cords']['lat'], e['cords']['lon']]})
    return None

get_latlon_udf2 = udf(get_latlon2)
df_locations = output.withColumn("locations", get_latlon_udf2(output["json_obj"]))
top_locations = df_locations.filter("`locations` != 'null'").groupby("locations").count().sort(desc("count")).filter("`count` >= 1")
top_locations.cache()
top_locations.show()

Then we use the ‘geopandas’ package to plot these locations on a world map. Before we do that, we have to transform our dataframe into the format understood by ‘geopandas’. This is done by applying the function ‘add_lat_long’.

import geopandas

def add_lat_long(row):
    obj = json.loads(row[0])["data"]
    row["lat"] = obj[1]
    row["lon"] = obj[2]
    return row

df_locations = top_locations.toPandas()
df_locations = df_locations.apply(add_lat_long, axis=1)

gdf = geopandas.GeoDataFrame(df_locations, geometry=geopandas.points_from_xy(df_locations.lon, df_locations.lat))
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color=(25/255, 211/255, 243/255), edgecolor=(25/255, 211/255, 243/255),
                linewidth=0.4, edgecolors='none', figsize=(15, 15))
ax.axis('off')
gdf.plot(ax=ax, alpha=0.5, marker=".", markersize=df_locations['count']*100, color='seagreen')

If you are familiar with this book, you will realize we have almost plotted the actual route taken by Fogg in his famous journey.

For reference, shown below is an image of the actual route taken by him, from Wikipedia:

Around the World in Eighty Days map

Roke, CC BY-SA 3.0, via Wikimedia Commons

Creating a word cloud of items carried by Fogg on his journey

The below code finds the most common items carried by the travellers using ‘extractive question answering’ and creates a word cloud.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

figure(figsize=(12, 6), dpi=120)
wordcloud = WordCloud(background_color='white',width=1024,height=500).generate(' '.join(output.filter("`answer` != 'null'").toPandas()['answer'].tolist()))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Conclusion

One of the reasons why PySpark is a favorite tool of data scientists and ML practitioners is that working with dataframes is very convenient. This article showed how we can run Nlphose on a Spark cluster using PySpark. Using the approach described here, we can easily embed a Nlphose pipeline into our data processing pipelines. Hope you liked this article. Feedback and comments are always welcome, thank you!

Creating scalable NLP pipelines with Nlphose, Kafka and Kubernetes


A typical Natural Language Processing (NLP) pipeline contains many inference tasks, employing models created in different programming languages and frameworks. Today there are many established ways to deploy and scale an inference mechanism based on a single ML model, but scaling a whole pipeline is not that simple. In this article we will see how we can create a scalable NLP pipeline that can theoretically process thousands of text messages in parallel, using my open source project Nlphose, Kafka and Kubernetes.

Introduction to nlphose

Nlphose enables the creation of complex NLP pipelines in seconds, for processing static files or streaming text, using a set of simple command line tools. No programming needed! You can perform multiple operations on text, like NER, sentiment analysis, chunking, language identification, Q&A, 0-shot classification and more, by executing a single command in the terminal. You can read more about this project here: https://github.com/code2k13/nlphose. Nlphose also has a GUI query builder tool that allows you to create complex NLP pipelines via drag and drop, right inside your browser. You can check out the tool here: https://ashishware.com/static/nlphose.html

Create a nlphose pipeline

Recently I added two components to my project:

  • Kafka2json.py - This tool is similar to kafkacat. It can listen to a Kafka topic and stream text data to the rest of the pipeline.
  • Kafkasink.py - This tool acts like a ‘sink’ in the NLP pipeline and writes the output of the pipeline to a Kafka topic.

In this article we will use the graphical query builder tool to create a complex NLP pipeline which looks like this:

The pipeline does the following things:

  • Listens to a topic called ‘nlphose’ on Kafka
  • Performs named entity recognition (with automatic geo-tagging of locations)
  • Performs language identification
  • Searches for the answer to the question “How many planets are there?”, using the extractive question answering technique
  • Writes the output of the pipeline to the Kafka topic ‘nlpoutput’

This is what the Nlphose command for the above pipeline looks like:

./kafka2json.py -topic nlphose -endpoint localhost:9092 -groupId grp1 |\
./entity.py |\
./lang.py |\
./xformer.py --pipeline question-answering --param 'How many planets are there?' |\
./kafkasink.py -topic nlpoutput -endpoint localhost:9092

Configuring Kafka

This article assumes you already have a Kafka instance and have created two topics, ‘nlphose’ and ‘nlpoutput’, on it. If you do not have access to a Kafka instance, please refer to this quick start guide to set up Kafka on your development machine. The Kafka instance should be accessible to all the machines, or the Kubernetes cluster, running the Nlphose pipeline. In this example, the ‘nlphose’ topic had 16 partitions. We will use multiple consumers in a single consumer group to subscribe to this topic and process messages concurrently.
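If you prefer to create the two topics programmatically rather than with Kafka's command line scripts, a rough sketch using the kafka-python package is shown below. The package choice, broker address, replication factor and the output topic's partition count are my assumptions, not something this setup requires.

from kafka.admin import KafkaAdminClient, NewTopic

# Create the 16-partition input topic and an output topic
# (the broker address is a placeholder; point it at your Kafka instance)
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="nlphose", num_partitions=16, replication_factor=1),
    NewTopic(name="nlpoutput", num_partitions=1, replication_factor=1),
])
admin.close()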

Deploying the Nlphose pipeline on Kubernetes

Nlphose comes with a docker image containing all the code and models required for it to function. You simply have to create a deployment in Kubernetes to run this pipeline. Given below is a sample deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nlphose
  labels:
    app: nlphose
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nlphose
  template:
    metadata:
      labels:
        app: nlphose
    spec:
      containers:
      - name: nlphose
        image: code2k13/nlphose:latest
        command: ["/bin/sh"]
        args: ["-c","cd /usr/src/app/nlphose/scripts && ./kafka2json.py -topic nlphose -endpoint 127.0.0.1:9092 -groupId grp1 | ./entity.py | ./lang.py | ./xformer.py --pipeline question-answering --param 'How many planets are there ?' | ./kafkasink.py -topic nlpoutput -endpoint 127.0.0.1:9092"]

Please ensure that you change ‘127.0.0.1:9092’ to the correct IP and port of your Kafka instance.

Deploying the Nlphose pipeline without Kubernetes

Kubernetes is not mandatory for creating scalable pipelines with Nlphose. As a Nlphose pipeline is just a set of shell commands, you can run it on multiple bare metal machines, VMs or PaaS solutions. When used with the Kafka plugins, Nlphose can process Kafka messages concurrently, so long as the same groupId is used on all computers running the pipeline. Simply copy the output from the GUI query builder and paste it into the terminal of the target machines, and you have a cluster of computers executing the pipeline concurrently!

Benchmarking and results

I created a small program that continuously posted the below message on the Kafka topic ‘nlphose’:

message = {
    "id": "uuid",
    "text": "The solar system has nine planet. Earth is the only plant with life.Jupiter is the largest planet. Pluto is the smallest planet."
}
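The benchmarking program itself is not listed in the article; a minimal sketch of such a producer, assuming the kafka-python package and reusing the message dictionary above, could look like this:

import json
from kafka import KafkaProducer

# Post the test message repeatedly to the 'nlphose' topic
# (the broker address is a placeholder; point it at your Kafka instance)
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda m: json.dumps(m).encode("utf8"))
for _ in range(10000):
    producer.send("nlphose", message)
producer.flush()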

The corresponding processed message generated by the pipeline looks like this:

{
  "id": "79bde3c1-3e43-46c0-be00-d730cd240d5a",
  "text": "The solar system has nine planets. Earth is the only plant with life.Jupiter is the largest planet. Pluto is the smallest planet.",
  "entities": [
    {
      "label": "CARDINAL",
      "entity": "nine"
    },
    {
      "label": "LOC",
      "entity": "Earth"
    },
    {
      "label": "LOC",
      "entity": "Jupiter"
    }
  ],
  "lang": "en",
  "xfrmr_question_answering": {
    "score": 0.89333176612854,
    "start": 21,
    "end": 25,
    "answer": "nine"
  }
}

The below graph shows results from experiments I performed on GKE with a self-hosted Kafka instance. It can be clearly seen from the chart that the time required to process requests goes down with the addition of more instances.

Conclusion

With the addition of the new Kafka plugins, Nlphose has become a true big data tool. I am continuously working on making it more useful and easier to operate for end users. Nlphose can work with Kubernetes, Docker or bare metal. Also, you can customize these pipelines or add new tools to them easily if you stick to the guiding principles of the project. Hope you found this article useful. For any suggestions or comments, please reach me on Twitter or LinkedIn.

Free and open source tool for star removal in astronomy images

Image of the Andromeda galaxy on the left, and the same image after star removal (on the right) using my tool! Click to enlarge the image.
📥 Download the tool here: https://github.com/code2k13/starreduction

When you look at images of galaxies and nebulae on the internet, you won’t realize that stars are generally absent or very dim in them. In reality that is not the case; there are tons of stars surrounding these objects. Special tools are used to remove these stars and make the images look pretty. Starnet is one of the popular tools for star removal available online. Being familiar with AI/ML, I decided to write my own neural network based tool for star removal. You can download the tool here: https://github.com/code2k13/starreduction

Here are some more samples generated by my tool:
Original and starless image of area around WR 102
Original and starless image of Blue Horsehead Nebula

These days I have found a new hobby for myself: post-processing astronomical images. I have always been fond of astronomy. In fact, I even own a small telescope, but I don’t have access to open areas or the time to do anything with an actual telescope. So I subscribed to https://telescope.live, where I can choose which object I want to observe and get high quality raw data from telescopes. Post-processing refers to the activity of turning ‘almost black’ looking raw telescope images into colorful images like the ones above.

There are many tools available for post-processing; my favourites are GIMP and G’MIC (GREYC’s Magic for Image Computing), because they are open source and have lots of great features.

One big challenge I faced when writing my star removal tool was the lack of training data. I simply used two freely available images, one for the background and one for the star mask, to create hundreds of fake training images. I am still working on improving my training data generation logic and training process, but the results look promising even with what I have today. Here is an image showing a couple of training data images:

Samples from training data

The ‘Expected’ images were generated by superimposing random crops from a starless image and a star mask (with transparency) on each other. The training code is available as a Python notebook on Kaggle: https://www.kaggle.com/finalepoch/star-removal-from-astronomical-images-with-pix2pix
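As a rough illustration of that idea (this is not the actual notebook code, and the file names are placeholders), one synthetic training pair could be generated like this:

import random
from PIL import Image

# A starless background and a star mask with transparency (placeholder file names)
background = Image.open("starless_background.png").convert("RGB")
star_mask = Image.open("star_mask.png").convert("RGBA")

CROP = 256
x = random.randint(0, background.width - CROP)
y = random.randint(0, background.height - CROP)
sx = random.randint(0, star_mask.width - CROP)
sy = random.randint(0, star_mask.height - CROP)

# The 'expected' image is a random crop of the starless background
target = background.crop((x, y, x + CROP, y + CROP))

# The corresponding input image is the same crop with a random crop of the star mask composited over it
with_stars = target.convert("RGBA")
with_stars.alpha_composite(star_mask.crop((sx, sy, sx + CROP, sy + CROP)))

target.save("expected_0001.png")
with_stars.convert("RGB").save("input_0001.png")

Repeating this with different random crops yields as many image pairs as needed for training the pix2pix model.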

I have also been able to dockerize my tool, so you should be able to run it on any platform supporting docker. You can run the tool using the below command:

docker run -v $PWD:/usr/src/app/starreduction/data \
-it code2k13/starreduction \
/bin/bash -c "./removestars.py ./data/example.jpg ./data/example_starless.jpg"

$PWD refers to your current working directory. In the above example it is assumed that the file example.jpg resides in your current working directory. This directory is mounted as a volume at the path /usr/src/app/starreduction/data inside the docker container. The output image example_starless.jpg will also be written to the same directory.

If you are into astrophotography, do check out my tool and share your feedback. The source code of the application, the training script and the model weights are available for free on GitHub.

Visualising languages of tweets in real time using nlphose and C3.js

image of visualization
In this article I will show you how to perform language identification on tweets, stream the results to a webpage and display a real time visualization. We will use my open source project nlphose and C3.js to create this visualization in minutes, without writing any Python code!

To run this example you need the following software:

  • Docker
  • ngrok (optional, not required if your OS has GUI)
  • Internet Browser

Starting the nlphose docker container

Run the below command at a shell/command prompt. It pulls the latest nlphose docker image from Docker Hub. After running the command, it should start ‘bash’ inside the container.

docker run --rm -it -p 3000:3000 code2k13/nlphose:latest

Running the nlphose pipeline inside the container

Copy and paste the below command at the container’s shell prompt. It will start the nlphose pipeline:

twint -s "netflix" |\
./twint2json.py |\
./lang.py |\
jq -c '[.id,.lang]' |\
./ws.js

The above command collects tweets from Twitter containing the term "netflix" using the ‘twint’ command and performs language identification on them. Then it streams the output using a socket.io server. For more details, please refer to the wiki of my project. You can also create these commands graphically, as shown below, using the NlpHose Pipeline Builder tool.

Exposing local port 3000 on the internet (optional)

If you are running this pipeline on a headless server (no browser), you can expose port 3000 of your host machine over the internet using ngrok:

./ngrok http 3000

Running the demo

Download this HTML file from my GitHub repo. Edit the file and update the following line in it with your ngrok URL:

var endpointUrl = "https://your_ngrok_url"

If you are not using ngrok and have a browser installed on the system running the docker container, simply change the line to:

var endpointUrl = "http://localhost:3000"

You will need to serve this file from a local webserver (for example http-server, or python -m http.server 8080). Once you run it, you should see a webpage like the one shown below:

image of visualization

That’s it!! Hope you enjoyed this article!

Create NLP pipelines using drag and drop

✨Check out the live demo here!

Recently I completed work on a tool called nlphoseGUI that allows the creation of complex NLP pipelines visually, without writing a single line of code! It uses Blockly to enable creation of NLP pipelines using drag and drop.

Currently the following operations are supported:

  • Sentiment Analysis (AFINN)
  • NER (Spacy)
  • Language Identification (FastText)
  • Chunking (NLTK)
  • Sentiment Analysis (Transformers)
  • Question Answering (Transformers)
  • Zero shot Classification (Transformers)

The tool generates a nlphose command that can be executed in a docker container to run the pipeline. These pipelines can process streaming text like tweets, or static data like files. They can be executed just like a normal shell command using nlphose. Let me show you what I mean!

Below is a pipeline that searches Twitter for tweets containing ‘netflix’ and performs named entity recognition and sentiment analysis on them.

It generates a nlphose command which looks like this:

twint -s netflix |\
./twint2json.py |\
./entity.py |\
./senti.py

When the above pipeline is run using nlphose, you can expect to see a stream of JSON output similar to the one shown below:

....
{
  "id": "6a5fe972-e2e6-11eb-9efa-42b45ace4426",
  "text": "Wickham were returned, and to lament over his absence from the Netherfield ball. He joined them on their entering the town, and attended them to their aunt’s where his regret and vexation, and the concern of everybody, was well talked over. To Elizabeth, however, he voluntarily acknowledged that the necessity of his absence _had_ been self-imposed.",
  "afinn_score": -1.0,
  "entities": [
    {
      "label": "PERSON",
      "entity": "Wickham"
    },
    {
      "label": "ORG",
      "entity": "Netherfield"
    },
    {
      "label": "PERSON",
      "entity": "Elizabeth"
    }
  ]
}
...

Let’s try out something more: the below pipeline searches for tweets containing the word ‘rainfall’ and then finds the location where it rained using ‘extractive question answering’. It also filters out answers with lower scores.

Here is the nlphose command it generates:

twint -s rainfall |\
./twint2json.py |\
./xformer.py --pipeline question-answering --param 'where did it rain' |\
jq 'if (.xfrmr_question_answering.score) > 0.80 then . else empty end'

It is also possible to create a pipeline that processes multiple files from a folder:

The above pipeline generates this command:

./files2json.py -n 3 data/*.txt |\
./xformer.py --pipeline question-answering --param 'who gave the speech ?' |\
jq 'if (.xfrmr_question_answering.score) > 0.80 then . else empty end'

Play with the tool here: https://ashishware.com/static/nlphose.html

Here is the link to the project’s git repository: https://github.com/code2k13/nlphoseGUI

Here is a YouTube link of the tool in action:

Don’t forget to check out the repository of the companion project nlphose: https://github.com/code2k13/nlphose

Nlphose - Command Line Tools For NLP

A few months ago I started work on my open source project called nlphose. This project attempts to create command line utilities which can be piped together to perform various NLP tasks. When combined with a streaming source of text, such as twint or logs, you can create a complex pipeline that performs various tasks on the text, like sentiment detection, entity resolution, language identification, chunking and so on.

The project builds on the basic shell concept of using pipes to process data and feed the output of one task to another. The command line scripts that make up the project expect single-line JSON as input, containing the text to be processed along with the output from earlier stages of the pipeline. Every command line script simply enriches this JSON by adding new attributes. The scripts themselves are simple Python programs that use various NLP libraries, hence the whole system is easily extensible.
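To illustrate that extensibility, here is a hypothetical custom script written in the same style (it is not part of nlphose): it reads line-delimited JSON from stdin, adds one new attribute, and writes the enriched JSON back to stdout so it can be piped to the next tool.

#!/usr/bin/env python3
import sys
import json

# Enrich every incoming document with a 'word_count' attribute
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    doc = json.loads(line)
    doc["word_count"] = len(doc.get("text", "").split())
    sys.stdout.write(json.dumps(doc) + "\n")
    sys.stdout.flush()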

Here are some examples of what can be achieved

Get works of art (TV show or movie names) in positive tweets containing the term netflix:

twint -s netflix | ./twint2json.py | ./senti.py | ./entity.py | jq 'if (.afinn_score) > 5 then .entities|.[]| select(.label == "WORK_OF_ART") | .entity    else empty  end'

Get the tweet and the people mentioned in positive tweets about premierleague:

twint -s premierleague | ./twint2json.py | ./senti.py | ./entity.py | jq ' if (.afinn_score) > 5 then . as $parent | .entities|.[]| select((.label == "PERSON") and .entity != "Netflix") | [$parent.text,.entity]     else empty  end'

There is also a way to monitor the speed of processing using a tool called ‘pv’. You can see the number of tweets incoming and being processed per second, as shown below:

For more details please visit the project’s github page here: https://github.com/code2k13/nlphose

My first game using Phaser 3

Check out the live demo here: https://ashishware.com/static/game.html

Writing a game can be a very fun task if you are using a framework like Phaser (https://www.phaser.io/phaser3). I came across Phaser when I was looking for JavaScript libraries for game development. What I liked about Phaser is that it is not just a physics engine, but has many features which make creating games in JavaScript a breeze.

Another great discovery was KenneyNL’s website, which has many game assets released under the CC0 1.0 Universal license. The site includes hundreds of sprites, sounds and background images.

I also managed to create the game as a PWA (Progressive Web Application), which means that you can install it on most mobile devices:


Here is the link to project’s github page: https://github.com/code2k13/Jump2k21Game

Build Your Own Real Time Twitter Analytics System

This video shows how to create a real time Twitter analytics system without writing any code.

Watch the above video to learn how you can create your very own real time Twitter analytics system in minutes, along with powerful custom dashboards to monitor the data.

Given below is a screenshot of a simple dashboard which displays the real time count of tweets about Netflix and Amazon Prime, side by side.
chart

The following tools were used for the above project:

  • twint
  • Twitter’s streaming API
  • Elastic Search
  • Logstash
  • Kibana