A few months ago I started working on my open source project, nlphose. The project provides command line utilities that can be piped together to perform various NLP tasks. When combined with a streaming source of text, such as twint or log files, you can build a complex pipeline that performs tasks like sentiment detection, entity resolution, language identification, and chunking on the incoming strings.
The project builds on the basic shell concept of piping the output of one task into the next. Each command line script expects single-line JSON as input, containing the text to be processed along with the output of earlier stages in the pipeline, and simply enriches this JSON by adding new attributes. The scripts themselves are simple Python programs that use various NLP libraries, so the whole system is easily extensible.
Here are some examples of what can be achieved:
Get works of art (TV shows or movie names) mentioned in positive tweets containing the term netflix:
twint -s netflix | ./twint2json.py | ./senti.py | ./entity.py | jq 'if (.afinn_score) > 5 then .entities|.[]| select(.label == "WORK_OF_ART") | .entity else empty end'
Get the tweet text and the people mentioned in positive tweets about premierleague:
twint -s premierleague | ./twint2json.py | ./senti.py | ./entity.py | jq ' if (.afinn_score) > 5 then . as $parent | .entities|.[]| select((.label == "PERSON") and .entity != "Netflix") | [$parent.text,.entity] else empty end'
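The jq filter above reads as: keep records whose afinn_score exceeds 5, then emit each PERSON entity (other than "Netflix") paired with the tweet text. A rough Python equivalent of that selection, run on a hypothetical enriched record that uses the field names from these examples, looks like this:

```python
import json

def select_people(record):
    # Mimic the jq filter: keep PERSON entities (other than 'Netflix')
    # from records whose afinn_score exceeds 5, paired with the tweet text.
    if record.get("afinn_score", 0) <= 5:
        return []
    return [[record["text"], e["entity"]]
            for e in record.get("entities", [])
            if e["label"] == "PERSON" and e["entity"] != "Netflix"]

# Hypothetical enriched record, with afinn_score as added by senti.py
# and entities as added by entity.py in the pipelines above.
record = json.loads('{"text": "What a finish by Salah!", "afinn_score": 6, '
                    '"entities": [{"label": "PERSON", "entity": "Salah"}, '
                    '{"label": "ORG", "entity": "Liverpool"}]}')
print(select_people(record))  # [['What a finish by Salah!', 'Salah']]
```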
You can also monitor the speed of processing using a tool called 'pv', which shows the number of tweets coming in and being processed per second.
For more details, please visit the project's GitHub page: https://github.com/code2k13/nlphose