So, I have been working on this project for a while, mostly dragging it around with me. The basic idea is to data-mine unique named entities from text-based files. The initial plan (made back in 2012) involved using the Stanford NLP Parser's Named Entity Recognizer to mine data from a large number of files. Instead, I built a simple dictionary-lookup NER which matches text tokens/words against a dictionary to find unique named entities.
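To make the dictionary-lookup idea concrete, here is a minimal sketch in Python (used for brevity; it is not the actual Nexus NER code). The dictionary contents, the entity types and the function name `find_entities` are all made up for illustration, and the tokenizer is a deliberately naive regular expression.

```python
import re

# Hypothetical dictionary: lowercased surface form -> entity type.
ENTITY_DICTIONARY = {
    "stanford": "ORGANIZATION",
    "neo4j": "SOFTWARE",
    "new york": "LOCATION",
}

def find_entities(text, dictionary=ENTITY_DICTIONARY, max_words=3):
    """Return the set of unique dictionary entries found in the text."""
    # Naive tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    found = set()
    for start in range(len(tokens)):
        # Try the longest phrase first so "new york" is tested before "new".
        for length in range(max_words, 0, -1):
            phrase = " ".join(tokens[start:start + length])
            if phrase in dictionary:
                found.add(phrase)
                break
    return found
```

Calling `find_entities("I moved Neo4j data to New York, near Stanford.")` returns the three dictionary entries, each only once no matter how often it appears in the text.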
These named entities are then considered to be related to each other because they occur in the same file. (Initial plans included a degree measure defining the degree of connection as 3 for being in the same sentence, 2 for being in the same paragraph and 1 for being in the same file.) This interconnection is then visualized as an interactive graph (stored as CSV files), where each node is a named entity and the edges are colored and thickened according to how often the node pair co-occurs across files.
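The degree measure from the initial plan could be computed roughly as follows. This is only a sketch under simplifying assumptions: paragraphs are separated by blank lines, sentences end in `.`, `!` or `?`, and an entity counts as present when its name appears as a case-insensitive substring. The function name `pair_degrees` is hypothetical.

```python
import re
from itertools import combinations

def pair_degrees(text, entities):
    """Map each co-occurring entity pair to 3 (sentence), 2 (paragraph) or 1 (file)."""
    degrees = {}

    def present(chunk):
        # Assumption: naive case-insensitive substring matching.
        lowered = chunk.lower()
        return sorted(e for e in entities if e.lower() in lowered)

    def record(chunk, degree):
        # Keep the strongest degree seen so far for every pair in this chunk.
        for pair in combinations(present(chunk), 2):
            degrees[pair] = max(degrees.get(pair, 0), degree)

    record(text, 1)  # same file
    for paragraph in re.split(r"\n\s*\n", text):
        record(paragraph, 2)  # same paragraph
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            record(sentence, 3)  # same sentence
    return degrees
```

Each pair is stored as a tuple in alphabetical order, so `("Alice", "Bob")` and `("Bob", "Alice")` never end up as two separate edges.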
In this interactive graph, clicking on a named entity's node pulls up a list of files where it occurs, and even a collated page of all the paragraphs it appears in from those files. The edges would also be interactive, so that clicking one does the same (producing a list of files/file paths and a collated content page) for wherever the two named entities (nodes) co-occur.
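Both kinds of click could be served by two lookup tables built once after mining. The sketch below assumes the miner's output is available as a mapping from file path to the set of entities found in that file; the name `build_indexes` and the input shape are illustrative assumptions, not the project's real data model.

```python
from collections import defaultdict
from itertools import combinations

def build_indexes(entities_by_file):
    """entities_by_file: dict of file path -> set of entity names found in it."""
    files_by_entity = defaultdict(set)  # node click: entity -> files
    files_by_pair = defaultdict(set)    # edge click: (entity, entity) -> files
    for path, entities in entities_by_file.items():
        for entity in entities:
            files_by_entity[entity].add(path)
        # Sort so every pair has one canonical ordering.
        for pair in combinations(sorted(entities), 2):
            files_by_pair[pair].add(path)
    return files_by_entity, files_by_pair
```

With these tables, the number of files in `files_by_pair[pair]` can double as the edge frequency used for coloring and thickness.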
Build 1 (2012) [Visualization Only]
The initial project was scrapped due to our academic time constraints, and I ended up building a CSV search and visualizer tool using JUNG 2.0. This would search for a root node (named entity) and draw a graph of its corresponding named entities. I called it Nexus Grapher [Original Post Here].
Build 2 (2013-14) [Visualization Only]
This project (Nexus Grapher 2.0) was overhauled a little bit to showcase the entire nexus of interconnected named entities, with three color schemes and uniquely shaped nodes to represent each data type.
Build 3 (2015-16) [NER Only]
In 2015 I found some time, between preparing for the GRE and TOEFL and running around filling out applications in the first half of the year, and then juggling academics and assignments, to actually sit down and build a small version of the NER. This is the most current version of the NER.
Complete Design for 2016 [wip]
Here is a raggedy sketch of the entire Nexus NER System.
The entire system can be broken into four major components:
- A Named Entity Recognizer (Miner), which is supported by a spell checker.
- An interactive Visualization Engine/Component in D3 or JUNG or VI-JUNG.
- A Query and Nexus manipulation component in Java, responsible for searching for sub-graphs as per queries and relationships and performing manipulations on them.
- A graph database, Neo4j, to hold all mined data in an ever-growing knowledge base, which can also be written to and read from files as CSV and JSON.
Each of these components can be altered or swapped for an alternative.
Many components could be replaced with better, well-tested libraries to improve the entire Nexus NER Grapher system. Some ideas are:
- Using Apache Tika (and Stanford NLP NER) to mine data.
- Using Cypher, the graph query language used by Neo4j, for all graph manipulations.
- Adding support for GraphML and other graph representation formats.
- Adding a web crawler to mine data from the web, instead of mining only offline/on-device data files.
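As a rough idea of what GraphML support might involve, the sketch below serializes weighted entity pairs using only Python's standard library. The function name `to_graphml`, the graph id `nexus`, and the choice of a `label` key on nodes and a `weight` key on edges are illustrative assumptions; the element names and namespace follow the GraphML format itself.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(edge_weights):
    """edge_weights: dict of (source name, target name) -> integer weight."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the data attached to nodes (label) and edges (weight).
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", id="nexus", edgedefault="undirected")
    names = sorted({name for pair in edge_weights for name in pair})
    # Generated ids (n0, n1, ...) avoid spaces or symbols from entity names.
    ids = {name: f"n{i}" for i, name in enumerate(names)}
    for name in names:
        node = ET.SubElement(graph, "node", id=ids[name])
        ET.SubElement(node, "data", key="label").text = name
    for (source, target), weight in sorted(edge_weights.items()):
        edge = ET.SubElement(graph, "edge", source=ids[source], target=ids[target])
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")
```

For example, `to_graphml({("Alice", "Bob"): 2})` yields a document with two nodes and one undirected edge carrying a weight of 2.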