Ardis AI’s NER tool works by detecting the entities and their relationships in a text passage, and by comparing those entities to nodes in a knowledge graph. This comparison process lets the tool discover similar entities that it has encountered before. The tool uses those similar entities to help determine which label to apply to an entity in the text.
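The comparison step above amounts to a similarity lookup against labeled graph nodes. The sketch below illustrates one simple way such a lookup could work — a nearest-neighbor vote over node embeddings. All names, vectors, and the voting rule are illustrative assumptions, not Ardis AI's actual API or algorithm.

```python
# Illustrative nearest-neighbor entity labeling, assuming each knowledge-graph
# node carries a vector embedding and a label. Hypothetical data and function
# names; not Ardis AI's real implementation.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def label_entity(entity_vec, graph_nodes, k=3):
    """Label an entity by majority vote among its k most similar graph nodes."""
    ranked = sorted(graph_nodes, key=lambda n: cosine(entity_vec, n["vec"]), reverse=True)
    votes = {}
    for node in ranked[:k]:
        votes[node["label"]] = votes.get(node["label"], 0) + 1
    return max(votes, key=votes.get)

nodes = [
    {"name": "Paris",   "vec": [0.9, 0.1], "label": "location"},
    {"name": "Lyon",    "vec": [0.8, 0.2], "label": "location"},
    {"name": "Renault", "vec": [0.1, 0.9], "label": "organization"},
]
print(label_entity([0.85, 0.15], nodes))  # → location
```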
The NER tool comes with two pre-built labeling schemes: a "Classic NER" scheme with 11 categories, and a "Detailed" scheme with 97 categories. Our Classic NER scheme uses the following categories:
person, organization, location, geopolitical, facility, event, product, law, art, NORP (nationalities, religious, or political groups), and miscellaneous (which includes languages). For the categories other than miscellaneous, our category definitions closely follow those used by OntoNotes. Our Detailed scheme is partly inspired by the 112-category FIGER (Fine-Grained Entity Recognition) scheme, but we omit many of its labels, and the taxonomy produced by the process described below differs in several ways from FIGER's.
Users aren't limited to the two pre-built labeling schemes: on our customization dashboard, users have complete flexibility to define their own entity categories by building a label set and applying labels to entities from our data set. Ardis AI's customization tool strategically selects entities for users to label in order to maximize the information it gains from each label applied. As a result, it's often possible to customize an NER model by labeling as few as 3-5 entities per category. How much labeling a user needs to do to achieve good results depends on two factors: the complexity of the labeling scheme, and the extent to which the custom category boundaries agree with the entity similarities detected by the process described below.
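One common way to select maximally informative examples is to prefer entities the model is least sure about. The sketch below uses a margin-of-confidence heuristic as a stand-in; Ardis AI's actual selection criterion is not described here, and the data structures are hypothetical.

```python
# Illustrative information-driven example selection, assuming the model exposes
# per-label scores for each unlabeled entity. The margin heuristic (smallest gap
# between the top two scores) is an assumption, not Ardis AI's actual method.
def pick_entity_to_label(candidates):
    """Pick the entity whose top two label scores are closest (most ambiguous)."""
    def margin(entity):
        top_two = sorted(entity["scores"].values(), reverse=True)[:2]
        return top_two[0] - top_two[1]
    return min(candidates, key=margin)

candidates = [
    {"name": "Amazon", "scores": {"organization": 0.51, "location": 0.47}},
    {"name": "Berlin", "scores": {"location": 0.95, "organization": 0.03}},
]
print(pick_entity_to_label(candidates)["name"])  # → Amazon
```

Labeling the ambiguous "Amazon" (company vs. river) teaches the model more than labeling "Berlin", which it already classifies confidently.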
The knowledge graph for the entity classification tool on our website was generated by an automated pipeline in which Ardis AI software processed Wikipedia articles and CNN articles (from the CNN/Daily Mail corpus). For each article, our software built a simulation representing the entities, relationships, and events discussed. Then, it transformed this simulation into a graph encoding both the relationships mentioned in the text and those that resulted from events described in the text. The software then merged the graphs from each article's simulation into a knowledge graph for the entire corpus.
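The final merge step can be pictured as taking the union of per-article edge sets. The minimal sketch below assumes entities with the same name refer to the same node; a real pipeline needs entity resolution, which this illustration skips, and all names are hypothetical.

```python
# Minimal sketch of merging per-article relationship graphs into one corpus
# graph. Each graph maps an entity to a set of (relation, target) edges.
# Assumes identical names mean identical entities (no entity resolution).
def merge_graphs(article_graphs):
    """Union per-article edge sets into a single knowledge graph."""
    corpus = {}
    for graph in article_graphs:
        for entity, edges in graph.items():
            corpus.setdefault(entity, set()).update(edges)
    return corpus

g1 = {"Marie Curie": {("born_in", "Warsaw")}}
g2 = {"Marie Curie": {("worked_at", "Sorbonne")},
      "Warsaw": {("located_in", "Poland")}}
merged = merge_graphs([g1, g2])
print(sorted(merged["Marie Curie"]))  # both edges now attach to one node
```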
Using this knowledge graph of entities and their relationships, our software generated graph-based embeddings for the entities (nodes) in the graph. It used those embeddings to cluster similar entities together, and then organized those clusters into a hierarchical taxonomy. That taxonomy is what lets a non-technical user quickly and easily define a custom NER scheme by labeling entities on our customization dashboard.
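Turning clusters into a hierarchy is typically done with agglomerative clustering: repeatedly merge the two closest clusters until one tree remains. The sketch below uses centroid linkage with Euclidean distance purely for illustration; Ardis AI's actual embedding and clustering methods are not specified, and the entity vectors are made up.

```python
# Sketch of building a cluster hierarchy from entity embeddings via simple
# agglomerative clustering (centroid linkage, Euclidean distance). Hypothetical
# data; not Ardis AI's actual embedding or clustering method.
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(items):
    """Repeatedly merge the two closest clusters; return the nested tree."""
    # Each cluster is a (tree, centroid) pair; start with singletons.
    clusters = [(name, vec) for name, vec in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (t1, v1), (t2, v2) = clusters[i], clusters[j]
        centroid = [(x + y) / 2 for x, y in zip(v1, v2)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((t1, t2), centroid))
    return clusters[0][0]

items = [("Paris", [0.0, 0.0]), ("Lyon", [0.1, 0.0]), ("Renault", [5.0, 5.0])]
tree = agglomerate(items)
print(tree)  # the two nearby "location" entities merge before "Renault" joins
```

The nested tuples form the taxonomy: nearby entities (here "Paris" and "Lyon") merge into a subtree first, so labeling one member of a subtree gives evidence about the whole subtree.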