Named Entity Recognition for Idiosyncratic Web Collections

An outline of the paper: Effective Named Entity Recognition for Idiosyncratic Web Collections, by Roman Prokofyev et al. (WWW 2014)

Named Entity Recognition (NER) plays an important role in a variety of information management tasks on the Web, including text categorization, document clustering, and faceted search.

In this paper, we developed a system that identifies entities such as original technical concepts in scientific documents. The general system pipeline is shown in the figure below.

The system was evaluated on two test collections built from Computer Science and Physics papers. In our experiments it reached ~0.81 precision and ~0.87 recall on named entity detection, compared to 0.65 precision and 0.72 recall for state-of-the-art Maximum Entropy methods.
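For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall can be computed from a set of predicted entities and a set of gold-standard entities. Exact string matching and the example phrases are assumptions made for illustration; the matching criterion used in the paper's evaluation is not described in this outline.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over two collections of entity strings.

    Assumption: an entity counts as correct only on an exact string match.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not taken from the paper's test collections.
predicted_entities = {"topic model", "latent dirichlet allocation", "this paper"}
gold_entities = {
    "topic model",
    "latent dirichlet allocation",
    "gibbs sampling",
    "variational inference",
}

precision, recall = precision_recall(predicted_entities, gold_entities)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision penalizes spurious detections ("this paper" above), while recall penalizes missed entities.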

The key properties of our system that we believe allowed it to outperform state-of-the-art methods are the following:

  • Candidate named entity selection based on n-gram frequency statistics and n-gram merging techniques.
  • Candidate named entity classification with features based on external academic knowledge bases such as DBLP and semi-structured knowledge bases such as DBpedia.
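
The two steps above can be illustrated with a small sketch. It is a simplified stand-in, not the implementation from the paper: the whitespace tokenizer, the `min_count` threshold, the merging rule (a shorter n-gram is absorbed by a longer one with the same frequency), and the feature names are all assumptions. The knowledge-base lookups are simulated with plain in-memory sets rather than real DBLP or DBpedia queries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def select_candidates(documents, max_n=3, min_count=2):
    """Pick frequent n-grams, then merge shorter ones into longer ones.

    Assumed merging rule: an n-gram is dropped when a longer frequent
    n-gram contains it and occurs exactly as often.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()  # naive tokenizer, for illustration only
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))

    frequent = {g: c for g, c in counts.items() if c >= min_count}

    merged = {}
    for gram, count in frequent.items():
        absorbed = any(
            len(other) > len(gram) and other_count == count and contains(other, gram)
            for other, other_count in frequent.items()
        )
        if not absorbed:
            merged[gram] = count
    return merged


def knowledge_base_features(candidate, dblp_keywords, dbpedia_labels):
    """Build hypothetical classification features for one candidate.

    `dblp_keywords` and `dbpedia_labels` are stand-in sets of phrases,
    not actual knowledge-base clients.
    """
    phrase = " ".join(candidate)
    return {
        "in_dblp": phrase in dblp_keywords,
        "in_dbpedia": phrase in dbpedia_labels,
        "length": len(candidate),
    }


# Hypothetical toy corpus.
docs = [
    "we use latent dirichlet allocation here",
    "latent dirichlet allocation is a topic model",
    "the topic model works",
]
candidates = select_candidates(docs)
features = knowledge_base_features(
    ("topic", "model"),
    dblp_keywords={"topic model"},
    dbpedia_labels={"latent dirichlet allocation"},
)
print(candidates)
print(features)
```

On this toy corpus the unigrams and bigrams inside "latent dirichlet allocation" are merged into the full trigram, leaving two candidates. A real pipeline would also need stop-word handling and a trained classifier over the features, both of which are omitted here.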

The full paper is available online on our website:
Named Entity Recognition for Idiosyncratic Web Collections.