
This is a tool for classifying, tagging, analyzing, and visualizing news articles about internal displacement. The aim is to build a tool that populates a database with displacement events extracted from online news articles; each event is first classified by machine and then verified and analyzed by a human.

Long-Term Vision

This project has been developed with a long-term vision for creating a widely applicable tool that can parse and classify web-based articles, and extract all reports that refer to displacement events and consequences. These reports are stored in a database and can then be accessed and plotted on maps (latitude and longitude coordinates are obtained for extracted locations).

Competition Adaptation

This broader solution has been adapted to meet the specific competition submission and evaluation guidelines.

Resources

Requirements

The core scraping and processing functionality is written in Python 3.

For Natural Language Processing, we are using the spaCy library. In addition to installing the library, this requires downloading a language model (approx. 1GB of data).
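
As a rough sketch of how the model is used once downloaded (the example sentence is made up; the model name matches the en_core_web_md model mentioned below):

import spacy

# Load the downloaded English model (approx. 1GB on disk)
nlp = spacy.load('en_core_web_md')

# Parse an example headline and print the named entities spaCy finds
doc = nlp('Heavy flooding displaced 2,000 people in Sri Lanka last week.')
for ent in doc.ents:
    print(ent.text, ent.label_)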

Running in Docker

You can run everything as you're accustomed to by installing dependencies locally, but another option is to run in a Docker container. That way, all of the dependencies will be installed in a controlled, reproducible way.

  1. Install Docker: https://www.docker.com/products/overview
  2. Build the Docker container (unfortunately, this will take a while):
docker build -t internal-displacement .

or

docker-compose -f docker-compose-spacy.yml up

The spaCy version includes the en_core_web_md 1.2.1 NLP model and is multiple gigabytes in size; the image without the model is much smaller.

Either way, the first run will take some time while Docker fetches and builds all of the dependencies; subsequent runs should be much faster.

This will start up several Docker containers, running Postgres, a Jupyter notebook server, and the Node.js front end.

In the output, you should see a line like:

jupyter_1  |         http://0.0.0.0:3323/?token=536690ac0b189168b95031769a989f689838d0df1008182c

That URL will connect you to the Jupyter notebook server.

  3. Visit the Node.js server at http://localhost:3322

Note: You can stop the Docker containers using Ctrl-C.

Note: If you already have something running on port 3322 or 3323, edit docker-compose.yml and change the first number in the ports config to a free port on your system, e.g. for port 9999:

    ports:
      - "9999:3322"

Note: If you want to add Python dependencies, add them to requirements.txt and run the dev version of the docker-compose file:

docker-compose -f docker-compose-dev.yml up --build

Note: If you want to run SQL commands against the database directly, you can do so by starting a Terminal within Jupyter and running the PostgreSQL shell:

psql -h localdb -U tester id_test
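
For example, inside the psql shell you can list the database's tables and inspect a few rows (the table name below is a placeholder; substitute one listed by \dt):

\dt

SELECT * FROM article LIMIT 5;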

Note: If you want to connect to a remote database, edit the docker.env file with the DB url for your remote database.
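
A PostgreSQL connection URL generally takes the following form; the variable name, host, and credentials below are placeholders, so check docker.env for the exact variable it expects:

DB_URL=postgres://user:password@db.example.com:5432/dbname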
