Home
This is a tool for classifying, tagging, analyzing and visualizing news articles about internal displacement. The aim is to build a tool that can populate a database with displacement events from online news articles, which can be classified by a machine and then verified and analyzed by a human.
This project has been developed with a long-term vision for creating a widely applicable tool that can parse and classify web-based articles, and extract all reports that refer to displacement events and consequences. These reports are stored in a database and can then be accessed and plotted on maps (latitude and longitude coordinates are obtained for extracted locations).
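As a simple illustration of the kind of report extraction described above, the sketch below pulls displacement counts out of a sentence with regular expressions. This is illustrative only and is not the project's actual pipeline, which uses NLP via spaCy:

```python
import re

# Illustrative only: the real pipeline uses spaCy NLP, not regexes.
# Matches phrases like "displaced 2,000 people" or "500 families were displaced".
PATTERN = re.compile(
    r"(?:displaced\s+([\d,]+)\s+(?:people|persons|families))"
    r"|(?:([\d,]+)\s+(?:people|persons|families)\s+(?:were\s+)?displaced)",
    re.IGNORECASE,
)

def extract_counts(text):
    """Return displacement counts mentioned in the text as integers."""
    counts = []
    for match in PATTERN.finditer(text):
        number = match.group(1) or match.group(2)
        counts.append(int(number.replace(",", "")))
    return counts

print(extract_counts("Floods displaced 2,000 people in the region; "
                     "earlier, 500 families were displaced by fighting."))
# → [2000, 500]
```

A pattern-based approach like this breaks down quickly on real news prose, which is why the project relies on a proper NLP model for extraction.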
This broader solution has been adapted to meet the specific competition submission and evaluation guidelines.
- Methodology
- Tutorials
- Data Model Overview
- Maintenance, Development & Enhancements
- Future Work
- Link to IDMC Challenge Page
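As context for the Data Model Overview page: the central object stored in the database is a report extracted from an article. A minimal sketch of such a record follows; the field names here are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Report:
    """One displacement report extracted from a news article.

    Field names are illustrative assumptions, not the project's schema.
    """
    article_url: str
    location_name: str
    latitude: Optional[float] = None   # geocoded from location_name
    longitude: Optional[float] = None  # geocoded from location_name
    quantity: Optional[int] = None     # number of units affected
    reporting_unit: str = "people"     # e.g. "people" or "households"
    verified: bool = False             # set True after human review

report = Report(
    article_url="http://example.com/article",
    location_name="Jakarta",
    quantity=2000,
)
print(report.verified)
# → False
```

Keeping the geocoded coordinates on the record is what allows extracted reports to be plotted on maps, and the `verified` flag captures the machine-classifies/human-verifies workflow described above.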
The core scraping and processing functionality is written in Python 3.
For Natural Language Processing, we use the spaCy library. In addition to installing the library, this requires downloading a language model (approx. 1 GB of data).
You can run everything as you're accustomed to by installing dependencies locally, but another option is to run in a Docker container. That way, all of the dependencies will be installed in a controlled, reproducible way.
- Install Docker: https://www.docker.com/products/overview
- Build the Docker image (unfortunately, this will take a while):

  ```
  docker build -t internal-displacement .
  ```

  or

  ```
  docker-compose -f docker-compose-spacy.yml up
  ```
The spacy version includes the en_core_web_md 1.2.1 NLP model and is multiple gigabytes in size; the image without the model is much smaller.
Either way, the first run will take some time, since it fetches and builds all of the dependencies. Subsequent runs should be much faster.
This will start up several docker containers, running postgres, a Jupyter notebook server, and the node.js front end.
In the output, you should see a line like:

```
jupyter_1 | http://0.0.0.0:3323/?token=536690ac0b189168b95031769a989f689838d0df1008182c
```
That URL will connect you to the Jupyter notebook server.
- Visit the node.js server at http://localhost:3322
Note: You can stop the docker containers using Ctrl-C.
Note: If you already have something running on port 3322 or 3323, edit docker-compose.yml and change the first number in the ports config to a free port on your system. For example, to use port 9999:

```yaml
ports:
  - "9999:3322"
```
Note: If you want to add Python dependencies, add them to requirements.txt and run the jupyter-dev version of the docker-compose file:

```
docker-compose -f docker-compose-dev.yml up --build
```
Note: If you want to run SQL commands against the database directly, you can do so by starting a Terminal within Jupyter and running the PostgreSQL shell:

```
psql -h localdb -U tester id_test
```
Note: If you want to connect to a remote database, edit the docker.env file with the DB URL for your remote database.
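For example, an entry in docker.env might look like the following. The variable name and URL format here are assumptions; check the existing docker.env for the exact key it uses:

```
DB_URL=postgresql://user:password@db.example.com:5432/dbname
```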