I was having trouble getting Polyglot (an NLP library) to run on my laptop, but I eventually got it working in Docker. I’m going to put some notes here, in case anyone else wants to experiment with it.
The difficult part was getting the ICU dependencies to work [1]. I switched from a regular Python image to Anaconda Python (see the Dockerfile below), and that cleared up the dependency problems.
Code
Below are the contents of src/Dockerfile. (There’s an unfinished Flask app in my project, which is why it’s exposing port 4444. I didn’t want it to clash with another Flask app running on port 5000.)
FROM continuumio/anaconda3
RUN apt-get update && apt-get install -qq -y \
    build-essential libpq-dev vim --no-install-recommends
ENV HOST 0.0.0.0
ENV DEBUG true
ENV PORT 4444
ENV INSTALL_PATH /app
RUN mkdir -p $INSTALL_PATH
WORKDIR $INSTALL_PATH
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY . .
# Install the models
# RUN polyglot download embeddings2.en transliteration2.ar
EXPOSE 4444
CMD ["gunicorn", "--bind", "0.0.0.0:4444", "--workers", "3", "app:app"]
This is src/requirements.txt:
# Data
numpy
pycld2
morfessor
pyicu
polyglot
# Flask
flask
Flask-Cors
gunicorn
And in the root of the project is a docker-compose.yml file:
version: '3'
services:
  py:
    build: "./src"
    ports:
      - "4444:4444"
    volumes:
      - ./src:/app
Then you can start it with:
$ docker-compose up --build
Find the container ID with:
$ docker container ps
Then to enter the container:
$ docker container exec -it <container_id> bash
From there you can start Python:
(base) root@9a34ddbc7255:/app# python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
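Before downloading anything, it can be worth checking that the ICU-related modules are actually present in the container. The following is a minimal sketch, not something from my project: the helper name missing_modules is made up for illustration, and the list uses import names rather than pip package names (pyicu installs a module called icu). It only checks that each module can be located, so a module that is found could still fail when it is really imported.

```python
import importlib.util

# Import names, not pip package names: the "pyicu" package provides "icu".
REQUIRED_MODULES = ["icu", "pycld2", "morfessor", "polyglot"]


def missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Modules not found:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If icu shows up in the "not found" list, that points at the same problem as the ModuleNotFoundError mentioned in the footnote.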
Get the models (in this case “th” for Thai):
>>> from polyglot.downloader import downloader
>>> downloader.download("transliteration2.th", quiet=True)
True
>>>
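I think it’s possible to download those from the Dockerfile (the commented-out RUN line above was a first attempt), but I haven’t set that up yet. The same downloader is also available on the command line. For example, to download sentiment analysis data for English, do: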
$ polyglot download sentiment2.en
I’m still having a problem downloading the Polish transliteration model, but I don’t need it for now. Downloading the whole transliteration2 task stops with this error:
>>> from polyglot.downloader import downloader
>>> downloader.download("TASK:transliteration2", quiet=True)
[polyglot_data] Error downloading 'transliteration2.pl' from <http://p
[polyglot_data] olyglot.cs.stonybrook.edu/~polyglot/transliteratio
[polyglot_data] n2/pl/transliteration.pl.tar.bz2>: HTTP Error
[polyglot_data] 403: Forbidden
False
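One way around a single failing package might be to download packages one at a time instead of the whole task, and to collect the names that fail. This is an untested sketch under a few assumptions: download_all, polyglot_download and the PACKAGES list are hypothetical names I've made up here, and the only Polyglot call used is downloader.download(name, quiet=True), which returns True or False as in the sessions above. The download function is passed in as a parameter so the loop can be tried without Polyglot installed.

```python
import sys

# Hypothetical list; the names follow the "<task>.<language>" pattern used above.
PACKAGES = ["transliteration2.th", "sentiment2.en"]


def download_all(packages, download):
    """Call download(name) for each package and return the names that failed.

    A package counts as failed if download returns a falsy value or raises.
    """
    failed = []
    for name in packages:
        try:
            ok = download(name)
        except Exception as exc:  # keep going so one bad package doesn't stop the rest
            print(f"{name}: {exc}", file=sys.stderr)
            ok = False
        if not ok:
            failed.append(name)
    return failed


def polyglot_download(name):
    """Download one package with Polyglot's downloader (imported lazily)."""
    from polyglot.downloader import downloader
    return downloader.download(name, quiet=True)
```

Inside the container, download_all(PACKAGES, polyglot_download) should return an empty list when everything downloads. A script that exits with a non-zero status when the list is not empty could probably be run from a RUN line in the Dockerfile, but I haven't tried that in a build.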
[1] To help people find this page in search engines, some of the ICU-related error messages included the following text:
KeyError: 'ICU_VERSION'
and
ModuleNotFoundError: No module named 'icu'