Check out the book on CRC Press
This repository contains the official code companion to the book "Teaching Computers to Read: Effective Best Practices in Building Valuable NLP Solutions", by Dr. Rachel Wagner-Kaiser, with contributions from Tim Cerino.
Building Natural Language Processing (NLP) solutions that deliver ongoing business value is not straightforward. This book provides clarity and guidance on how to design, develop, deploy, and maintain NLP solutions that address real-world business problems.
In the book, we discuss the main challenges and pitfalls encountered when building NLP solutions. We also outline how technical choices interact with (and are impacted by) the data, tools, business goals, and the integration between human experts and the AI solution. The best practices we cover do not depend on cutting-edge modeling algorithms or the architectural flavor of the month; instead, we offer practical advice that adapts to a solution’s specific technical building blocks.
By providing best practices across the lifecycle of NLP development, this handbook helps organizations – particularly technical teams – think critically about how, when, and why to build NLP solutions, what the common challenges are, and how to address or avoid them. In doing so, they'll deliver consistent value to their stakeholders and deliver on the promise of AI and NLP.
The Code Companion builds on the content covered in "Teaching Computers to Read" (TC2R) by providing a set of exercises to help readers understand the challenges, experiments, and critical thinking that are required when working through a real-life problem with real-life (messy) data.
For more general information on the book and code companion, please see the main page here.
The use case presented in this repository addresses a common challenge: extracting key information from a population of documents at scale. The goal of the business is to understand two key pieces of information - the payment terms and the limitation of liability. To meet this goal, we need to build a set of models that not only ingest and read the documents, but also identify the correct context for these clauses and parse out a standardized answer.
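As a rough mental model, the middle step - locating the right context for a clause before standardizing it - can be pictured with simple keyword matching. The snippet below is a toy illustration of ours, not code from the notebooks; the sample text and keyword lists are made up, and the notebooks develop far more robust approaches.

import re

def find_clause_context(text, keywords):
    # Split the text into rough sentences and keep those mentioning any clause keyword.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

sample = ("Invoices are due within 30 days of receipt. "
          "Liability is capped at the total fees paid in the prior 12 months.")

print(find_clause_context(sample, ["due within", "invoice"]))   # payment terms context
print(find_clause_context(sample, ["liability", "capped"]))     # limitation of liability context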
We are presented with the following challenges:
- Extracting the data from the source documentation
- Handling the low amount of language variability in the vast majority of the documents
- Understanding the data quality, overall corpus patterns, and common language patterns of these two clauses
- Identifying a relatively small sample of data to annotate that maximizes model performance
- Building an effective, high-performing approach to consistently extract the target information
- Turning our results into a script and an easily deployable solution
The code companion has two parts: a set of Jupyter notebooks, which cover the first five points above, and the Additional Exercises, which cover the last point.
There are 6 main notebooks, which focus on the following topics:
- Data Gathering and Selection
- Data Ingestion
- Pre-processing and Exploratory Data Analysis
- Data Understanding and Annotation
- Dataset Curation
- Modeling Approaches
Each notebook includes a set of interspersed exercises for the developer, to facilitate and encourage hands-on interaction with the content.
To finish the end-to-end development of this use case, there are additional exercises beyond the ones provided in the Jupyter notebooks. The goal of the additional exercises is to ensure that data scientists have a broader understanding of what it takes to build a fully useful solution, not just a single model.
As part of these exercises, the developer will build an end-to-end script that ingests new files and produces an output that is usable and valuable for a (less technical) user.
The additional exercises are located here and walk the developer through the key steps needed to prepare a solution for an effective production deployment.
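One possible shape for such a script is sketched below. This is a minimal sketch of ours, not the interface the exercises prescribe: the command-line arguments, the assumption that inputs are plain-text files, and the CSV columns are illustrative placeholders, and extract_clauses is a stub where the models built in the notebooks would be called.

import argparse
import csv
from pathlib import Path

def extract_clauses(text):
    # Stub: the real exercises would call the trained models from the notebooks here.
    return {"payment_terms": "", "limitation_of_liability": ""}

def main():
    parser = argparse.ArgumentParser(description="Extract key clauses from a folder of documents.")
    parser.add_argument("input_dir", help="folder of new files to ingest")
    parser.add_argument("output_csv", help="path for the summary output")
    args = parser.parse_args()

    rows = []
    for path in sorted(Path(args.input_dir).glob("*.txt")):
        clauses = extract_clauses(path.read_text(errors="ignore"))
        rows.append({"file": path.name, **clauses})

    with open(args.output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "payment_terms", "limitation_of_liability"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    main()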
The first step to getting started is to clone this repository. We recommend this primer if you are new to GitHub.
Specifically, clone via the below command for this repo:
git clone https://github.com/TeachingComputersToRead/TC2R-CC-UseCase1
This code companion was built in Python 3.11, with the key required packages outlined in requirements.txt, and the detailed versions and dependencies available in requirements-detailed.txt.
To install and run the provided notebooks, we strongly recommend building a virtual environment specific to this project. A setup script is provided in the repo and can be run with ./setup.sh or bash setup.sh (adjust permissions with chmod +x setup.sh if necessary).
Alternatively, ensure pyenv and pyenv-virtualenv are installed locally and run each of the following lines from the main folder of the repo:
pyenv virtualenv tc2r_env
pyenv activate tc2r_env
pip install -r requirements.txt
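If pyenv is not available, the same isolation can be achieved with Python's built-in venv module (this assumes a python3.11 interpreter is already on your PATH):

python3.11 -m venv tc2r_env
source tc2r_env/bin/activate
pip install -r requirements.txt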
While the code companion walks the developer through the data collection, pre-processing, ingestion, and annotation steps (among others), we have also made the primary data and data-related files available for download.
The files are available here on HuggingFace🤗 and are referred to as such in the notebooks. After downloading the data folder from HuggingFace🤗, place it in the repo so that it replaces the existing data folder.
While we strongly recommend that the developer work through the steps to generate the contents of the data folder, these downloaded files can serve as a helpful sanity check on your work.
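If you prefer to fetch the data programmatically and have the huggingface_hub package installed, snapshot_download can pull the files in one call. The repository identifier below is a placeholder of ours - substitute the dataset ID from the HuggingFace🤗 link above.

from huggingface_hub import snapshot_download

# "ORG/DATASET" is a placeholder - use the dataset ID linked above.
snapshot_download(repo_id="ORG/DATASET", repo_type="dataset", local_dir="data")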
Get started working through the use case end-to-end! Don't forget to go beyond the notebooks with the additional exercises.
Rachel Wagner-Kaiser, Ph.D., has 15 years of experience in data and AI, entering the data science field after completing her Ph.D. in astronomy. She specializes in building natural language processing solutions for real-world problems constrained by limited or messy data. Rachel leads technical teams to design, build, deploy, and maintain NLP solutions, and her expertise has helped companies organize and decode their unstructured data to solve a variety of business problems and drive value through automation.
Connect on LinkedIn

