Notes for May 5

Today we will work with text data that has been formatted into a tree structure (inspired by this online tool). I ran a simple script to break each document into sentences, then fed each sentence into a tokenizer to get individual words. From the tokenized sentences, I built a tree: every sentence starts at the same root node and branches off based on which word comes next, so sentences that begin with the same words share the same path until they diverge. We'll start with song lyrics for this example because they often have a repetitive structure that's easy to observe.
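To make the preprocessing concrete, here is a minimal sketch of the sentence-to-tree step. The script and tokenizer actually used are not shown in these notes, so `splitSentences`, `tokenize`, `buildWordTree`, and the node shape (`word`, `count`, `children`) are all hypothetical stand-ins, and the splitting rules are deliberately naive.

```javascript
// Hypothetical helpers: the real script and tokenizer are not shown in these notes.

// Naive sentence splitter: break after ., !, or ? followed by whitespace, or at newlines.
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+|\n+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Naive tokenizer: lowercase, then keep runs of letters, digits, and apostrophes.
function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Assumed node shape: the word, how many sentences pass through it, and its children.
function makeNode(word) {
  return { word, count: 0, children: new Map() };
}

// Insert every sentence at the root, reusing existing nodes for shared leading words.
function buildWordTree(sentences) {
  const root = makeNode(null);
  for (const sentence of sentences) {
    let node = root;
    node.count += 1;
    for (const word of tokenize(sentence)) {
      if (!node.children.has(word)) {
        node.children.set(word, makeNode(word));
      }
      node = node.children.get(word);
      node.count += 1;
    }
  }
  return root;
}

const lyrics = "Row row row your boat\nRow row your boat again";
const tree = buildWordTree(splitSentences(lyrics));
```

Both lines begin with "row row", so they share two nodes before branching into "row" and "your". Repetitive lyrics produce many shared prefixes like this, which is what makes the tree compact.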

This is a general trend in text visualization -- while we can just show plain words, we often want to post-process the text to observe deeper structures that may not be obvious in a simple read-through of a larger document.

To take the tree data structure and create a tree visualization, we make use of functions in the d3-hierarchy package.
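By default, the `hierarchy` function in d3-hierarchy reads each node's children from a `children` array on the input data. The sketch below reshapes a word tree into that nested form. It assumes a hypothetical node shape of `{ word, count, children: Map }`, and `toHierarchyData` is an illustrative name, not part of d3.

```javascript
// Hypothetical input shape: { word, count, children: Map<string, node> }.
// Output shape: { name, count, children: [...] }, with `children` omitted on leaves.
function toHierarchyData(node) {
  const datum = {
    name: node.word === null ? "(root)" : node.word,
    count: node.count,
  };
  const kids = [...node.children.values()].map(toHierarchyData);
  if (kids.length > 0) {
    datum.children = kids;
  }
  return datum;
}

// Small hand-built tree for the two sentences "row your boat" and "row the boat".
const leaf = (word) => ({ word, count: 1, children: new Map() });
const branch = (word, child) => ({ word, count: 1, children: new Map([[child.word, child]]) });
const sample = {
  word: null,
  count: 2,
  children: new Map([
    ["row", {
      word: "row",
      count: 2,
      children: new Map([
        ["your", branch("your", leaf("boat"))],
        ["the", branch("the", leaf("boat"))],
      ]),
    }],
  ]),
};

const data = toHierarchyData(sample);
console.log(JSON.stringify(data, null, 2));
```

The resulting object can be passed to `d3.hierarchy(data)` and then to a layout such as `d3.tree()`, which assigns `x` and `y` positions to each node.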

The final code uses the Flextree add-on library for the d3-hierarchy tree layout algorithm. We use it to overcome issues with spacing and sizing the text "nodes": Flextree lets each node have its own size, which matters when the labels are words of different lengths. The result is a simple tree layout that makes more efficient use of screen real estate.

HTML for today:
Code for today: