While NewsClouds strives to be as accurate as possible, it is nonetheless a representation and has some tradeoffs/inaccuracies due to technical limitations. In this paper we will analyze how NewsClouds works and what its limitations are.
NewsClouds can be broken down into a series of problems as follows:
Story Identification - What is a story? How do you group articles from different news sources into one narrative. When does an article on a tangent stop being included?
Phrase Counting - How do we count words and phrases (n-grams of words)?
Phrase Weighting - How do you proportionally weigh words for their relevance per date, relative to words from the same source?
Source Intersections - How do we calculate which phrases are used between different sources and what their weights should be?
Venn Diagram Creation - Given a set of phrases and weights per set, how do we calculate the size of each set and create a venn diagram from those constraints?
Normalized Sizing - Given that weights that are only internally consistent for each source, how do you compare them? How do you generate sizes that are ‘good for display’, i.e. preventg outliers from dominating visually.
Word Placement - Given word sizes and a venn diagram, how do you place words into the venn diagram?
Speed Optimizations - These calculations tend to be slow and costly, how can we optimize them for regular use on the web?
Each of these steps will be defined by the inputs they take and outputs they give, and the process within will be described.
Input: Story query (e.g. Charleston Shootings)
Output: Array of articles with URLS, associated source, body text and extracted entities.
Process: This question is somewhat divorced from the rest of the problem, so a seperate paper will be written about in-depth solutions for this. However, a number of different solutions were explored as discused below, the main takeaway however was that NewsClouds as a system is highly suspectible to noise.
Glue - Using the backend infrastructure of Glue, closed captions from TV were piped in.
News APIs - Using freeform search on News APIs, articles per story could be fetched but were often noisy. A number of selection techniques were tried such as entity extraction and matching, but ultimately failed to sufficiently eliminate noise.
Manual Curation - Workers from Mechancial Turk were hired to manually select stories from selected news articles related to the topic. Though costly, this solution was the most noise free and was used for most of the work shown here.
Input: A list of articles per source.
Output: A dictionary object, per source, of words with counts of the literal number of times they occur.
Process: Each article is run through a n-gram (1-3) phrase extraction algorithm called Gramophone (www.npmjs.com/package/gramophone). All phrases of 3 characters or less or with numbers in them are discarded, because the gramophone algorithm is inaccurate for those cases.
Input: An array of articles per source, each with a dictionary of phrases and counts for their occurence.
Output: For each source, an array, one for each day that is being computed, each array has an dictionary of words with weights for that day. Weights are unbounded and unnormalized between sources.
Process: For each computed day (i.e. each day that has at least 1 article come out), weights for each word must be computed. They are initialized to the weight counts from the previous step and then go through the following multipliers:
**Date Sizing ** - On a particular day, we want articles around that day to have the most contributions, so each word is multiplied by an exponential that is weighted by the difference in days between the word’s article and the computed date. This exponential goes from 15-1 from 0 days to 4 days (a word that comes out on the day of, is worth 15x as much as a word from an article 4 days ago). Any words from articles in the future (i.e. past the current date) are multiplied by 0.
TF-IDF - The TF-IDF algorithm (term frequqnecy inverse document frequency) finds words that are important to individual documents within a larger corpus. Each word weight is multiplied by its tf-idf score, this helps to surface words which are particularly important to that date and not words that are used throughout the story (e.g. shootings is a common word throughout a story but arrest might only be said on a few days).
N-gram Language Analysis - We also want to seperate words that are due to writing style vs words that are unique to the case. This is done by using the Google n-gram corpus which states the frequency of words occurence in the English language. Words which have a high positive differential (e.g. occur much more often in the article than in the english language) are multiplied by an multiplier ranging from 1-5.
Comparison between two newsclouds, one with tfidf weighting (top) and one without (bottom).
Input: For each source, an array, one for each computed, each array having a dictionary of words with weights for that day.
Output: For each date, an array of sets, each set being an intersection of 1 or more sources. Each set has a dictionary of phrases that are common to all sources in the set.
Process: For each day, the weighted phrases of each source are combined using an intersection algorithm (to be open-sourced) which for an array of sets (e.g. A,B,C) with values, will return an array of intersections (e.g. [A,B],[A,C],[A,B,C] etc) each with a dictionary of phrases that the sets have in common. Weights for each phrase are added together.
Venn Diagram Creation
Input: A list of sets, each with a combined weight of the words in each set.
Output: A venn diagram object which lists the size of each set and its center (e.g. [A,B] with size 100 and center 20,50)
Process: This step uses the d3 venn diagram library (http://www.benfrederickson.com/venn-diagrams-with-d3.js/) which will take in a list of sizes per set and generate a venn diagram placement that best satisfies those criteria. There are some numerical arrangements which will not be perfectly described by the venn diagram so it will make the closest representation it can. The size of each intersection is generated by a sum of the phrase weight it contains.
Input: An array of intersections each with a dictionary of words and weights.
Output: For each source, a size of the cloud per source (relative to the other sources) and a dictionary of phrases weighted relative to each other and other sources, normalized for good display (i.e. literal font sizes to use).
Process: Before this step, all weights are unbounded so even though they may be accurate (one word may be talked about 100x more than another), it displays/renders very poorly. Normalized Sizing sacrifices some accuracy in weighting to make words display well together.
This works by normalizing values relative to the size of the set they are a part, the algorithm is given below:
Fraction = the fraction of the total venn diagram the set takes up.
If (Fraction < 0.05) - Normalize between 5-6px font sizes.
If (Fraction < 0.15) - Normalize between 8-16px font sizes.
If (Fraction < 0.25) - Normalize between 10-30px font sizes.
If (Fraction < 0.50) - Normalize between 10-35px font sizes.
If (Fraction >= 0.50) - Normalize between 15-45px font sizes.
Input: An array of sources each with a dictionary of phrases with font sizes per font.
Output: An array of sources each with a center point and radius as well as a dictionary of phrases with a X & Y coordinate (null if the phrase could not be placed).
Process: This step uses a heavily modified d3 wordcloud algorithm (https://www.jasondavies.com/wordcloud/about/). Each word is given a size, a position and a set of circles marked as either allowed or not allowed. The algorithm then randomly places the word within the allowed circles and checks if there are any collisions either with the not-allowed circles or other words. Checks are done using a glyph bitmap of the words.
Note: This is greedy non-deterministic algorithm so every word may be placed and if the results are not cached, slightly different placements will be generated each time, in rare cases with slightly different words being displayed.