
Geocode batch conversion + latitude, longitude, formatted address

I recently worked on a project that required getting latitude and longitude from unstructured addresses. My advisor found a website that runs this conversion in batch, and it turned out to be quite decent.

http://www.findlatitudeandlongitude.com/batch-geocode/

Judging from a quick look at its JavaScript, the site seems to use Google’s geocoding API. If you give it any text that describes a location, it does its best to return a normalized address along with latitude and longitude coordinates. Here’s an example.

"original address","returned address",latitude,longitude,accuracy,status code
"georgia tech","Georgia Institute of Technology, North Ave NW, Atlanta, GA 30332, USA",33.775618,-84.396285,3,200
"seoul","Seoul, South Korea",37.566535,126.977969,3,200
"white house","The White House, 1600 Pennsylvania Avenue Northwest, Washington, DC 20500, USA",38.897676,-77.03653,3,200

It is a pretty useful research tool for me, since I don’t have to write code myself to call Google’s geocoding API.
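Once the batch output is saved, it is easy to load into a script. Below is a minimal Python sketch that parses the sample output shown above and keeps only the rows whose status code is 200. The function name `parse_geocodes` is made up for illustration, and treating 200 as "success" is an assumption based on the sample rows.

```python
import csv
import io

# Sample output copied from the batch geocoder run shown above.
SAMPLE = '''"original address","returned address",latitude,longitude,accuracy,status code
"georgia tech","Georgia Institute of Technology, North Ave NW, Atlanta, GA 30332, USA",33.775618,-84.396285,3,200
"seoul","Seoul, South Korea",37.566535,126.977969,3,200
"white house","The White House, 1600 Pennsylvania Avenue Northwest, Washington, DC 20500, USA",38.897676,-77.03653,3,200
'''


def parse_geocodes(text):
    """Return {original address: (latitude, longitude)} for rows with status 200."""
    coords = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: status code 200 marks a successful lookup.
        if row["status code"] == "200":
            coords[row["original address"]] = (
                float(row["latitude"]),
                float(row["longitude"]),
            )
    return coords


if __name__ == "__main__":
    for address, (lat, lng) in parse_geocodes(SAMPLE).items():
        print(address, lat, lng)
```

The `csv` module handles the quoted addresses that contain commas, which a plain `split(",")` would break.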

Kaffeine: prevent heroku app from sleeping

Most cloud platforms put an app to sleep if it receives no requests for a certain amount of time. Heroku’s threshold is 1 hour. I found several discussions on the web:

I first tried adding New Relic to my Heroku app, but it seemed like overkill and was a bit involved to get working.

Then I found Kaffeine, and it worked for me. One limitation is that it only works for Heroku apps. http://kaffeine.herokuapp.com/
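The idea behind this kind of service is simply to request the app at an interval shorter than the idle threshold. Here is a rough Python sketch of that idea, not how Kaffeine itself is implemented. The function name `keep_awake` and its parameters are hypothetical, and `fetch` and `sleep` are injectable only so the loop can be tried without real network calls.

```python
import time
import urllib.request


def keep_awake(url, interval_seconds, rounds, fetch=None, sleep=time.sleep):
    """Request `url` `rounds` times, pausing `interval_seconds` between requests.

    Returns the list of HTTP status codes (None for a failed request).
    """
    if fetch is None:
        # Default fetcher: plain GET request, returning the HTTP status code.
        fetch = lambda u: urllib.request.urlopen(u, timeout=10).status
    statuses = []
    for i in range(rounds):
        try:
            statuses.append(fetch(url))
        except OSError:
            # urllib.error.URLError is a subclass of OSError; record and continue.
            statuses.append(None)
        if i < rounds - 1:
            sleep(interval_seconds)
    return statuses
```

With a 1-hour threshold, something like `keep_awake("https://your-app.example/", 30 * 60, 48)` would ping a placeholder URL every 30 minutes for a day. The script itself has to run on a machine that stays awake.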

For other cloud providers, here are a few alternatives I found.

Leveraging the color palette used by Google Visualization

I have used colors from the Google Visualization default palette for several projects. You can quickly generate all the colors using the simple code below, then pick the ones you want and save them for yourself.

http://jsfiddle.net/qfymrmnd/

If you don’t have a way to extract the RGB codes, here they are. The order is blue, red, orange, green, purple, and so on. Although I am giving you about 20 colors here, I personally try not to go beyond 5-6 colors at most. The colors in this list beyond that threshold seem to start repeating themselves.
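If you only have hex codes and need R, G, B triples, the conversion is short. The Python sketch below uses five hex values that, as far as I recall, are the first five defaults of the Google palette in the order given above. Treat them as an assumption and check them against the fiddle output. `hex_to_rgb` is a helper name made up for this example.

```python
# Assumption: the first five default Google Charts colors, in the order
# blue, red, orange, green, purple. Verify against the fiddle output.
PALETTE_HEX = ["#3366CC", "#DC3912", "#FF9900", "#109618", "#990099"]


def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' into an (R, G, B) tuple of integers from 0 to 255."""
    digits = hex_code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_code!r}")
    # Read the string two hex digits at a time: RR, GG, BB.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


if __name__ == "__main__":
    for code in PALETTE_HEX:
        print(code, hex_to_rgb(code))
```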

Circos Data Format Explained

Introduction

Circos is a visualization tool that draws networks in a circular layout. In my experience and to the best of my knowledge, it is the richest medium in which a network can be shown and data can be visually encoded.

What this post is NOT about

Circos is not the most user-friendly visualization tool on earth. If you are looking for installation help, you’ve got the wrong number. Here are a few recommended readings that I referred to when I had problems installing circos.

Making sure all parts of circos are downloaded and properly loaded is just painful. I would guess that more than half of the people fascinated by a circos graphic give up on installing it on their machine. It was that hard for me.

Once you have installed circos and can produce a graphic by following the official tutorials, you will be amazed by how comprehensive they are. However, the problem with such a comprehensive set of tutorials is that you cannot easily find out how to convert your traditional network viz into one of the cool circos viz, either conceptually or technically.

This post is intended for those who 1) have installed circos, 2) have produced some graphics following its tutorials, and 3) now want to plug their own data into the circos format. It’s my attempt to document how I understand the transformation of a usual network visualization into a circos visualization.

Anatomy of a circos visualization

Before getting started, I’d like to emphasize that circos has its own names for the parts of a visualization. This made it hard for me to understand the tutorials at first. So, first off, I recommend you skim through the following nice summary slides on the anatomy of circos graphics.

http://jura.wi.mit.edu/bio/education/hot_topics/Circos/Circos.pdf

From the figure above, remember four elements: (B) ideograms, (H) ticks, (F) highlights, and (E) links. An ideogram is one of the thick arc segments arranged around the big circle. Ticks mark units along an arc. Highlights emphasize a certain part of an arc. Lastly, links are connections between arcs.

How circos is conceptually different from the usual network visualization

If I were to create a usual network visualization of a two-node graph, it would look like this.

Circos can certainly visualize this kind of relationship, but it can also handle much more complex ones. For example, suppose the two-node graph we saw above is now a multi-graph, i.e., a pair of nodes can have more than one edge. The figure below shows this network. Nodes 1 and 2 now have three edges between them with varying weights.

If you share some sense of aesthetics with me, you will see that it’s ugly and, more importantly, arbitrary, and that there must be a better way to deal with this sort of situation. Circos is that way.

Understanding circos data format

Circos was originally built to visualize relationships among chromosomes in genomic data. Look at some of these wiki pages to see if they help.

You may think its origin doesn’t really matter as long as it solves your problem. The problem is that the circos documentation explains things using biology jargon such as karyotype and chromosome, which I think hinders understanding for a general audience.

After some hours of struggling, I devised my own way of interpreting the biological concepts built into circos. First of all, a chromosome is a node. So, you need to prepare a file that lists all nodes. Suppose nodes 1 and 2 are the US and China, respectively, and you are trying to visualize trade between them. The first thing you need is something like this.

nodes.txt

chr - usa USA 0 2000 myblue
chr - chn CHINA 0 1000 myred

Let me explain one by one.

  • Two lines: We will have two nodes in our viz.
  • Every line starts with “chr - ”: It’s just a convention denoting that this line describes a node (i.e., a chromosome).
  • “usa” / “chn”: node id
  • “USA” / “CHINA”: node labels
  • 0 to 2000 / 0 to 1000: node size (i.e., start and end position). The USA node is of size 2000 and the CHINA node is of size 1000. Note that circos only accepts integers for positions.
  • The last element of each line denotes node color. I will explain how to define your own color in the next section. Here we focus on setting up data files in the right format.
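If your nodes already live in a script, the file can be generated instead of typed by hand. The following Python sketch builds the two lines of nodes.txt from a list of tuples. The function name `karyotype_lines` and the tuple layout are my own choices for illustration, and the integer check mirrors the note above about positions.

```python
def karyotype_lines(nodes):
    """Build karyotype lines from (node_id, label, size, color) tuples."""
    lines = []
    for node_id, label, size, color in nodes:
        # Positions must be integers, so reject fractional or empty sizes early.
        if not isinstance(size, int) or size <= 0:
            raise ValueError(f"size of {node_id!r} must be a positive integer")
        lines.append(f"chr - {node_id} {label} 0 {size} {color}")
    return lines


NODES = [("usa", "USA", 2000, "myblue"), ("chn", "CHINA", 1000, "myred")]

if __name__ == "__main__":
    print("\n".join(karyotype_lines(NODES)))
```

Redirecting the printed output to data/usachn/nodes.txt gives the same file as above.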

Now that you have prepared the list of nodes, let us move on to the list of edges. Edges are called “links” in circos. Recall the example of the two-node multi-graph above. Suppose we are trying to implement three edges between USA and CHINA.

edges.txt

usa 200 500 chn 100 250 color=myblue_transparent
usa 700 900 chn 500 600 color=myred_transparent
usa 1200 1300 chn 800 850 color=myblue_transparent

Each line is formatted as “node1_id node1_start node1_end node2_id node2_start node2_end color=mycolor”. Note that a pair of nodes can have multiple edges and each edge occupies a different part of each node. This will be more evident in the final graphic.
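Because every edge end has to fit inside its node, it helps to check the positions while writing the file. Here is a Python sketch that formats the three edges above and raises an error if a span falls outside the node sizes from nodes.txt. `link_lines`, `NODE_SIZES`, and `EDGES` are names invented for this example.

```python
NODE_SIZES = {"usa": 2000, "chn": 1000}

# (node1_id, node1_start, node1_end, node2_id, node2_start, node2_end, color)
EDGES = [
    ("usa", 200, 500, "chn", 100, 250, "myblue_transparent"),
    ("usa", 700, 900, "chn", 500, 600, "myred_transparent"),
    ("usa", 1200, 1300, "chn", 800, 850, "myblue_transparent"),
]


def link_lines(edges, node_sizes):
    """Format edges as link lines, checking each end fits on its node."""
    lines = []
    for src, s0, s1, dst, d0, d1, color in edges:
        for node_id, start, end in ((src, s0, s1), (dst, d0, d1)):
            if not 0 <= start < end <= node_sizes[node_id]:
                raise ValueError(f"span {start}-{end} does not fit on {node_id!r}")
        lines.append(f"{src} {s0} {s1} {dst} {d0} {d1} color={color}")
    return lines


if __name__ == "__main__":
    print("\n".join(link_lines(EDGES, NODE_SIZES)))
```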

We have now prepared all the basic elements. These two files are just the bare bones; circos provides many more charting features, which I cannot go over in this post. Let me show you how to play around with one of them. Suppose you want to highlight some parts of the nodes with different colors. Then you prepare the following file in addition to nodes.txt and edges.txt.

highlights.txt

usa 100 700 fill_color=myred
usa 700 1300 fill_color=myblue
usa 1300 1900 fill_color=myred
chn 100 500 fill_color=myblue
chn 500 900 fill_color=myred

Each line gives a node id, a start and end position on that node, and a fill color. The structure of this highlight file will become self-evident when we see the output visualization.
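The highlights above are equal-width segments with alternating colors, so they can also be generated. This Python sketch reproduces the five lines of highlights.txt. The helper name `highlight_lines` and its parameters are hypothetical.

```python
from itertools import cycle


def highlight_lines(node_id, start, width, count, colors):
    """Return `count` adjacent highlight lines of equal width, cycling colors."""
    lines = []
    palette = cycle(colors)
    for i in range(count):
        seg_start = start + i * width
        seg_end = seg_start + width
        lines.append(f"{node_id} {seg_start} {seg_end} fill_color={next(palette)}")
    return lines


if __name__ == "__main__":
    rows = highlight_lines("usa", 100, 600, 3, ["myred", "myblue"])
    rows += highlight_lines("chn", 100, 400, 2, ["myblue", "myred"])
    print("\n".join(rows))
```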

Putting all together into a visualization

At the heart of every circos visualization is the configuration file. A config file contains a list of commands (or directives) you want the viz engine to perform. Your custom definitions of color, font, and placement all go into the config file. Now that you have all three data files ready (nodes.txt, edges.txt, and highlights.txt), you just need to reference them with the right directives in the config file. Let’s say your main configuration file is named “usachn.conf” under the “etc/” folder and the data files reside in the “data/usachn/” folder.

First, read in your nodes with this.

karyotype = data/usachn/nodes.txt

Your edges are read by this.

<links>
<link>
file = data/usachn/edges.txt
ribbon = yes
flat = yes
radius = dims(ideogram,radius_inner)-30
bezier_radius = 0r
</link>
</links>

“ribbon” and “flat” should be set to “yes” to make circos render edges as defined in our data file, edges.txt. “radius” determines where links start; in this case, edges are drawn 30 pixels inside the inner circle of the ideogram. (The ideogram is the circular ring of nodes.) “bezier_radius” determines the curvature of the edges.

Highlights are called in as follows.

<plots>
<plot>
type = highlight
file = data/usachn/highlights.txt
r0   = dims(ideogram,radius_inner)-5-15
r1   = dims(ideogram,radius_inner)-5
stroke_color = dgrey
stroke_thickness = 0p
</plot>
</plots>

The output file destination is set as follows.

<image>
<<include etc/image.conf>>
file* = circos-usachn.png
</image>

Lastly, your custom colors (“myblue” and “myred”) are defined in the configuration file as follows.

<colors>
myblue = 0, 0, 255
myred = 255, 0, 0

myblue_transparent = 0, 0, 255, .5
myred_transparent = 255, 0, 0, .5
</colors>

For each line of custom color definition, the first three elements are R, G, and B, and the optional fourth field is alpha (transparency). 0 is fully opaque and 1 is fully transparent.
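Since many tools think in terms of opacity (1 is solid) rather than transparency, it is easy to get the fourth field backwards. The Python sketch below follows the convention stated above (0 is opaque, 1 is transparent) and converts an opacity into a color definition line. `circos_color` is a name made up for this example.

```python
def circos_color(name, rgb, opacity=1.0):
    """Format a color definition line; alpha is written as 1 - opacity."""
    r, g, b = rgb
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must be between 0 and 1")
    if opacity == 1.0:
        # Fully opaque colors do not need the optional fourth field.
        return f"{name} = {r}, {g}, {b}"
    alpha = round(1.0 - opacity, 3)
    return f"{name} = {r}, {g}, {b}, {alpha}"


if __name__ == "__main__":
    print(circos_color("myblue", (0, 0, 255)))
    print(circos_color("myblue_transparent", (0, 0, 255), opacity=0.5))
```

At 50% opacity the two conventions happen to agree, which is why the .5 values above look the same either way; the difference only shows at other values.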

The full configuration file can be viewed and downloaded here. You run circos using this config file in the command line as follows.

circos -conf etc/usachn.conf
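If the data files are generated by a script, the same command can be launched from that script. A minimal Python sketch, assuming the `circos` executable is on your PATH and that you run it from the folder containing etc/ and data/; the function names are my own.

```python
import shutil
import subprocess


def circos_command(conf_path):
    """Return the argument list for the command shown above."""
    return ["circos", "-conf", conf_path]


def run_circos(conf_path):
    """Run circos on the given config file, failing loudly if it is missing."""
    if shutil.which("circos") is None:
        raise FileNotFoundError("circos executable not found on PATH")
    # check=True raises CalledProcessError if circos exits with an error.
    return subprocess.run(circos_command(conf_path), check=True)


if __name__ == "__main__":
    run_circos("etc/usachn.conf")
```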

And finally the resulting visualization will look like this.

It may not be as fancy as what you have seen on the internet, but by now you probably have a better idea of how circos interprets your commands and data. You can play around with some of the parameters in the config file to see which setting leads to which output feature.

Conclusion

Circos is an impressively rich medium for visualization. It provides tons of other visualization elements such as histograms and 2D plots. Admittedly, I don’t know everything circos offers. But when I struggled with the conceptual side of circos, I couldn’t find a simple go-to example on the web. All the documentation and even the tutorials seemed very archaic to me. (I probably understand them better now.) So, I decided to write one. Hope it helps!