Machine learning galaxy classification. I. Project description

This is the first in an eventual series of posts in which I will play with some simple machine learning tools to revisit an old project: assigning galaxy morphology classifications based on survey photometry. Today, I am just going to describe the project and its motivation.

While my primary research interests are in galactic astrophysics (meaning that I study the constituents of our own Milky Way galaxy, i.e., stars), I made a brief foray into extragalactic research as an undergraduate. In the summer of 2010, I was selected to work with Dr. Samir Salim at Indiana University as part of the NSF’s Research Experiences for Undergraduates (REU) program. The program is a great opportunity for undergraduate students to visit another institution and conduct high-level research for a couple of months, and I can’t recommend it strongly enough.

The project that I was assigned was to come up with a scheme for classifying galaxies based on available survey photometry. Samir’s idea was to make informed guesses by comparing the survey data to nearby galaxies that are very well characterized. While the term was not part of my vocabulary at the time, this amounted to my first machine learning project. I did what made the most intuitive sense to me: I came up with what was effectively a k-nearest neighbors approach with k = 1, in which each galaxy is assigned the type of its single closest match among the well-characterized galaxies. I have since developed a deeper interest in machine learning and am now equipped to beat my previous results in a fraction of the time.
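For readers who have not met the idea before, a k = 1 nearest-neighbor classifier fits in a few lines of Python. This is only a minimal sketch, not the original code: the function name, the choice of two features (imagine a color and a light-concentration index), and every number below are made up for illustration.

```python
import math

def nearest_neighbor_label(query, reference_features, reference_labels):
    """Return the label of the reference galaxy closest to `query` (k = 1).

    Closeness is plain Euclidean distance in feature space, which is an
    assumption of this sketch rather than a description of the original method.
    """
    best_index = min(
        range(len(reference_features)),
        key=lambda i: math.dist(query, reference_features[i]),
    )
    return reference_labels[best_index]

# Hypothetical reference set: one made-up (color, concentration) pair per type.
reference_features = [[0.9, 3.1], [0.4, 2.0], [0.8, 2.8], [0.2, 1.8]]
reference_labels = ["elliptical", "spiral", "lenticular", "irregular"]

# The query sits closest to the second reference row.
print(nearest_neighbor_label([0.45, 2.1], reference_features, reference_labels))
```

In practice the features would need to be put on comparable scales first, since otherwise whichever feature has the largest numerical range dominates the distance.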

I’ll focus mostly on the machine learning details for the blog, but briefly: there are many different types of galaxy, and studying the observational relationships between types can help us to understand how they form and evolve. The best-known system of classification is Hubble’s tuning fork diagram, depicted below. I focus my classification on four general galaxy types: ellipticals, which are generally massive and old with minimal ongoing star formation and are known for their blobby shapes; disk-like spirals, like our own Milky Way, which are named for the spiral arms in which new stars form vigorously; lenticular galaxies, which share the diskiness of spirals but otherwise have the featurelessness and low star formation rates of ellipticals; and, finally, irregular galaxies (not pictured), whose shapes are more exotic, perhaps due to their relatively low masses.

[Figure: Hubble tuning fork diagram]

It’s probably for the best that I have lost all of my previous code; I don’t think I would want to read it now. Luckily, the abstract of the poster that I presented on this project at the 217th AAS meeting (my first) in Seattle has been archived online. This is what I am trying to exceed:

Galaxies have been historically classified based on their morphologies, requiring visual identification of their defining structural features. This can be extremely time consuming and especially difficult for galaxies that are poorly resolved or at certain orientations. To quell these issues, a new automated proxy for visual morphological classification is needed. We utilized photometric magnitudes and angular size information from the Sloan Digital Sky Survey (SDSS) and Galaxy Evolution Explorer (GALEX) data releases to categorize galaxies at redshifts near z=0.1 as elliptical, lenticular, spiral, or irregular. A galaxy of interest has its catalog photometry corrected for redshift and is then compared to galaxies of known type from the Third Reference Catalog of Bright Galaxies (RC3) by a chi-squared goodness-of-fit analysis of their spectral energy distributions, radial light concentrations, ellipticities, and UV-to-optical size ratios. Testing this method on the RC3 galaxies themselves yielded probabilities that each of the four outcomes result from each source type. Overall, results are drawn from the correct sources a majority of the time, at 52% for ellipticals, 45% for lenticulars, 61% for spirals, and 71% for irregular galaxies. These likelihoods held up when the method was tested on galaxies near the target z ~ 0.1 redshift with rough classifications available from the COSMOS survey. Finally, the method was applied to numerous galaxies at z ~ 0.1 with established star formation rates and stellar masses to reveal connections between these values and galactic type. Most notably, lenticular galaxies, while of comparable mass to ellipticals, were shown to be undergoing more current star formation, and irregular galaxies were observed to contain generally less stellar material than spirals. This project was supported by the National Science Foundation as part of the Summer 2010 Astronomy REU Program at Indiana University.
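The chi-squared comparison described in the abstract can be sketched roughly as follows. The abstract does not spell out the exact statistic or which uncertainties entered it, so this assumes the textbook form, a sum of squared differences each divided by a squared measurement uncertainty, and assigns the type of the best-matching reference galaxy. The function names, the three features, and all values are hypothetical.

```python
def chi_squared(observed, reference, uncertainties):
    """Sum of squared, uncertainty-scaled differences between two feature vectors."""
    return sum(
        ((o - r) / s) ** 2
        for o, r, s in zip(observed, reference, uncertainties)
    )

def best_match(observed, uncertainties, references):
    """Return the label of the reference galaxy with the smallest chi-squared.

    `references` is a list of (label, feature_vector) pairs.
    """
    label, _ = min(
        references,
        key=lambda ref: chi_squared(observed, ref[1], uncertainties),
    )
    return label

# Made-up measurements for one target galaxy and two reference galaxies.
observed = [1.2, 0.5, 3.0]
uncertainties = [0.1, 0.1, 0.2]
references = [
    ("elliptical", [1.3, 0.6, 3.1]),
    ("spiral", [0.7, 0.2, 2.2]),
]

print(best_match(observed, uncertainties, references))
```

Dividing by the uncertainties means that well-measured quantities count for more than noisy ones, which is the main practical difference from the plain Euclidean distance of an ordinary nearest-neighbor search.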

In my next (eventual) post in this series, I will collect and process the catalog data for bright, nearby galaxies that I will use to train my machine learning algorithms. Stay tuned.

EDIT: I can’t believe I actually managed to find a copy of my first-ever conference poster in a disused email account. Those interested can download it here, but it’s 6.66 MB—you stand warned.