Machine learning in physics
Machine learning is going to shape the future. In fact, it has already shaped the present. It is found in almost every facet of our lives, from speech recognition on our smartphones, to fraud prevention in banks, to medical diagnostics. As we browse the internet we are bombarded with recommendations and advertisements “based on your interests”, all of which are the products of some machine learning algorithm taking in our browsing data and spitting out some video, article or product that it thinks will pique our interest. It is natural, then, that scientists wish to use this powerful tool to aid their own research.
This raises the question: how exactly does it work? How is the machine actually learning? There are a few different methods of “training” the machine, or algorithm, to recognise patterns, the most common of which is known as supervised learning. In supervised learning, the algorithm is first trained with a large set of data to which the correct labels have already been assigned. The algorithm takes in each piece of data, performs some operation on it, and compares its output to the correct label provided. It can then see how right (or wrong) its output was, go back and adjust its process, and hopefully be slightly less wrong next time. Training these algorithms generally takes an enormous number of examples, but the goal is that eventually the algorithm will be able to recognise things on its own.
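As a toy illustration of this loop (a made-up one-parameter model, not any particular algorithm from the literature), a sketch in Python might look like this:

```python
# Minimal sketch of supervised learning: a single adjustable rule
# (is w * x + b positive?) learns to separate two labelled groups of numbers.
def train(examples, labels, epochs=50, lr=0.1):
    w, b = 0.0, 0.0                            # the model's adjustable parameters
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            guess = 1 if w * x + b > 0 else 0  # perform the operation
            error = y - guess                  # compare output to the given label
            w += lr * error * x                # adjust the process, to be
            b += lr * error                    # slightly less wrong next time
    return w, b

# Training data: small numbers are labelled 0, large numbers are labelled 1.
w, b = train([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])

def predict(x):
    return 1 if w * x + b > 0 else 0
```

After training, `predict` returns 0 for small inputs and 1 for large ones, having never been told the rule explicitly.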
This is somewhat similar to how one could train a person to recognise something. If I showed someone a picture of an animal they had never seen before, for example a lemur, and asked them what it is, they would likely give some nonsense answer. Or, if they had “trained” in recognising other animals, they might say it is a skunk, due to the black and white tail. If I then told them “No, this is a lemur!”, they could go back, adjust their thinking process, and if I showed them more pictures of lemurs, they should be able to recognise them. The same thing happens in a machine learning algorithm, except algorithms can be used to find patterns in massively complicated objects, provided they are trained with enough data.
A classic example of machine learning is image recognition. Say you want to train an algorithm to recognise pictures of elephants. For simplicity, the algorithm might first convert each photo to greyscale. A greyscale photo is really just thousands of pixels, each with some amount of darkness assigned to it, which can be represented as a number. You feed these numbers into the algorithm, telling it “This is an elephant”, and it will eventually be able to tell you on its own whether something is or isn’t an elephant. Of course, you would also have to feed it pictures of things that aren’t elephants so it can tell the difference between the two.
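To make this concrete, here is a tiny sketch (using an invented 2×2 image rather than a real photo) of how a picture becomes the list of numbers the algorithm actually receives:

```python
import numpy as np

# An invented 2x2 colour image: each pixel holds (red, green, blue) intensities.
image = np.array([[[200, 100, 50], [10, 20, 30]],
                  [[0, 0, 0], [255, 255, 255]]], dtype=float)

# Convert to greyscale by averaging the colour channels:
# one darkness value per pixel.
grey = image.mean(axis=2)

# Flatten the grid into the plain list of numbers fed to the algorithm.
features = grey.flatten()
```

A real photo would produce thousands of such numbers instead of four, but the principle is the same.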
The patterns the algorithm recognises, and which it uses to define an elephant, are often completely meaningless to us. Where we might recognise an elephant by its trunk, big ears, tusks and huge body, the algorithm may not be able to distinguish any of these. It may see completely different things as the defining features of an elephant.
The use of machine learning in scientific research is becoming increasingly widespread and important. In particular, physicists and machine learning algorithms have very similar goals. Both are focused on developing a model, based on some experimental data, which describes the universal behaviour of whatever the experiment is investigating. Where they differ is in their methods. A physicist will try to find some elegant law describing the fundamental workings of nature. A machine learning algorithm will brute-force its way to some pattern that fits the data and can predict new results, but which will often be completely meaningless to those who try to interpret it. This can be extremely useful: there are many processes which are still mysterious to science, but with machine learning we can use these processes to generate predictions anyway, without yet having a true understanding of them.
Here we will discuss just a few of the ways in which machine learning is being used in scientific research. First, we will explore how machine learning is being used to design new objects by mimicking evolution through natural selection. We will also see how machine learning is being used to better understand and generate designs for new molecules, before going deeper and seeing how it is used to understand the fundamental forces governing our universe.
A wonderful introduction to some concepts of machine learning, and in particular neural networks (a type of machine learning algorithm), can be found on 3Blue1Brown’s YouTube series on the topic.
Generative design of hardware and mechanical devices
The same design process is used to create every form of hardware, from microscopes to spacecraft: developing a design which meets a set of physical constraints while minimising the cost of producing the object. This process has two main drawbacks. Firstly, it requires technical expertise and is highly manual; every dimension and feature of each part must be precisely defined using domain-specific software tools and knowledge. Secondly, the creativity and level of exploration of the design space are limited by the capabilities of the available software and by how fast the designer can iterate through and generate new designs. This results in much of the viable design space being left unexplored.
A common misconception about machine learning is that it will design highly mechanical and logical structures to meet the required constraints. However, the designs produced using generative algorithms only use material where it is absolutely necessary to meet the constraints. This results in the development of very organic structures which appear to mimic those found in nature such as a tree’s branches or animal’s skeletal structure. This biomimicry arises from how the machine learning algorithm produces its final solution.
Most generative algorithms are genetic algorithms, which act in a similar way to the process of natural selection. These recursive algorithms take some randomly generated initial designs; of those, the designs which best fit the constraints are kept and mixed together to produce new and better ones. This process repeats, discarding poor designs, until the algorithm converges on a small number of possible solutions, which may differ greatly depending on the initial designs they were evolved from.
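A genetic algorithm of this kind can be sketched in a few lines of Python. Here the “designs” are simple bit strings, and the fitness function (which just counts 1s) is a stand-in for real physical constraints:

```python
import random

def fitness(design):
    # Toy stand-in for physical constraints: reward designs with more 1s.
    return sum(design)

def evolve(pop_size=20, length=16, generations=40, seed=0):
    rng = random.Random(seed)
    # Randomly generated initial designs.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the designs that best fit the constraints...
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # ...and mix them together to produce new designs.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]          # crossover of two parents
            if rng.random() < 0.1:             # occasional random mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children        # poor designs are discarded
    return max(population, key=fitness)

best = evolve()
```

After a few dozen generations the surviving designs are far fitter than any of the random initial ones, even though no design was ever constructed deliberately.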
The use of machine learning in design generation has found many applications in recent years. NASA has used this process to evolve antennas for its ST5 and TDRS-C missions. The main motivation for using generative design over the classical design process was to overcome the significant amounts of domain expertise, time and labour required when designing new antennas. The generative design method also took into account the effects of the antenna’s surroundings, which even the most skilled antenna designers find difficult to do, due to the complexity this adds to the signal the antenna is trying to detect and emit. The fitness of each design was measured by how well it minimised the fraction of the signal energy that wasn’t picked up by the antenna (VSWR), the amount of error in the received signal (RMSE) and the amount of data that would be lost when receiving the signal. The antennas were also designed to transmit signals using a similar measure of fitness.
Generative design of molecules
The discovery of new molecules and materials can usher in enormous technological progress in fields such as drug synthesis, photovoltaics, and redox flow battery optimisation. However, the chemical space of potential materials contains on the order of 10^60 candidates, far too many for traditional optimisation methods to handle computationally. This is apparent in the time scale for deployment of new technologies: from laboratory discovery to commercial product has historically taken 15 to 20 years. The underlying steps of the discovery process in material development involve generation, simulation, synthesis, incorporation in devices or systems, and characterisation, with each step potentially lasting years at a time.
Generative (or inverse) design using machine learning allows this discovery process to close the loop, concurrently proposing, creating, and characterising materials, with each step transmitting and receiving data simultaneously. In traditional quantum chemical methods, properties are revealed only after the essential parameters have been defined and specified; that is, we can only investigate the properties of a material after we create it. Inverse design, as the name suggests, inverts this: the input is now the desired functionality, and the output is a distribution of probable structures. In practical settings, a combination of functionality and suspected materials is used as input to the generative neural network. In certain applications this can decrease the time from molecular discovery to deployment by a factor of 5.
The mechanism underlying such neural networks is a joint probability distribution P(x, y): the probability of observing both the molecular representation (x) and the physical property itself (y). Without going too deep into the details, a generative model is trained on large amounts of data and attempts to generate data like it. A loss function encodes the notion of “likeness”, measuring the difference between the empirically observed probability distribution and the generated P(x, y).
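As a toy numerical illustration (the molecule names and property labels here are invented, and the “model” is just a frequency table rather than a neural network), one can fit a joint distribution P(x, y) to observed pairs and score it with a negative log-likelihood loss:

```python
import math
from collections import Counter

# Invented data: (molecular representation x, physical property y) pairs.
data = [("A", "stable"), ("A", "stable"), ("B", "unstable"), ("A", "unstable")]

# The simplest possible "generative model": a table of joint probabilities
# P(x, y) fitted to the empirical frequencies of the observed pairs.
counts = Counter(data)
model = {pair: n / len(data) for pair, n in counts.items()}

def loss(model, data):
    # Average negative log-likelihood: small when the model's distribution
    # matches the empirically observed one, large when it does not.
    return -sum(math.log(model[pair]) for pair in data) / len(data)
```

A neural network replaces the frequency table with a flexible parameterised distribution, but it is trained by pushing down exactly this kind of loss.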
Generative machine learning methods for molecules, materials, and reaction mechanisms have only recently been applied, with the majority of publications emerging in the past 3 years. One such paper, from William Bort and colleagues, published in February 2021, demonstrated that AI was able to generate novel chemical equations that were stoichiometrically viable. The startup Kebotix recently received $11.5 million in funding to continue its work in the promising field of automated material discovery.
Improving models in particle physics
Beyond discovering patterns in the maze of forces working to hold the atoms in a material together, AI is used at the current bedrock of our understanding of matter: helping us understand the substructure of subatomic particles. Subatomic particles are the protons and neutrons that sit at the centre of an atom; their substructure is composed of quarks glued together by gluons, the particles which mediate the strong force between them. Exactly how these completely fundamental particles fit together to give the structure of matter as we know it is accounted for in what particle physicists call the Standard Model.
The notion of symmetry is critical to particle physics: symmetries explain the underlying dynamics by dictating which interactions are allowed, and they necessitate the existence of particles that have since been experimentally confirmed. When solving problems in particle physics, it is important that these symmetries are leveraged to simplify the physics, since the problems are so big that it is no longer as easy as plugging in the numbers. There is not enough supercomputing power in the world to compute Standard Model predictions for dark matter scattering, for example.
The reason the physics is so computationally heavy is that in low-energy limits the theory is non-perturbative, meaning physicists can’t solve an easier problem and nudge their answer towards the harder situation they really want to solve. They have to be exact. The solution is to take the region of four-dimensional space-time that the quark and gluon fields dance within and “grid-ify” it. To give a two-dimensional example of why we would do this: if you were at a beach with waves gentle enough that they weren’t crashing, you could imagine placing a large floating net of lights on the water. When night arrives, you can’t see the waves, but you can see the lights, and the way the waves move the lights gives a really good impression of the waves themselves.
To take this analogy one step further, imagine that it’s very costly to run a light (that is, to collect data at a grid point), but you still want to see the waves, or at least know where they are. You then have to sample the grid with a couple of lights, and you want your sampling to be maximally informative since it’s so costly. We have a priori information that there is, say, 10 metres between successive waves, so there’s little point looking 5 metres after one wave expecting to see another. There’s also no use in sampling lights too close together, as the information they carry is probably not that different (they’re correlated). Non-perturbative Standard Model sampling methods are similar; the most common is a combination of Hamiltonian mechanics and Markov chain Monte Carlo. The Hamiltonian mechanics aspect brings the a priori physics knowledge to the sampling process, telling us where it is more useful to look, and the Markov chain Monte Carlo optimises for choosing grid points that are as uncorrelated as possible, to squeeze out the most information. However, the Markov method’s checking for correlation between neighbouring grid points quickly becomes more expensive than simply sampling the grid.
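The sampling idea can be illustrated with the simplest Markov chain Monte Carlo method, a Metropolis sampler for a one-dimensional made-up distribution (real lattice calculations sample whole field configurations on a 4D grid, and add the Hamiltonian-dynamics guidance discussed above):

```python
import math
import random

def metropolis(steps=20000, step_size=1.0, seed=1):
    # Sample from a toy target distribution p(x) proportional to exp(-x^2 / 2).
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(steps):
        proposal = x + rng.uniform(-step_size, step_size)  # propose a nearby point
        # Accept with probability min(1, p(proposal) / p(x)). Rejected moves
        # repeat the old point, so successive samples are correlated -- the
        # very correlation that costs computing power to manage.
        if rng.random() < math.exp((x * x - proposal * proposal) / 2):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis()
mean = sum(samples) / len(samples)                          # close to 0
var = sum((s - mean) ** 2 for s in samples) / len(samples)  # close to 1
```

The chain only ever uses ratios of probabilities between neighbouring points, which is exactly why correlated, hard-to-reach regions of the grid are expensive to explore.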
It happens in particle physics calculations that some grid points are less expensive to compute than others. It also happens that the aforementioned symmetries mean the state of the field at one point strongly implies the state of the field nearby, further than what the Markov chain can check for without bloating its computing power requirements. This is where AI steps in. The AI (a convolutional neural network) can be taught the symmetries at play, so that we may sample in an inexpensive grid region close to an expensive one, and let the AI transform the result of the inexpensive sampling, under the rules of Standard Model symmetries, into a sampling of the expensive region. What’s really cool is that the mathematics responsible for the way the AI builds up its transformation means that its inference is provably exact, with no approximation necessary, which is good because we’re working within a non-perturbative theory.
Best practices for physicists making ML models
To put all of this into practice, and to ensure that physicists do not waste too much of their time wrestling with code, there needs to be a relationship between the physics and computer science faculties similar to that which exists between the physics and mathematics faculties. As with the introduction of calculus, machine learning and computational physics have pushed, and will continue to push, physics in new and interesting directions. Calculus provided a whole new way to model and think about physical systems – the language of change that describes our ever-so-dynamic universe. If a physicist wished to take advantage of this new way of thinking, they first needed to learn the corresponding formalisms, or to put it simply, they needed to learn the mathematics; there is no getting around the derivatives, limits, integrals, proofs and theorems (with theoretical physicists being the bastard children of both the maths and physics departments). It is necessary, then, to establish a common language that will enable computational physicists to collaborate and push forward our understanding of the universe in a manner that minimises the amount of time some post-grad spends reinventing the wheel. Established fields such as mathematics and physics don’t even use the same convention for spherical coordinates, let alone agree on how to ensure that one’s machine learning model is usable across computational environments – I’m willing to bet that not many undergraduates know what Docker is or what a setup.py does. The point is that physicists need to learn and embrace the principles and tools of software engineering to succeed.
Before even getting into the particulars of ML and the corresponding pipelines and workflows for training and evaluation using open-source platforms such as Kubeflow, we need to discuss some of the basic foundations that every physicist should know, because I am nearly certain that most physics students have yet to complete their first pull request.
The most fundamental of these working principles are project structure and version control. These uphold the very nature of scientific enquiry and empiricism by enabling scientific results and evidence to be reproduced and confirmed by other researchers. Where in the repository is the raw data stored? The transformed training data? The script that runs the model? These are the sorts of project structure issues that frameworks like Cookiecutter Data Science are designed to solve; they facilitate the implementation of more streamlined workflows. Being able to communicate your work and have others reproduce it is essential, and version control, in a nutshell, is a way of tracking and managing changes to a set of files – in this case computer code, more than likely .py and .ipynb files. The most popular piece of software for implementing it is Git. Once you have turned your project into a Git repository, you need a way to manage it, and this is where tools such as GitLab, Bitbucket or GitHub come into play. These management tools then allow you to easily work on and develop your project or research in a constructive manner.
So now that you can share your code and how it has developed over time, you need a way to make sure that new changes do not end up breaking it. This is where testing comes in. It is essential, and without going into too much detail (it’s in the name), tests are code that makes sure your code works. There are several kinds: unit testing, integration testing and system testing. Unit tests, the easiest to start implementing, are tests for individual elements of your code, such as functions. There are many fantastic open-source libraries, such as Pytest, that will have you writing unit tests faster than you can do your stat phys homework. Unit tests are essential for verifying that individual code components are internally consistent and correct before they are placed in more complex contexts, such as a pipeline for training your model.
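As a small illustration (the function and its file are invented for this example), here is a unit test in the style Pytest collects automatically:

```python
# A small physics helper function...
def kinetic_energy(mass, velocity):
    """Classical kinetic energy, (1/2) m v^2."""
    if mass < 0:
        raise ValueError("mass must be non-negative")
    return 0.5 * mass * velocity ** 2

# ...and a unit test for it. Pytest automatically runs any function whose
# name starts with "test_" and reports every assertion that fails.
def test_kinetic_energy():
    assert kinetic_energy(2.0, 3.0) == 9.0
    assert kinetic_energy(0.0, 10.0) == 0.0
```

Saved as, say, test_energy.py, running `pytest` in the project directory will find and execute the test, so any later change that breaks `kinetic_energy` is caught immediately.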
Now your project is structured in such a way that you, and more importantly your colleagues, can understand what you are doing, and you have a way of catching and dealing with any bugs that arise during development. All you need from here is a way for other people to run your models on their own systems or resources. This brings us nicely onto environments. What Linux distro are you using? What dependencies do you need? These questions are solved by using virtual environments (if you use Anaconda, conda is already doing this for you). They are a tool that keeps the dependencies required by different projects separate, which helps ensure you do not brick your project by creating dependency or package conflicts. Finally, you want to run your code on something other than your laptop; it’s now time to become familiar with Kubernetes and the concept of containers. Containerising allows for scalability and lets you run your code in the cloud or on whatever compute resources your network has access to. In this context, scalable means that your project can access more compute resources when needed and then release them when they are not. To do this a few years ago, you would have needed to be a software developer or an engineer as well as a physicist, but thanks to the technology maturity curve, things that start off as custom implementations can, if they become popular enough, emerge as products (e.g. Kubernetes), and the tech becomes a commodity. We are now in the exciting situation where we can yet again stand on the shoulders of giants and make use of these powerful tools.
Once you have a container image/images for your project you will be able to distribute your ML workloads when needed, which means you can go to bed and not need to leave your laptop running overnight.
Now you have a project that can be understood and replicated by others. The only question left is how can you share the final model? And again, we will look to another discipline outside, yet still connected to physics: computational neuroscience. ModelDB is a fantastic example of how a particular discipline makes their models accessible. To directly quote their website: “ModelDB provides an accessible location for storing and efficiently retrieving computational neuroscience models. A ModelDB entry contains a model’s source code, concise description, and a citation of the article that published it.”
So, on a final note: a theoretical physicist not including the proof of a given theorem in their work should be as shocking as a computational physicist not linking their source code.