Only Program (Rstudio). Machine Learning to Crack the Collatz Code. Predicting the Unpredictable: Machine Learning Approaches for the Collatz Conjecture
Introduction
The Collatz conjecture, also known as the 3n+1 problem, is an unsolved problem in mathematics concerning the dynamics of certain number sequences.
The conjecture states that, starting from any positive integer, repeatedly applying the following operation always leads to 1:
- If the number is even, divide it by 2
- If the number is odd, multiply it by 3 and add 1
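The rule above can be sketched in a few lines of R. This is a minimal illustration, not the program from this post; the function name collatz_steps is a hypothetical helper, and it relies on ordinary double arithmetic, so it is only exact while every value in the sequence stays below 2^53.

```r
# Count the steps needed to reach 1 by direct iteration.
# collatz_steps is a hypothetical helper name used only for this sketch.
collatz_steps <- function(n) {
  stopifnot(n >= 1, n == floor(n))
  steps <- 0
  while (n != 1) {
    # Even: halve. Odd: multiply by 3 and add 1.
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
  }
  steps
}

collatz_steps(6)   # 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, i.e. 8 steps
collatz_steps(27)  # 111 steps
```

The step count returned here is the quantity the models in this post try to predict.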
While easy to state, the Collatz conjecture has eluded efforts to prove it for over 80 years. Directly applying the iterative Collatz algorithm on large numbers requires many computational steps.
This work presents an alternative machine learning approach to predict the number of steps needed to reach 1 for a given Collatz sequence, without executing the full algorithm.
By training statistical and machine learning models on metrics from large samples of randomly generated Collatz sequences, the steps can be predicted. This avoids the need to iterate through large sequences just to determine the length.
The following chapters outline the generation of a Collatz dataset, feature engineering, model training, final predictions, and conclusions of this machine learning approach to predict Collatz steps.
In summary, this work demonstrates a way to estimate Collatz sequence lengths without computing the full sequence, as an alternative to applying the iterative algorithm directly.
To create a dataset for training machine learning models, this program first loads several key R packages:
- tidyverse - for data manipulation and wrangling
- stats - for statistical modeling functions (ships with base R)
- randomForest - for random forest models
- gbm - for gradient boosting models
- ranger - an additional random forest package
- relaimpo - for variable importance estimation
It initializes an empty tibble called datos to accumulate the generated data.
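For reproducible results, the random number generator seed should be set with set.seed(); the listing below does not call it, so every run draws different sequences. Then a large odd number called numerox is created. It is the upper bound for the random starting values and also the number whose steps are predicted at the end.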
A sample size of z=1000 is defined for the number of Collatz sequences to generate. This provides a robust dataset for modeling.
A for loop iterates z times, each time generating a new random large odd number called number_ini based on numerox. This number_ini serves as the input for a Collatz sequence.
Within the loop, several variables are initialized to store metrics on each sequence, such as:
- pares - number of even steps
- total - total steps
- impares - number of odd steps
A collatz function is defined to implement the iterative Collatz algorithm for an input number n:
- If n is even, divide by 2
- If n is odd, multiply by 3 and add 1
This function is called with number_ini to generate the full sequence.
The metrics from each sequence are stored in a dataframe datos1, which is appended to the main datos dataframe after each iteration.
In this way, a large dataset of 1,000 randomly sampled Collatz sequences is assembled, ready for further feature engineering and modeling.
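The data-generation loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: it stores one row per sequence (the program in this post stores one row per even step), the helper collatz_counts and the data frame datos_demo are hypothetical names, and the upper bound of 1e6 and the seed value are arbitrary choices that keep every intermediate value exact in double arithmetic.

```r
set.seed(1)  # arbitrary seed, chosen only so this sketch is repeatable

# Return the even, odd, and total step counts for one starting value.
collatz_counts <- function(n) {
  evens <- 0
  odds <- 0
  while (n != 1) {
    if (n %% 2 == 0) {
      n <- n / 2
      evens <- evens + 1
    } else {
      n <- 3 * n + 1
      odds <- odds + 1
    }
  }
  c(pares = evens, impares = odds, total = evens + odds)
}

z <- 20        # small sample for illustration; the post uses z = 1000
upper <- 1e6   # hypothetical upper bound, far below numerox

# Draw random odd starting values and collect one row per sequence.
starts <- 2 * floor(runif(z, min = 1, max = upper / 2)) + 1
datos_demo <- data.frame(numini = starts, t(sapply(starts, collatz_counts)))
head(datos_demo)
```

Each row holds the starting value and its three counts, which is the raw material the later feature engineering works from.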
Chapter 3: Feature Engineering
With the raw dataset of Collatz sequence metrics assembled, additional features can be engineered to better capture patterns useful for modeling.
The initial metrics like number of steps, evens, and odds provide a starting point. But mathematical transformations of these can reveal deeper relationships.
Some engineered features include:
- Log transforms of the initial number and counts of evens/odds
- Ratios between evens, odds, and steps
- Products and differences of log-transformed values
- Polynomials and exponents of key terms
Incorporating domain knowledge about Collatz sequence properties allows creating meaningful derived variables. The natural logs and ratios between steps, evens and odds are particularly useful.
Another technique used is generating interaction features between key terms. This lets models account for combinations of variables in making predictions.
The engineered features are added to the main datos dataframe, augmenting the initial sequence metrics. This expands the dataset providing a richer input representation for the machine learning models.
With domain expertise guiding the creation of mathematical feature transformations, the model inputs are optimized to capture Collatz sequence characteristics.
The augmented dataset now has about 35 columns for each recorded step (counting the columns in the listing below), ready for training predictive models. Feature selection will further refine the set used in modeling.
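A few of the transformations described in this chapter can be sketched on a single row. The column names p1, p2, p4, p6, and k1 follow the listing in this post; ratio_eo is a hypothetical name added for illustration. The example row uses the starting value 27, which needs 70 even and 41 odd steps.

```r
# One illustrative row: start value 27 needs 70 even and 41 odd steps.
feat <- data.frame(numini = 27, pares = 70, impares = 41)
feat$total <- feat$pares + feat$impares

# Log transforms of the starting number
feat$p1 <- log(feat$numini, 2)
feat$p2 <- log(feat$numini, 3)

# Interaction (product) and ratio of the log terms
feat$p4 <- feat$p1 * feat$p2
feat$p6 <- feat$p1 / feat$p2

# Ratio between even and odd step counts (hypothetical column name)
feat$ratio_eo <- feat$pares / feat$impares

# Fractional part of log2(numini), mapped back through 2^x
feat$k1 <- 2^(feat$p1 - trunc(feat$p1))

feat
```

One thing this sketch makes visible: p6 equals log(3)/log(2) for every starting value, so a ratio of two logarithms of the same number carries no information for a model. It may be worth checking the other derived columns for the same effect before training.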
Chapter 4: Model Training
With the engineered dataset of Collatz sequences, various machine learning models can be trained to predict the number of steps.
The data is split into training and test sets for proper model evaluation. The training data is used to fit models, and test data is held back for independent assessment.
Several types of models are trained:
- Linear regression - A simple linear model predicting steps based on sequence features
- Random forest - An ensemble model averaging many decision trees, each fit on a bootstrap sample of the data
- Gradient boosting machine - An ensemble approach that adds many weak tree models one after another, each fit to the errors of the previous ones
Key hyperparameters are tuned for optimal performance including number of trees, tree depth, and learning rate.
Model performance is evaluated on the test set using metrics like R-squared and Root Mean Squared Error (RMSE). Test set metrics give an unbiased estimate of how well the models generalize.
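The split-and-evaluate procedure can be sketched with base R alone. This is an illustration under stated assumptions, not the program from this post: it uses a single predictor (log2 of the starting value) and a plain linear model, the names demo, train, test, rmse, and r2 are hypothetical, and the seed and sample sizes are arbitrary.

```r
set.seed(42)  # arbitrary seed for a repeatable sketch

# Direct step count, used here only to label the demo data.
collatz_steps <- function(n) {
  steps <- 0
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
  }
  steps
}

# 300 distinct odd starting values below about 1e5
n_start <- 2 * sample(1:50000, 300) + 1
demo <- data.frame(log2n = log2(n_start),
                   total = sapply(n_start, collatz_steps))

# 80/20 train/test split
idx <- sample(nrow(demo), size = round(0.8 * nrow(demo)))
train <- demo[idx, ]
test <- demo[-idx, ]

# Fit on the training rows only, then predict the held-out rows
fit <- lm(total ~ log2n, data = train)
pred <- predict(fit, newdata = test)

# Test-set metrics
rmse <- sqrt(mean((test$total - pred)^2))
r2 <- 1 - sum((test$total - pred)^2) / sum((test$total - mean(test$total))^2)
c(RMSE = rmse, R2 = r2)
```

A test-set R-squared computed this way can be negative when a model predicts worse than the test mean, which is a useful warning sign when comparing models.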
Among the models, gradient boosting machine (GBM) achieved the lowest RMSE. Boosting fits each new tree to the remaining errors of the previous trees, which mainly reduces bias and improved predictions.
Feature importance analysis on the GBM model revealed insights into the main drivers of Collatz sequence length. As expected, counts of evens and odds were important, along with various interaction terms.
The tuned GBM model demonstrated excellent predictive performance on new data. This model will be used in the final chapter to generate predictions and estimate Collatz sequence lengths.
By leveraging machine learning techniques on a robust training dataset, an accurate model was developed to predict Collatz steps without executing the full algorithm.
/* The program loads several R packages including tidyverse for data manipulation, stats for statistical modeling, and multiple packages for machine learning like randomForest, gbm, ranger, and relaimpo.
It initializes an empty tibble dataframe called datos to store the generated data.
It sets some options like numeric precision.
It generates a starting number called numerox that is large, odd, and random. This will be used to seed the Collatz sequences.
It defines some key parameters like z=1000 which is the number of Collatz sequences that will be generated.
It starts a for loop from 1 to z to iterate through generating the Collatz sequences.
Within each iteration of the loop:
It generates a random large odd seed number called number_ini based on numerox.
It initializes some variables to store metrics like pares, total, cero, numero, impares.
It initializes some counters like p, t, impar.
It defines a collatz function to generate the Collatz sequence for a given number n.
It calls collatz(number_ini) to generate the sequence for this iteration’s number_ini.
It stores metrics on the sequence in a dataframe called datos1.
It adds datos1 to the main datos dataframe.
So in summary, the beginning sets up the libraries, parameters, empty data structures, and then starts looping to generate random Collatz sequences and store their metrics. The next steps likely continue generating more sequences, analyzing the data, and eventually training machine learning models on it.
*/
The model training process should be repeated and tailored for each new odd number whose Collatz steps we want to predict. Some key points on re-training models:
The machine learning models are fit on a dataset of randomized Collatz sequences. But each new odd number seed represents a distinct sequence.
To accurately predict the steps for a specific odd number, the models should be re-trained on data including metrics from sequences starting close to that number.
Retraining on data with similar odd seed numbers allows the model to better capture local patterns and make accurate predictions.
Fitting the models afresh also allows updating them as more data becomes available, improving predictions.
For a given odd number input, generating new data in the region surrounding it and re-training is advised.
If fitting is kept computationally efficient, models can be re-fit quickly on new data tailored for each new odd seed number.
So in summary, the approach is not a universal static model, but rather an adaptive modeling framework that is re-trained for each new sequence to analyze. This allows capturing local dynamics and making accurate customized predictions each time.
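Generating training seeds in the region around a target number could look like the sketch below. The function nearby_odd_seeds and its parameters target, z, and width are hypothetical names introduced for illustration; the program in this post instead draws starting values uniformly between 0 and numerox.

```r
# Draw z distinct odd integers within `width` of an odd `target`,
# never returning the target itself (hypothetical helper for illustration).
nearby_odd_seeds <- function(target, z, width) {
  half <- width %/% 2
  stopifnot(target %% 2 == 1, target - 2 * half >= 1, 2 * half >= z)
  k <- setdiff(seq(-half, half), 0)  # offset 0 would reproduce the target
  target + 2 * sample(k, z)          # odd + even stays odd
}

set.seed(7)  # arbitrary seed for a repeatable sketch
seeds <- nearby_odd_seeds(target = 1000001, z = 50, width = 10000)
range(seeds)
```

Keeping the target out of the sample mirrors the listing in this post, where the target number is reserved for the final prediction and is not used for training.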
When generating random odd seed numbers for Collatz training data, R on a typical personal computer is limited to numbers up to around 10^16. This is a limit of the default number format, not of the hardware.
R stores ordinary numbers as 64-bit double-precision floating-point values, which represent every integer exactly only up to 2^53 (about 9 × 10^15).
Above that threshold, odd integers can no longer be represented, so they are silently rounded to a neighbouring even value.
However, the Collatz conjecture pertains to all positive integers, odd ones included, with no theoretical upper bound.
The same limit applies inside a sequence: 3n+1 can exceed 2^53 even when the starting number is below it.
Native 64-bit integers reach about 9.2 × 10^18, but R's built-in integer type is only 32 bits wide, so it does not help here.
To go beyond this and sample very large odd numbers as Collatz input, arbitrary-precision arithmetic is needed.
This does not require special hardware: software libraries such as the gmp package for R implement arbitrary-precision integers on ordinary CPUs, at the cost of slower arithmetic and more memory.
GPUs are not a solution either, since their single-precision format holds exact integers only up to 2^24.
In summary, the default numeric type restricts the feasible scale of odd seed numbers for training. This is an important limitation to consider when applying ML to extend Collatz research.
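The threshold can be checked directly in base R. The variable max_safe_start is a hypothetical name introduced for this sketch; it marks the largest starting value for which a single 3n+1 step is still exact.

```r
.Machine$double.digits   # 53: bits in the significand of a double
2^53 + 1 == 2^53         # TRUE: the odd neighbour of 2^53 is not representable
(2^53 + 1) %% 2          # 0: the stored value is even

# Largest starting value for which 3 * n + 1 still stays below 2^53
max_safe_start <- floor((2^53 - 1) / 3)
3 * max_safe_start + 1 < 2^53
```

Later steps of a sequence can climb higher than the first one, so staying below this bound at the start does not by itself guarantee an exact sequence.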
# List the required packages
packages <- c("tidyverse", "stats", "randomForest", "gbm", "ranger", "relaimpo")
# Check which packages are already installed
installed <- packages %in% rownames(installed.packages())
# Install any packages not yet installed
if (any(!installed)) {
  install.packages(packages[!installed]) # Install missing packages
}
# Load packages
library(tidyverse) # For data manipulation
library(stats) # For statistical models
library(randomForest) # For random forest model
library(gbm) # For gradient boosting model
library(ranger) # For random forest
library(relaimpo) # For variable importance
# Remove existing variables and data
rm(list = ls())
# Initialize empty dataframe to store data
options(digits = 18) # Number of digits printed; does not change arithmetic precision
datos = tibble()
# Choose a large odd number as the upper bound and final target
num = floor(runif(1) * 13) + 3 # Exponent used only by the random alternative below
numerox <- 1111111111111111 # Random alternative: round((runif(1)*10^num),0)
if (numerox %% 2 == 0) numerox = numerox + 1
# Define sample size
z = 1000
# Loop to generate Collatz sequences
for (i in seq(1, z)) {
  # Generate a random starting number between 0 and numerox
  number_ini <- round(runif(1) * 10^(log(numerox, 10)), 0)
  # Ensure initial number is odd
  if (number_ini %% 2 == 0) number_ini = number_ini + 1
  # Avoid starting numbers for which 3n+1 is divisible by 4
  if ((3 * number_ini + 1) %% 4 == 0) number_ini = number_ini + 2
  # Keep the target number out of the training sequences
  if (number_ini == numerox) number_ini = number_ini - 4
  # The last iteration uses the target number itself
  if (i == z) number_ini = numerox
  # Initialize variables to store values
  pares = numeric()
  total = numeric()
  cero = numeric()
  numero = numeric()
  impares = numeric()
  # Counters
  p = 0
  t = 0
  impar = 0
  # Function to generate Collatz sequence
  collatz <- function(n) {
    # Iterate until reaching 1
    while (n != 1) {
      t = t + 1 # Increment step counter
      # If number is even, divide by 2
      if (n %% 2 == 0) {
        p = p + 1 # Increment even counter
        pares <<- c(pares, p)
        total <<- c(total, t)
        impares <<- c(impares, impar)
        numero <<- c(numero, n)
        n <- n / 2
        # Check if reached end (<<- so the flag is stored outside the function, like the other vectors)
        if (n == 1) {
          cero <<- c(cero, 1)
        } else {
          cero <<- c(cero, 0)
        }
      } else { # If number is odd
        impar = n # Save odd number
        n <- 3 * n + 1 # Collatz rule
      }
    }
  }
  # In summary:
  # - Loop until n reaches 1
  # - Increment step counter
  # - If n is even, divide by 2, update counters
  # - If n is odd, save odd number and apply Collatz rule
  # - Update variables to store sequence values
  # - Check if reached end of sequence
  # Execute collatz() to generate sequence
  collatz(number_ini)
  # Create dataframe with sequence metrics (one row per even step)
  datos1 = tibble(numini = unique(number_ini),
                  impares = total - pares,
                  impares2 = (pares - log(numini, 2)) * log(2) / log(3),
                  pares2 = log(numini, 2) + trunc(impares2) * log(3) / log(2),
                  tot_pares = (log(numini, 2)),
                  total,
                  log2 = log(numini, 2) / log(numini, 3),
                  log10 = (log(numini, 10)),
                  log1 = log(numini, 2) * log(numini, 3) * total,
                  pares,
                  maximo = max(total),
                  number = 3^log(pares, 2) / 2^log(pares2, 2),
                  impares3 = pares * trunc(impares2) / pares2 * impares2,
                  log5 = log(unique(number_ini), 2) * impares2 / pares2,
                  producto = (impares2 - trunc(impares2)) + (pares - pares2),
                  dife = (pares2 * impares2) - (pares * trunc(impares2)),
                  log31 = log(numini, 2) - log(numini, 3) - trunc(log(numini, 2)) + trunc(log(numini, 3)),
                  log41 = log(numini, 2) + log(numini, 3) - trunc(log(numini, 2) + log(numini, 3)),
                  numero = numini * 3^trunc(impares2) / 2^round(pares2, 0),
                  numero4 = numero,
                  total2 = (impares2) + (pares2),
                  k1 = 2^(log(numini, 2) - trunc(log(numini, 2))),
                  k2 = 3^(log(numini, 3) - trunc(log(numini, 3))),
                  k3 = k1 + k2,
                  k4 = k1 * k2,
                  k5 = k1 - k2,
                  k6 = k1 / k2,
                  p1 = log(numini, 2),
                  p2 = log(numini, 3),
                  p3 = p1 + p2,
                  p4 = p1 * p2,
                  p5 = p1 - p2,
                  p6 = p1 / p2
  )
  datos1$log3 = datos1$numero * datos1$impares2 / (datos1$pares2 - log(datos1$numini, 2) + 1)
  datos1$log4 = datos1$pares / datos1$impares2
  datos2 = datos1
  # Append this sequence to the main dataframe (the last sequence, i == z, is kept out)
  if (i == 1) {
    datos = datos1
  }
  if (1 < i & i <= z - 1) {
    datos = rbind(datos, datos1)
  }
  if (i == z - 1) {
    # Clear existing model predictions
    datos$lmodel = NULL
    datos$predict = NULL
    # Train linear model on subset of data
    lmodel = lm(maximo ~ ., datos[datos$pares <= 4, ])
    # View linear model summary (print() is needed inside a loop)
    print(summary(lmodel))
    # Make predictions with linear model
    datos$lmodel = predict(lmodel, datos)
    # Clear predictions
    datos$predict = NULL
    # Train random forest model
    rf <- ranger(maximo ~ ., data = datos, importance = "impurity")
    # Extract variable importance
    rf_importance <- ranger::importance(rf)
    # Identify top predictors
    predictores2 <- names(rf_importance)[rf_importance > 0.01]
    # Create training data subset
    data_pca1 = datos[datos$pares <= 4, c("maximo", predictores2)]
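    # Structure as dataframe; maximo is kept only as the response y so the target does not leak into the predictors
    data_pca2 <- data.frame(y = data_pca1$maximo, data_pca1[, predictores2])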
    # Train gradient boosting model
    rf_model <- gbm(formula = y ~ ., data = data_pca2, distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 21)
  }
}
# Note: after the loop i equals z; the last sequence (numerox) was not part of model training
if (i == z) {
  # Assign datos2 to datos1
  datos1 = datos2
  # Make predictions with models
  datos1$lmodel = predict(lmodel, datos1)
  datos1$predict = predict(rf_model, datos1, n.trees = 5000)
  # Clear final dataframe
  if (exists("final")) {
    rm(final) # Remove if it exists
  }
  # Filter rows with 4 pares
  final = datos1[datos1$pares == 4, ]
  # Round predictions
  final$predict = round(final$predict, 0)
  # Calculate pares4 based on predict
  final$pares4 = ceiling(((round(final$predict, 0) - log(final$numini, 2)) /
                            (1 + log(2) / log(3))) + log(final$numini, 2))
  # Calculate impares4
  final$impares4 = round(final$predict - final$pares4)
  # Calculate div
  final$div = final$numini * 3^final$impares4 / 2^final$pares4
  # Adjust predict if needed
  if (final$div[1] > 1) final$predict[1] = final$predict[1] + 1
  if (final$div[1] < 0.50) final$predict[1] = final$predict[1] - 1
  # Recalculate pares4 and impares4
  final$pares4 = ceiling(((round(final$predict, 0) - log(final$numini, 2)) /
                            (1 + log(2) / log(3))) + log(final$numini, 2))
  final$impares4 = round(final$predict - final$pares4)
}
# Print initial number (numini) and the predicted step counts
cat("Initial Number:", final$numini[1], "\n",
    "Predicted Total Steps:", final$predict[1], "\n",
    "Predicted Even Steps:", final$pares4[1], "\n",
    "Predicted Odd Steps:", final$impares4[1], "\n")
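The pares4 formula in the listing can be read as the solution of two conditions. The derivation below is a sketch under an assumption that the post does not state: the +1 in each odd step is neglected, so a sequence from n with e even steps and o odd steps ends when n·3^o/2^e is approximately 1. The symbols T (predicted total steps), e, and o are introduced here for illustration; in the code they correspond to predict, pares4, and impares4.

```latex
\begin{align*}
e + o &= T, \qquad n \cdot \frac{3^{o}}{2^{e}} \approx 1 \\
\log_2 n + (T - e)\,\log_2 3 &= e \\
e &= \frac{\log_2 n + T \log_2 3}{1 + \log_2 3}
   = \frac{T - \log_2 n}{1 + \log 2 / \log 3} + \log_2 n
\end{align*}
```

The last expression is the quantity the code rounds up with ceiling() to obtain pares4. The div column is then n·3^o/2^e evaluated with the rounded counts, and the two adjustment lines nudge predict by one step when div falls outside the interval from 0.5 to 1.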
My ears…!
Sorry that the letters appear so big; I don't know how to fix it.
Alternative: https://www.reddit.com/r/Collatz/comments/1aohrci/only_program_rstudio_machine_learning_to_crack/