GitHub - hyunwoongko/solar-vs-glm-vs-phi: Solar vs GLM vs Phi
Key Points
- This paper refutes the claim that Solar-Open-100B is derived from GLM-4.5-Air, arguing that the high cosine similarity of Layernorm parameters, the basis of the original claim, is an unreliable indicator of model derivation.
- The author demonstrates that Layernorm parameters' initialization near 1.0 and low variance naturally lead to high cosine similarities even between unrelated models, a phenomenon corroborated by a controlled GPT-2 toy experiment.
- Furthermore, analyses using centered cosine similarity, Pearson correlation, and various absolute and relative difference metrics failed to show consistent evidence that Solar is uniquely closer to GLM than to other models such as Phi.
This paper critically examines the claim that Solar-Open-100B is derived from GLM-4.5-Air, which was based on observed cosine similarity between Layernorm parameters. The author argues that Layernorm parameter cosine similarity is not a reliable indicator of model derivation, providing a detailed analysis using multiple metrics and a controlled toy experiment.
The paper first refutes the original claim regarding Layernorm cosine similarity *within* a single model. The original assertion was that within the same model, Layernorm parameters from different layers exhibit low cosine similarity. This paper demonstrates the opposite: for Solar, GLM, and Phi models, Layernorm parameters across different layers (e.g., layers 10, 20, 30) exhibit very high cosine similarity, typically around 0.99. The discrepancy with the original claim is attributed to a specific comparison used in the original analysis: comparing input_layernorm of layer 0 with Layernorm parameters of later layers. The paper posits that input_layernorm in layer 0 is unique as it directly processes raw input embeddings, potentially absorbing low-level statistics like input scale, variance, or token frequency bias, making its parameter distribution distinct from those of later layers which process normalized hidden states. This distinction can lead to lower cosine similarity when compared to other layers. However, when comparing post_attention_layernorm from layer 0 (which processes normalized inputs) with post_attention_layernorm from later layers, similarly high cosine similarities (above 0.92) are observed, reinforcing that Layernorm parameters within the same model typically maintain high similarity across layers, especially if they are not the initial input_layernorm.
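The layer-0 effect described above can be illustrated with a minimal NumPy sketch. This is a synthetic simulation, not the repo's actual measurement: the hidden size, means, and standard deviations are assumed stand-ins for real Layernorm gain distributions, chosen only to show why a vector with distinct statistics (the layer-0 analog) yields lower cosine similarity against later layers than those layers do against each other.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # hypothetical hidden size (illustrative assumption)

def cosine(x, y):
    """Plain cosine similarity between two 1-D vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Later-layer Layernorm gains: tightly clustered around 1.0,
# mimicking the distributions the paper reports for layers 10/20/30.
ln10 = rng.normal(1.0, 0.05, d)
ln20 = rng.normal(1.0, 0.05, d)

# Layer-0 input_layernorm analog: shifted mean and larger spread,
# standing in for the distinct statistics of raw input embeddings.
ln0 = rng.normal(0.5, 0.3, d)

print(cosine(ln10, ln20))  # high, close to 1 (as reported across later layers)
print(cosine(ln0, ln10))   # noticeably lower (layer-0 is the outlier)
```

The point of the sketch is purely geometric: two vectors clustered near the same positive mean are nearly parallel, while shifting the mean and spread of one vector breaks that alignment, regardless of any training relationship.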
Next, the paper addresses the observation that Layernorm parameters of the *same layer across different models* (Solar vs. GLM, GLM vs. Phi) show high cosine similarity (above 0.9). While the original claim used this as evidence for derivation, this paper argues it is a false positive. The explanation offered is that Layernorm (or RMSNorm) weights are commonly initialized to 1.0 (e.g., `torch.ones`) and tend to maintain low variance and remain largely positive throughout training. This initial alignment of parameter vectors (pointing roughly along the all-ones vector $\mathbf{1}$) is largely preserved, resulting in high cosine similarity even if models are trained independently on different data. To support this, a toy experiment was conducted using four small GPT-2 models trained on simple arithmetic progressions. Two models had Layernorm parameters initialized to 1.0, and two had them initialized randomly. After training, the models with 1.0 initialization showed Layernorm cosine similarities between different models of approximately 0.999, while randomly initialized models showed near-zero cosine similarities. This experiment suggests that Layernorm weight cosine similarity strongly reflects the common initialization prior rather than shared training data or derivation.
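The initialization-prior argument can be reproduced without training any model at all. The following NumPy sketch (an illustrative assumption, not the repo's toy-experiment code; the dimension and noise scale are made up) samples two independent "trained" gain vectors that both started at a `torch.ones`-style init and drifted slightly, versus two vectors from a zero-mean random init:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4096  # hypothetical hidden size (illustrative assumption)

def cosine(x, y):
    """Plain cosine similarity between two 1-D vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Two independently "trained" gain vectors sharing only the 1.0 init
# prior: each drifted a little, but both still point near all-ones.
a = 1.0 + rng.normal(0.0, 0.02, d)
b = 1.0 + rng.normal(0.0, 0.02, d)

# Two vectors from a zero-mean random init, as in the toy experiment's
# randomly initialized GPT-2 pair.
ra = rng.normal(0.0, 0.02, d)
rb = rng.normal(0.0, 0.02, d)

print(cosine(a, b))    # very high: the shared 1.0 prior dominates
print(cosine(ra, rb))  # near zero: no common direction to share
```

This matches the qualitative outcome the paper reports (≈0.999 for ones-initialized pairs, near zero for random pairs) using nothing but the shared offset, which is exactly the paper's point: the similarity measures the prior, not the training.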
To provide a more robust analysis, the paper introduces and applies several alternative metrics beyond simple cosine similarity:
- Centered Cosine Similarity: This metric, defined as $\frac{(x - \bar{x}) \cdot (y - \bar{y})}{\lVert x - \bar{x} \rVert \, \lVert y - \bar{y} \rVert}$ (with the mean subtracted element-wise), calculates cosine similarity after subtracting the mean from each vector. This removes the influence of common offsets (like all elements being near 1.0) and focuses on the similarity of relative patterns. When applied to Layernorm parameters across different models (Solar vs. GLM vs. Phi), centered cosine similarity values mostly dropped to near zero, indicating that the high conventional cosine similarities were primarily due to the common distribution characteristics (e.g., mean and scale) of Layernorm weights rather than identical relative patterns. However, within the same model, different layers still exhibited moderately high centered cosine similarity (though lower than 0.99), suggesting some degree of pattern similarity even after mean removal.
- Pearson Correlation Coefficient: Similar to centered cosine similarity, this metric also showed low values between Layernorm parameters of different models and higher values within the same model, reinforcing the centered cosine findings.
- Mean Absolute Difference (Mean L1 Distance): Defined as $\frac{1}{n}\sum_{i=1}^{n} \lvert x_i - y_i \rvert$, this metric measures the average element-wise absolute difference. For Layernorm parameters, the mean absolute difference between Phi and GLM was often smaller than between Solar and GLM, challenging the idea that Solar is more closely related to GLM.
- p99 Absolute Difference: This metric compares the 99th percentile of the element-wise absolute differences, $\mathrm{p99}(\lvert x_i - y_i \rvert)$. It is sensitive to differences in the "tail" of the weight distributions. Results showed that the p99 difference between Phi and GLM was frequently smaller than between Solar and GLM, again suggesting no consistently closer relationship between Solar and GLM.
- Relative L2 Distance: Defined approximately as $\frac{\lVert x - y \rVert_2}{\lVert y \rVert_2}$, this normalizes the L2 difference by the L2 norm of a reference vector, making comparisons fairer when absolute scales vary. For Layernorm parameters, the relative L2 distance indicated that Solar and GLM were often relatively far apart (e.g., 1.96 for a specific layer), while even Solar and Phi showed a meaningful distance (0.12). Crucially, for `k_proj` and `v_proj` parameters (large matrices with identical shapes), the relative L2 distance was significant between all model pairs, indicating distinct scales, in contrast to their small absolute differences.
- CV Difference (Coefficient of Variation Difference): Defined as $\lvert \mathrm{CV}(x) - \mathrm{CV}(y) \rvert$ with $\mathrm{CV}(v) = \sigma_v / \mu_v$, this measures the difference in relative dispersion. For all compared Layernorm parameters across Solar, GLM, and Phi, the CV difference was almost zero, suggesting that despite differences in absolute values or patterns, the statistical distribution shape (mean-normalized standard deviation) of Layernorm weights is highly consistent across these models.
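The metrics in the list above are all a few lines each. The following is an illustrative NumPy re-implementation applied to synthetic gain vectors, not the repository's actual code; the vector distributions and size are assumptions chosen to mirror two unrelated models whose Layernorm weights both cluster near 1.0:

```python
import numpy as np

def centered_cosine(x, y):
    """Cosine similarity after mean removal; equals the Pearson correlation."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def mean_abs_diff(x, y):
    """Average element-wise absolute difference (mean L1 distance)."""
    return float(np.mean(np.abs(x - y)))

def p99_abs_diff(x, y):
    """99th percentile of the element-wise absolute differences."""
    return float(np.percentile(np.abs(x - y), 99))

def relative_l2(x, y):
    """L2 difference normalized by the L2 norm of the reference vector y."""
    return float(np.linalg.norm(x - y) / np.linalg.norm(y))

def cv_diff(x, y):
    """Difference in coefficient of variation (std / mean)."""
    cv = lambda v: v.std() / v.mean()
    return float(abs(cv(x) - cv(y)))

# Synthetic stand-ins for the Layernorm gains of two unrelated models:
# both cluster near 1.0, so plain cosine is high even though the
# pattern-sensitive metrics stay near zero.
rng = np.random.default_rng(2)
x = 1.0 + rng.normal(0.0, 0.05, 4096)
y = 1.0 + rng.normal(0.0, 0.05, 4096)

plain_cos = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(plain_cos)              # high despite independent sampling
print(centered_cosine(x, y))  # near zero: no shared relative pattern
print(cv_diff(x, y))          # near zero: same distribution shape
```

On these synthetic vectors the sketch reproduces the paper's qualitative pattern: plain cosine alone looks like strong evidence of relatedness, while the centered and dispersion-aware metrics reveal there is no shared structure beyond the common offset.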
In conclusion, the paper asserts that relying solely on Layernorm weight cosine similarity to infer model derivation is misleading. The high cosine similarities observed are likely artifacts of common initialization strategies and the inherent properties of Layernorm parameters (low variance, positive bias). Analysis using centered cosine similarity, Pearson correlation, mean absolute difference, p99 absolute difference, and relative L2 distance consistently failed to provide clear, uniform evidence that Solar-Open-100B is uniquely or consistently more similar to GLM-4.5-Air than to Phi-3.5-MoE-instruct, or vice versa, especially in a way that would suggest a derivation relationship. This suggests that the models' parameter spaces are distinct enough that a direct derivative relationship cannot be established through these parameter comparisons.