On April 9th, aiXcoder unveiled its in-house 7B code model. Across multiple mainstream evaluation benchmarks, the model outperforms all other open-source models of similar scale, demonstrating exceptional capability for a code model at the 7-billion-parameter scale.
According to comprehensive benchmark results, aiXcoder-7B excels in real-world enterprise software development scenarios rather than only on exercise-style code generation, which makes it well suited for enterprise-level private deployment. The aiXcoder-7B Base version is now open-sourced and shared with developers, rolling out progressively on platforms such as GitHub, Gitee, and Gitlink.
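As a hedged illustration, the open-sourced Base model can be loaded with the Hugging Face transformers library roughly as follows; the repository id aiXcoder/aixcoder-7b-base and the generation settings here are assumptions and should be checked against the official README.

```python
# Minimal sketch: loading the open-sourced base model for code completion.
# The repo id "aiXcoder/aixcoder-7b-base" and the generation settings are
# assumptions; consult the official README for the exact usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aiXcoder/aixcoder-7b-base"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half precision to fit a single modern GPU
    device_map="auto",
    trust_remote_code=True,
)

prompt = "def quick_sort(arr):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```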
The aiXcoder team, incubated at Peking University's Institute of Software Engineering, has been working at the intersection of AI and software development for a decade and is dedicated to serving enterprise development scenarios, making it a pioneer in intelligent software development.
Code generation and completion achieve SOTA performance:
Real-world programming involves a wide range of situations, and manually crafted test sets have limited coverage: they may suffer from limited scale and diversity, difficulty in assessing contextual understanding, and challenges in measuring generalization. The aiXcoder-7B model is therefore evaluated on test sets along different dimensions to comprehensively validate its actual capabilities and to guide model iteration and deployment.
Across multiple mainstream evaluation benchmarks covering code generation, code completion, and cross-file contextual code generation, aiXcoder-7B delivers outstanding performance. It even surpasses a 34B code model with roughly five times as many parameters, reaching the current SOTA level and making it arguably the most suitable foundation model for practical programming scenarios.
Evaluation Result 1: On mainstream code generation test sets such as HumanEval (164 Python programming questions), MBPP (974 Python programming questions), and MultiPL-E (covering 18 programming languages), the aiXcoder-7B model significantly outperforms current models of similar scale.
Evaluation Result 2: Unlike tasks in evaluation sets such as HumanEval, code generation in real development requires taking contextual information into account while writing code. On the contextual code completion evaluation set proposed by SantaCoder (Ben Allal et al., 2023), aiXcoder-7B Base achieved the best overall performance against mainstream open-source models such as StarCoder 2, CodeLlama 7B/13B, and DeepSeekCoder 7B. To assess the code completion abilities of large code models more rigorously, aiXcoder built an evaluation dataset larger than SantaCoder's, with over 16,000 samples drawn from diverse real-world development scenarios. This dataset features longer code contexts and greater diversity, closely resembling actual development projects, and on it aiXcoder-7B again outperforms the other models.
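As background on how generation benchmarks such as HumanEval and MBPP are typically scored, the sketch below shows the standard unbiased pass@k estimator introduced with HumanEval; it is a general illustration, not aiXcoder's specific evaluation harness.

```python
# General illustration: the unbiased pass@k estimator used with HumanEval-style
# benchmarks (not aiXcoder's specific harness). n = samples generated per task,
# c = samples that pass all unit tests, k = sampling budget being reported.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per task, 37 of them pass; report pass@1 and pass@10.
print(round(pass_at_k(200, 37, 1), 4), round(pass_at_k(200, 37, 10), 4))
```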
Another highlight of aiXcoder-7B compared with other large code models is its tendency to produce shorter solutions for user-specified tasks. In code completion evaluations for Java, C++, JavaScript, and Python, aiXcoder-7B Base not only achieves the best performance but also generates answers that are significantly shorter than those of other models and close in length to the standard reference answers (Ref).
Evaluation Result 3: aiXcoder-7B also performs exceptionally well on cross-file code completion tasks, which are closer to real development scenarios. On the CrossCodeEval evaluation set, designed to assess how well large code models extract contextual information across files, aiXcoder-7B achieved the best performance among models of its scale. Notably, even when aiXcoder-7B used only results retrieved from the context preceding the cursor as its prompt, while the other models were prompted with ground-truth results, it still outperformed them.
In real-world development scenarios, aiXcoder-7B offers numerous further advantages, combining technical sophistication with elegant design. Its pre-training uses a context length of 32K tokens, which can be extended to 256K at inference time, covering the vast majority of code in an entire development project. It accurately determines when new code needs to be generated and when the code logic is already complete and no further completion is necessary, directly generating complete code blocks, method bodies, and control flows. It also accurately extracts project-level contextual information, significantly reducing the hallucinations large language models tend to produce when predicting APIs.
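As a simplified illustration of what project-level context assembly can look like, the sketch below gathers likely-relevant sibling files and prepends them to the local prefix before requesting a completion; the relevance heuristic, prompt layout, and character budget are illustrative assumptions, not aiXcoder's actual retrieval logic.

```python
# Simplified sketch (assumptions, not aiXcoder's actual retrieval logic):
# gather likely-relevant sibling files, prepend them to the local prefix,
# and truncate to the model's context window before asking for a completion.
from pathlib import Path

MAX_CONTEXT_CHARS = 32_000 * 3  # rough character proxy for a 32K-token window

def build_prompt(repo_root: str, current_file: str, prefix: str) -> str:
    current = Path(current_file).read_text()
    imports = [line for line in current.splitlines()
               if line.startswith(("import ", "from "))]
    context_parts = []
    for path in Path(repo_root).rglob("*.py"):
        if str(path) == current_file:
            continue
        # Naive relevance heuristic: include files whose module name appears in the imports.
        if any(path.stem in imp for imp in imports):
            context_parts.append(f"# file: {path}\n{path.read_text()}")
    cross_file_context = "\n\n".join(context_parts)
    prompt = f"{cross_file_context}\n\n# file: {current_file}\n{prefix}"
    return prompt[-MAX_CONTEXT_CHARS:]  # keep the most recent context if too long
```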
Quality training data and targeted training methods are key:
In the realm of large models, there is a popular saying: "Garbage in, garbage out" — poor-quality input data leads to poor-quality results. Data quality is therefore paramount for large-model pre-training. The stellar performance of aiXcoder-7B is primarily attributable to high-quality training data and targeted training methods.
The training set of aiXcoder-7B comprises 1.2 trillion unique tokens spanning dozens of mainstream programming languages. When constructing the training data, the aiXcoder team carefully analyzed the syntax of these languages and filtered out erroneous code snippets. They also ran static analysis on code from more than ten mainstream languages, eliminating 163 categories of bugs and 197 common code defects to ensure the high quality of the training data.
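As a rough illustration of what syntax-level filtering can look like for a Python sub-corpus (one stage of this kind of cleaning, not the team's actual pipeline), a corpus could be screened as follows:

```python
# Rough illustration of syntax-level corpus filtering for Python
# (one stage of this kind of cleaning, not aiXcoder's actual pipeline).
import ast

def keep_snippet(source: str) -> bool:
    """Drop files that do not parse or that are trivially empty."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False          # filter out code that is not syntactically valid
    return len(tree.body) > 0  # filter out empty or whitespace-only files

corpus = ["def add(a, b):\n    return a + b\n", "def broken(:\n    pass\n"]
cleaned = [src for src in corpus if keep_snippet(src)]
print(len(cleaned))  # -> 1
```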
To strengthen the model's grasp of code semantics and structure, the aiXcoder team employed several innovative strategies. On one hand, they used code clustering and function call graphs to capture attention relationships across multiple files. On the other hand, they incorporated structural information from abstract syntax trees into the pre-training task, helping the model learn the syntax and patterns of code.
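The sketch below illustrates the general idea of grouping files by their call relationships so that related code can share one pre-training context window; the heuristics here are illustrative assumptions rather than the team's actual method.

```python
# Illustrative sketch of the general idea (assumptions, not the actual method):
# build a coarse call/definition graph across Python files, then group files
# that reference each other so they can share one pre-training context window.
import ast
from collections import defaultdict
from pathlib import Path

def defined_functions(path: Path) -> set[str]:
    tree = ast.parse(path.read_text())  # files are assumed to parse cleanly
    return {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}

def called_names(path: Path) -> set[str]:
    tree = ast.parse(path.read_text())
    return {node.func.id for node in ast.walk(tree)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}

def related_file_groups(repo_root: str) -> dict[str, set[str]]:
    files = list(Path(repo_root).rglob("*.py"))
    definers = defaultdict(set)
    for f in files:
        for name in defined_functions(f):
            definers[name].add(str(f))
    groups = defaultdict(set)
    for f in files:
        for name in called_names(f):
            # Files defining what f calls, excluding f itself.
            groups[str(f)] |= definers.get(name, set()) - {str(f)}
    return groups
```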
Overall, by training on higher-quality data and constructing pre-training tasks more closely aligned with real development behavior, aiXcoder-7B delivers the best performance among current large code models, especially in real-world development scenarios.
"Out-of-the-box adaptability" for enterprise-level code models:
Firstly, it's easy to deploy. Resources in real enterprise environments are typically limited; with only 7B parameters, aiXcoder-7B is easy to deploy, with low cost and excellent performance.
Secondly, it's easy to customize. Most enterprises have their own software development frameworks and API libraries, along with business logic and code architecture standards tailored to their specific needs. These assets are highly personalized and often confidential, so a large model must learn them through effective personalized training to truly serve the enterprise.
Thirdly, it's easy to combine. For future enterprise services, multiple 7B models can be combined into an MoE architecture to form solutions tailored to enterprise needs, so that different enterprises obtain MoE-based code model solutions matching their specific requirements, using the products while enjoying the accompanying services.
Personalization is the biggest gap in bringing enterprise-level code models to traditional industries. aiXcoder's "out-of-the-box adaptability" one-stop intelligent solution provides precise, efficient, secure, and continuous software development services for enterprise users, improving project development efficiency and code quality.
Successful "checkpoint" for the reliability of aiXcoder code models:
Through a dual-cycle ecosystem strategy of "open source + closed source," aiXcoder 7B channels industry feedback back into the technology and extends its lead. The enterprise-exclusive version is tailored for enterprise customers: real feedback on the general model, gathered from a large number of individual users and enterprise developers, reveals practical effects and pain points, which are turned into model- and product-level optimizations and rapidly applied to enterprise customers, continuously deepening enterprise product capabilities and service quality and expanding penetration of the enterprise market. The aiXcoder-7B model is at least twice as efficient as other models, greatly reducing enterprise development costs.
For over a decade, aiXcoder has pioneered the commercial exploration of large code models in China, leading the way in private code-model deployment and intelligent development for enterprises. Its core business currently focuses on three areas: private deployment, personalized training, and customized development of code models, providing one-stop, tailored solutions that ensure exclusive and efficient service for enterprise customers as applications go live.
Many enterprise customers place great importance on data security and privacy, and assets such as code cannot be uploaded to the cloud, so achieving the best results with limited GPU resources has become the biggest pain point in private enterprise deployment. aiXcoder was among the earliest to adapt its models to domestically produced AI chips and lower-end Nvidia graphics cards, and achieves the best results on them; whether the hardware is domestic or imported, it receives optimal support and performance assurance. aiXcoder also provides efficient and stable service guarantees for customers in areas such as model training and inference optimization.
Based on each customer's business needs, aiXcoder provides personalized training that incorporates domain knowledge. These personalized training solutions effectively improve model accuracy and meet the specific requirements of customers across industries and scenarios. Compared with the homogeneous training solutions offered by other vendors, aiXcoder's personalized training, built on its own native large-model technology, offers greater flexibility and specificity.
aiXcoder focuses on turning the industry experience and domain knowledge accumulated through long-term enterprise service into industrial practice, facilitating commercial adoption. With years of deep engagement in traditional key industries, the team has unique insight into these fields; by combining this expertise with customized development, aiXcoder amplifies the effectiveness of enterprise code models, achieving more with less effort. aiXcoder has served numerous top clients in industries such as banking, securities, insurance, defense, high technology, telecommunications, energy, and transportation, with a strong focus on the financial sector. Among these, the project "Application Practice of Large Code Models in the Securities Industry," carried out with a leading securities firm, won the 2023 AIIA Top Ten Potential AI Applications award and was named an Outstanding Case of AI4SE "Silver Bullet" by the China Academy of Information and Communications Technology.
The journey of exploring software automation is advancing towards an unprecedented era of intelligence. With each significant breakthrough, the aiXcoder team is dedicated to creating software systems that are smarter, more efficient, secure, and reliable. We strive to be a key driver in the reliable integration of large models with traditional software. Moving forward, we will continue to forge ahead, providing developers with even more outstanding models and services!