GitHub Copilot

Here are some notes about GitHub Copilot. The concepts were extracted from documentation and Udemy training.


Definition

GitHub Copilot is an AI coding assistant that helps you write code faster and with less effort, allowing you to focus more energy on problem solving and collaboration. [What is][QuickStart][Immersive view]

Machine Learning

Supervised: learns from labeled input - e.g. algorithms: classification, regression
Unsupervised: learns from unlabeled input - e.g. algorithm: clustering
Reinforcement: learns from feedback - e.g. algorithm: decision making
Process: input text -> tokenization ([create, a, file]) -> embedding generation (each token is converted into a vector embedding) -> model processing
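A toy sketch of the first two steps of that process (tokenization, then embedding generation). It assumes whitespace tokenization and a hash-based pseudo-embedding; both are illustrative stand-ins, since real models use subword tokenizers and learned embedding tables. The names tokenize and embed are hypothetical.

```python
import hashlib


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (real tokenizers use subword units)."""
    return text.lower().split()


def embed(token: str, dim: int = 4) -> list[float]:
    """Map a token to a deterministic vector with values in [0, 1).

    Hypothetical stand-in: a trained model looks vectors up in a learned table.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 256 for i in range(dim)]


tokens = tokenize("Create a file")
vectors = [embed(token) for token in tokens]
print(tokens)                         # ['create', 'a', 'file']
print(len(vectors), len(vectors[0]))  # 3 4
```

The vectors would then be passed to the model for processing, which this sketch does not attempt.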

Characteristics

Probabilistic: may generate different outputs for the same input
Focused on coding-related questions
Primarily English
Originally powered by OpenAI’s Codex model, a machine-learning model derived from GPT-3
Can generate source code, documentation, .gitignore files, commit messages, unit tests
It is available in the IDE, GitHub Mobile, the command line and GitHub.com (only Enterprise)
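To illustrate the "probabilistic" characteristic above: a minimal sketch of temperature sampling, one common way language models pick the next token. The candidate tokens and scores are made up, and nothing here is confirmed as Copilot's actual decoding strategy.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_next_token(candidates, scores, rng, temperature=1.0):
    """Draw one candidate at random, weighted by its softmax probability."""
    probabilities = softmax(scores, temperature)
    return rng.choices(candidates, weights=probabilities, k=1)[0]


# Hypothetical candidates and scores for the same prompt.
candidates = ["return", "print", "raise"]
scores = [2.0, 1.0, 0.5]
rng = random.Random()  # unseeded, so repeated runs can differ
print([sample_next_token(candidates, scores, rng) for _ in range(5)])
```

Because each token is drawn at random from a distribution, the same input can yield different completions from one run to the next.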

Features [1][2]

Code suggestions
Code Review
Understand the context of the code
Multi language support
Intelligent debugging
Code refactoring
Security assistance
Writing documentation
Autocomplete
Automate the creation of projects and related directories
Chat
Support in the CLI
- Generates shell commands from intent described in plain language
- Suggests shell commands based on natural language input, helping with syntax and automating repetitive tasks
- Setup: install GitHub CLI (gh), then
  1. Authenticate with 'gh auth login'
  2. Run 'gh extension install github/gh-copilot' (integrates Copilot into the GitHub CLI ecosystem)
  3. Check that it is correctly installed: 'gh extension list'
AI-generated PR summaries (only for Enterprise): analyzes the diff of the code changes, the commit history, and comments
Common tasks: refactoring, migrating a project, writing tests, modernizing legacy code, upgrading Java projects, choosing the right AI tool for your task

Training

It is trained on all languages that appear in public repositories (including open-source repositories). The quality of suggestions depends on the volume and diversity of training data for each language
Large training dataset from public repositories > neural network architecture based on the transformer, trained with unsupervised learning (learns patterns and structure without labels) > supervised learning during the fine-tuning process (learning from examples helps the model understand context and improves accuracy) > outcome: Codex model (descendant of GPT-3; based on the transformer architecture)
GitHub Copilot’s model is trained on a static dataset that includes publicly available code and is not updated in real time. Suggestions may therefore be outdated or rely on old or deprecated code
Proprietary code from private repositories is explicitly excluded from GitHub Copilot’s training dataset to protect user privacy and ensure compliance with ethical standards.

How it works

Transmits the code and its surrounding context to a large language model (e.g. Codex), which is hosted remotely (cloud-based models)
Only the code currently being worked on (and metadata), such as a few lines or function definitions, is sent to OpenAI’s servers for processing. The model generates suggestions based on the context being actively edited, without needing to access the entire project or any other unrelated data
Request flow: (1) the request is sent to GitHub Copilot's servers as anonymized, encrypted data; (2) it is forwarded to a proxy server that pre-processes the data (such as context and completion suggestions). The proxy filters user inputs, removing sensitive or personally identifiable information before sending the data to the cloud-based model, so the suggestions are based on a sanitized version of the data, without leaking private code or sensitive data; (3) the sanitized request is passed to the model; (4) once the model generates a suggestion, it undergoes post-processing through the proxy server; (5) the result is sent back to your IDE
It sends small snippets of code (a few lines around the cursor) to GitHub’s servers, where an AI model processes the data and generates relevant code completions. These snippets are temporarily processed to provide real-time suggestions
No user-specific data is stored or logged persistently. The processing is done in memory and the data is discarded once the suggestions are generated
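The pre-process, model, post-process flow described above can be sketched as a chain of small functions. Everything here is an assumption for illustration: the redaction patterns, the function names, and the fake_model stub are hypothetical, and the notes do not describe the proxy's real filtering rules.

```python
import re

# Hypothetical patterns for obviously sensitive content.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET = re.compile(r"""(api_key|password|token)\s*=\s*(['"]).*?\2""", re.IGNORECASE)


def pre_process(snippet: str) -> str:
    """Proxy step: redact e-mail addresses and hard-coded secrets."""
    snippet = EMAIL.sub("<email>", snippet)
    return SECRET.sub(
        lambda m: f"{m.group(1)} = {m.group(2)}<redacted>{m.group(2)}", snippet
    )


def fake_model(prompt: str) -> str:
    """Stand-in for the remote model: always returns the same completion."""
    return "    return a + b\n\n\n"


def post_process(suggestion: str) -> str:
    """Proxy step: trim trailing whitespace before returning to the IDE."""
    return suggestion.rstrip() + "\n"


def complete(snippet: str) -> str:
    """Run the full sketch: sanitize, call the model, tidy the result."""
    return post_process(fake_model(pre_process(snippet)))


print(pre_process('password = "hunter2"  # ask bob@example.com'))
print(repr(complete("def add(a, b):\n")))
```

Keeping each stage as a separate function mirrors the idea that nothing is stored between requests: each call works only on the snippet it is given.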
Duplication:
- GitHub Copilot is designed to avoid suggesting code that matches more than 150 characters from any single block in publicly available repositories
- The duplication detector filter compares generated code suggestions against a database of popular, open-source repositories and flags suggestions if a certain similarity threshold is met.
- The duplication filter prevents Copilot from suggesting exact matches to publicly available code, ensuring that its suggestions are not direct copies of existing repositories. However, it does not completely eliminate similar code from being suggested.
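A naive sketch of the duplication check described above, assuming the filter looks for any run of 150 characters shared with a known public snippet after whitespace is normalized. The function names and the in-memory list of snippets are hypothetical; the real detector's matching method is not described in these notes.

```python
MATCH_LENGTH = 150  # threshold taken from the notes above


def normalize(code: str) -> str:
    """Collapse all whitespace so formatting differences do not hide a match."""
    return " ".join(code.split())


def matches_public_code(suggestion, public_snippets, match_length=MATCH_LENGTH):
    """Return True if any match_length-character window of the suggestion
    also appears in one of the public snippets."""
    text = normalize(suggestion)
    if len(text) < match_length:
        return False
    corpus = [normalize(snippet) for snippet in public_snippets]
    for start in range(len(text) - match_length + 1):
        window = text[start:start + match_length]
        if any(window in snippet for snippet in corpus):
            return True
    return False


public = ["x = 1\n" * 40]  # hypothetical stand-in for indexed public code
print(matches_public_code("x = 1\n" * 40, public))  # True
print(matches_public_code("x = 1", public))         # False
```

Shorter overlaps pass through this check, which is consistent with the note that similar code can still be suggested.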
Telemetry:
- GitHub Copilot collects anonymized telemetry data, which may be shared with Microsoft and GitHub to improve services.
- Telemetry data may be collected, and snippets of user code can be logged for debugging purposes
- Telemetry collection from Copilot can offer insights into how much time is saved, how often suggestions are accepted, and even areas where Copilot may not be helpful
- The GitHub Productivity API allows teams to track how frequently developers accept Copilot’s suggestions
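As a rough illustration of the acceptance tracking mentioned above, the sketch below computes acceptance rates from a list of suggestion events. The SuggestionEvent record and its fields are hypothetical and do not reflect the shape of any real telemetry payload.

```python
from dataclasses import dataclass


@dataclass
class SuggestionEvent:
    """Hypothetical telemetry record: who saw a suggestion and whether they kept it."""
    user: str
    accepted: bool


def acceptance_rate(events):
    """Fraction of suggestions that were accepted (0.0 when there are none)."""
    if not events:
        return 0.0
    return sum(event.accepted for event in events) / len(events)


def acceptance_by_user(events):
    """Group events per user and compute each user's acceptance rate."""
    grouped = {}
    for event in events:
        grouped.setdefault(event.user, []).append(event)
    return {user: acceptance_rate(items) for user, items in grouped.items()}


events = [
    SuggestionEvent("ana", True),
    SuggestionEvent("ana", False),
    SuggestionEvent("bo", True),
]
print(acceptance_by_user(events))  # {'ana': 0.5, 'bo': 1.0}
```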
Since Copilot is trained on vast datasets of publicly available code, it tends to generate widely used, conventional coding patterns. This makes it useful for common programming tasks but less innovative for highly specific, custom, or domain-specific solutions. Developers may need to modify or refine the suggestions to better fit unique use cases.
Regularly reviewing AI-generated code for bias is a critical step in responsible AI use
Using Copilot in private mode limits the scope of suggestions and reduces the risk of sensitive data leakage. Excluding sensitive files from Copilot’s context is a safeguard that ensures Copilot doesn’t suggest code based on sensitive internal data. This is a best practice for maintaining privacy while still using Copilot's benefits.
GitHub Copilot generates suggestions from a model trained on publicly available code. It uses natural language processing (NLP) and pattern recognition techniques to identify relevant code based on the context of the user's current input, generating suggestions that match the programming style and libraries in use.
GitHub Copilot uses a limited context window to process a portion of the code at a time. The suggestions are based on the code within this window, which means that in longer files or complex codebases, suggestions may not fully reflect the overall structure or intent of the project because parts of the code fall outside the context window.
Pressing Ctrl+Space manually requests GitHub Copilot to generate code suggestions in supported IDEs like Visual Studio Code. This works in scenarios where Copilot does not auto-suggest completions but the user still wants assistance.
GitHub Copilot Chat prioritizes inline comments and existing code context when generating responses
GitHub Copilot provides multiple code suggestions when invoked via the appropriate keyboard shortcut (e.g., pressing Ctrl+Enter or Alt+] in some IDEs)
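The limited context window described in this section can be pictured with a small sketch: starting at the cursor line, take neighbouring lines until a character budget is used up. The budget, the character-based accounting, and the function name are assumptions; the real selection logic is not documented in these notes.

```python
def context_window(lines, cursor, budget):
    """Return the lines around `cursor` that fit in `budget` characters.

    The cursor line is always included, even if it alone exceeds the budget.
    """
    start = end = cursor  # inclusive window [start, end]
    used = len(lines[cursor])
    while True:
        grew = False
        # Try to extend one line upward, then one line downward.
        if start > 0 and used + len(lines[start - 1]) <= budget:
            start -= 1
            used += len(lines[start])
            grew = True
        if end < len(lines) - 1 and used + len(lines[end + 1]) <= budget:
            end += 1
            used += len(lines[end])
            grew = True
        if not grew:
            return lines[start:end + 1]


source = ["aaaa", "bb", "cc", "dd", "eeee"]
print(context_window(source, cursor=2, budget=6))  # ['bb', 'cc', 'dd']
```

Lines that fall outside the returned window ("aaaa" and "eeee" here) are invisible to the model, which is why suggestions in long files may miss the overall structure.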

Chat

Inputs are sent directly to OpenAI’s Codex API for processing
You can choose the AI model [1]
Coding-related questions, explanations for code snippets, debugging help, and real-time code suggestions based on current coding environment [1]
Prompt engineering: Start general, then get specific; give examples; Break complex tasks into simpler tasks; Avoid ambiguity; Indicate relevant code; Experiment and iterate; Keep history relevant; Follow good coding practices [1][2][3][4][5]
Very effective for helping developers understand unfamiliar code by explaining its functionality, dependencies, and usage
Includes built-in feedback mechanisms (rate by clicking thumbs-up or thumbs-down buttons)
Feedback about GitHub Copilot Chat is through the in-editor feedback button, which allows users to send context-specific feedback while using Copilot
Builds a prompt by extracting relevant portions of the currently open file, taking into account the user’s cursor position, function signatures, surrounding comments, and contextual code
It cannot execute code directly within the chat interface
GitHub Mobile includes the chat feature, but with some quality limitations
Edit mode is used for more granular control: you choose the files in which Copilot may make changes [1]
Chat for debugging: the developer describes issues in natural language and, based on the context of the surrounding code, Copilot suggests how to address potential bugs
Improve Copilot's response relevance and performance in large codebases by limiting the number of open files or tabs in the editor
GitHub Copilot Chat is available for both public and private repositories for individual and enterprise plans, with the necessary permissions.

Agent [1]

Copilot can work like a developer: fix bugs, implement features, create PR, etc [1]

Subscription [1][2]

Non-GitHub users can access GitHub Copilot through Microsoft Visual Studio and Visual Studio Code if they have an Azure subscription
GitHub Copilot requires users to have a GitHub account for authentication and subscription purposes. However, it does not require users to host their repositories on GitHub
Free: code completion (2,000 completions/month), chat (50 messages/month), block suggestions, access to Claude Sonnet and OpenAI GPT models
Pro: code completion (no limit), chat (no limit), chat in GitHub Mobile
Pro+: PRO + Full access to all available models in Copilot Chat; Up to 1,500 premium requests per month; Priority access to advanced AI capabilities
Business:
- All previous + file exclusion, organization wide policy, audit logs, support for public and private repositories, manage policies at the enterprise or organization level
- Enables admin-level control, telemetry for auditing, and organization-wide enforcement of code matching filters, which scan for snippets that closely match public GitHub content—crucial for enterprise risk mitigation
- Allows the organization to configure the service to meet company-wide policies and exclude specific files from being evaluated
- Organization-wide policies, such as disabling suggestions that match public code
- Designed for organizations that need data privacy and security features
- Enterprise-grade security practices, including end-to-end encryption of data in transit and at rest
- GitHub Copilot Business is designed with enterprise clients in mind, offering key security and compliance features such as SOC 2 Type 2 and GDPR compliance
- It has the ability to restrict AI-generated code suggestions based on organization policies
- Provides an IP (intellectual property) indemnity clause that covers claims against the generated code in certain scenarios
- SSO integration
- Private code generation
- Vulnerability scanning
- The "block matching public code" feature reduces copyright and licensing risks: when enabled, Copilot does not show code suggestions that match public repositories
- REST API:
  - Automate the management of subscriptions using GitHub’s REST API to list, add, and remove GitHub Copilot seats for users in an organization
  - Seats are managed and assigned per user; assignment is not role-based
  - Add users to the organization's subscription: POST /orgs/{org}/copilot/billing/selected_users
  - Remove users: the administrator sends a DELETE request to /orgs/{org}/copilot/billing/selected_users with the usernames in the request body
  - List seat assignments: GET /orgs/{org}/copilot/billing/seats, using a token with the manage_billing:copilot or admin:org scope. An authenticated token (OAuth or personal access token) is required for organizational access
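A minimal sketch of seat management through the REST API, using only the standard library. The paths (/orgs/{org}/copilot/billing/seats and /orgs/{org}/copilot/billing/selected_users) and the selected_usernames body field are the ones I believe GitHub's REST reference documents for Copilot user management; check them against the current reference before use. The organization name, usernames, and token are placeholders, and the helper names are hypothetical. The code only builds the requests and does not send them.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(method, path, token, payload=None):
    """Build (but do not send) an authenticated GitHub REST API request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url=f"{API_ROOT}{path}",
        data=data,
        method=method,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


def list_seats(org, token):
    """Request that lists the organization's Copilot seat assignments."""
    return build_request("GET", f"/orgs/{org}/copilot/billing/seats", token)


def add_users(org, usernames, token):
    """Request that grants Copilot seats to the given users."""
    path = f"/orgs/{org}/copilot/billing/selected_users"
    return build_request("POST", path, token, {"selected_usernames": usernames})


def remove_users(org, usernames, token):
    """Request that removes Copilot seats from the given users."""
    path = f"/orgs/{org}/copilot/billing/selected_users"
    return build_request("DELETE", path, token, {"selected_usernames": usernames})


request = remove_users("acme", ["octocat"], token="<token>")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```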
Enterprise:
- Everything in Business + Copilot knowledge bases (improve accuracy), fine-tuning a custom LLM
- Knowledge base: a dedicated repository that holds all the relevant documentation, code, and libraries; it makes the contents available for enhanced coding suggestions, ensuring that organization-specific practices are reflected in Copilot’s output. The most useful types of knowledge to store are code snippets, standardized functions, and reusable components from internal repositories
- Provides the ability to manage licenses and users at scale
- Centralized administrative controls (license management, security policies, billing)
- Analyzes commit messages, file changes, and project context to generate a concise pull request summary
- Best option for large organizations with strict privacy and security concerns
- It includes advanced privacy controls, like the ability to configure context exclusions, enforce corporate policy integration, and ensure that sensitive codebases are handled securely. It also allows admins to configure policies at the organization level to specify which repositories are enabled for Copilot
- GitHub Copilot Enterprise includes features like enhanced telemetry and more granular data retention policies. Telemetry data lets admins monitor usage while complying with internal security policies

Access

GitHub Copilot Individual is designed for single-user environments and does not include team management features like access control. Alternatively, a user gets access as a member of an organization that has a subscription.
Free Copilot Pro: students, teachers, and maintainers of popular OSS projects

Configuration

'.copilot' can be used to disable suggestions
'copilot.yaml' can be used to configure content exclusion
'GitHub Copilot editor config' to ensure that code suggestions from Copilot do not incorporate sensitive internal code: use the exclude directive ("exclude": true) in the Copilot editor config file for directories or files
The "excludePatterns" directive in the editor config file excludes specific directories, files, or private repositories from GitHub Copilot’s completions, protecting proprietary or internal code from being suggested
Exclude sensitive content within a repository: Enable "Copilot Exclusion Rules" in the repository's settings, specifying the files and directories containing sensitive information

Security

Audit Logs: track Copilot usage at a high level (organizational level - user and admin activities related to GitHub Copilot) such as when users enable or disable Copilot in their settings, subscription updates, unauthorized access to GitHub Copilot
Search for audit events: filters or search queries can track administrative actions on Copilot access, e.g. action:copilot.access_enabled to identify events where Copilot access was granted or enabled for organization members
Admins can search for Copilot-related events by using the query "copilot" within the GitHub organization audit log, filtering results by actor, event type, and date range
GitHub provides options to configure repository-level exclusion rules to prevent sensitive files and directories from being accessed by GitHub Copilot
If a policy is applied at the enterprise level, all organizations within the enterprise inherit the policy
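The audit-log searches described in this section can also be scripted. The sketch below only builds a query URL; it assumes the organization audit-log REST endpoint (/orgs/{org}/audit-log with a phrase parameter, which as far as I know requires GitHub Enterprise Cloud) and the action:, actor:, and created: search qualifiers. The organization and user names are placeholders, and audit_log_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def audit_log_url(org, action=None, actor=None, created_after=None):
    """Build an audit-log search URL filtered by action, actor, and date."""
    terms = []
    if action:
        terms.append(f"action:{action}")
    if actor:
        terms.append(f"actor:{actor}")
    if created_after:
        terms.append(f"created:>={created_after}")
    query = urlencode({"phrase": " ".join(terms), "per_page": 100})
    return f"https://api.github.com/orgs/{org}/audit-log?{query}"


url = audit_log_url("acme", action="copilot.access_enabled", actor="octocat")
# Decode the query string again to show the search phrase that would be sent.
print(parse_qs(urlsplit(url).query)["phrase"][0])
```

Sending the request needs an authenticated token with permission to read the organization's audit log, which this sketch leaves out.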

Developer Responsibility [1]

Developers should review AI-generated code for security vulnerabilities and license issues, and ensure compliance with internal policies before using it in production. This review helps mitigate copyright violations and security risks
The code generated by Copilot can be used, modified, and distributed by developers as if they had written it manually
Copilot can generate code that is similar to existing public code, but it does not track or enforce license compliance. Developers must verify this themselves
Bias: the developer is responsible for reviewing code to avoid biased suggestions and for making appropriate edits, so that the open-source project remains inclusive and avoids reinforcing harmful stereotypes