From david@redhat.com Thu Aug 25 12:40:27 2022
From: David Hildenbrand <david@redhat.com>
Date: Wed, 24 Aug 2022 21:23:33 +0200
Subject: mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand <david@redhat.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Axel Rasmussen <axelrasmussen@google.com>, Nadav Amit <nadav.amit@gmail.com>, Peter Xu <peterx@redhat.com>, Hugh Dickins <hughd@google.com>, Andrea Arcangeli <aarcange@redhat.com>, Matthew Wilcox <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, John Hubbard <jhubbard@nvidia.com>, Jason Gunthorpe <jgg@nvidia.com>, David Laight <David.Laight@ACULAB.COM>, stable@vger.kernel.org
Message-ID: <20220824192333.287405-1-david@redhat.com>

From: David Hildenbrand <david@redhat.com>

commit 5535be3099717646781ce1540cf725965d680e7b upstream.

Ever since the Dirty COW (CVE-2016-5195) security issue, we have known
that FOLL_FORCE can be dangerous, especially when there are races that
can be exploited by user space.

Right now, it would be sufficient for some code to set a PTE of an
R/O-mapped shared page dirty in order for it to erroneously become
writable via FOLL_FORCE. The implications of setting a write-protected
PTE dirty might not be immediately obvious to everyone.

And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
a shmem page R/O while marking the pte dirty. This can be used by
unprivileged user space to modify tmpfs/shmem file content even if the
user does not have write permissions to the file, and to bypass memfd
write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).

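For reference, this is the pre-patch check (quoted from the removal in
the first mm/gup.c hunk below). The pte_dirty() test is the problematic
part: it treats any dirty PTE as proof that COW was already broken, which
a dirty R/O PTE installed via UFFDIO_CONTINUE satisfies without any COW
having happened:

  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
  {
  	return pte_write(pte) ||
  		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
  }
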
To fix such security issues for good, the insight is that we really only
need that fancy retry logic (FOLL_COW) for COW mappings that are not
writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
we have an exclusive anonymous page mapped. If we have something else
mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
we have to trigger a write fault to break COW. If we don't find an
exclusive anonymous page when we retry, we have to trigger COW breaking
once again because something intervened.

Let's move away from this mandatory-retry + dirty handling and rely on our
PageAnonExclusive() flag for making a similar decision, to use the same
COW logic as in other kernel parts here as well. In case we stumble over
a PTE in a COW mapping that does not map an exclusive anonymous page, COW
was not properly broken and we have to trigger a fake write-fault to break
COW.

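Condensed from the new can_follow_write_pte() added below (the PMD
variant is identical in structure), the FOLL_FORCE decision for a
non-writable PTE becomes:

  /* FOLL_FORCE only applies to private mappings that lack VM_WRITE ... */
  if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
  	return false;
  if (!(vma->vm_flags & VM_MAYWRITE) || (vma->vm_flags & VM_WRITE))
  	return false;
  /* ... and only an exclusive anonymous page proves COW was broken. */
  if (!page || !PageAnon(page) || !PageAnonExclusive(page))
  	return false;
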
Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e
("mm/mprotect: try avoiding write faults for exclusive anonymous pages
when changing protection") and commit 76aefad628aa ("mm/mprotect: fix
soft-dirty check in can_change_pte_writable()"), take care of softdirty
and uffd-wp manually.

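Concretely, even with an exclusive anonymous page, the new check (quoted
from the hunk below) still refuses and forces a write fault when softdirty
tracking or uffd-wp needs to observe the write:

  /* ... and a write-fault isn't required for other reasons. */
  if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) &&
      !(vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
  	return false;
  return !userfaultfd_pte_wp(vma, pte);
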
For example, a write() via /proc/self/mem to a uffd-wp-protected range has
to fail instead of silently granting write access and bypassing the
userspace fault handler. Note that FOLL_FORCE is not only used for debug
access, but also triggered by applications without debug intentions, for
example, when pinning pages via RDMA.

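As a minimal user-space sketch (illustrative only, not part of this
patch): a write through /proc/self/mem is serviced by GUP with
FOLL_FORCE, which is exactly the path that must honor the checks above:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
  	static char target[] = "hello";	/* some mapped page of ours */
  	int fd = open("/proc/self/mem", O_RDWR);

  	if (fd < 0)
  		return 1;
  	/* The offset into /proc/self/mem is the virtual address;
  	 * the kernel performs the access via FOLL_FORCE GUP. */
  	if (pwrite(fd, "H", 1, (off_t)(uintptr_t)target) != 1)
  		perror("pwrite");
  	close(fd);
  	return 0;
  }
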
This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.

Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
let's just get rid of it.

Thanks to Nadav Amit for pointing out that the pte_dirty() check in
FOLL_FORCE code is problematic and might be exploitable.

Note 1: We don't check for the PTE being dirty because it doesn't matter
for making a "was COWed" decision anymore, and whoever modifies the
page has to set the page dirty either way.

Note 2: Kernels before extended uffd-wp support and before
PageAnonExclusive (< 5.19) can simply revert the problematic
commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
v5.19 requires minor adjustments due to lack of
vma_soft_dirty_enabled().

Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.16]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/mm.h |    1 
 mm/gup.c           |   69 ++++++++++++++++++++++++++++++++++++-----------------
 mm/huge_memory.c   |   65 +++++++++++++++++++++++++++++++++----------------
 3 files changed, 91 insertions(+), 44 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2939,7 +2939,6 @@ struct page *follow_page(struct vm_area_
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,14 +478,43 @@ static int follow_pfn_pte(struct vm_area
 	return -EEXIST;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	/* If the pte is writable, we can write to the page. */
+	if (pte_write(pte))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) &&
+	    !(vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +557,19 @@ retry:
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
-	}
 
 	page = vm_normal_page(vma, address, pte);
+
+	/*
+	 * We only care about anon pages in can_follow_write_pte() and don't
+	 * have to worry about pte_devmap() because they are never anon.
+	 */
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pte(pte, page, vma, flags)) {
+		page = NULL;
+		goto out;
+	}
+
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -967,17 +1003,6 @@ static int faultin_page(struct vm_area_s
 		return -EBUSY;
 	}
 
-	/*
-	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
-	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
-	 * can thus safely do subsequent page lookups as if they were reads.
-	 * But only do so when looping for pte_write is futile: in some cases
-	 * userspace may also be wanting to write to the gotten user page,
-	 * which a read fault here might prevent (a readonly page might get
-	 * reCOWed by userspace write).
-	 */
-	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags |= FOLL_COW;
 	return 0;
 }
 
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -978,12 +978,6 @@ struct page *follow_devmap_pmd(struct vm
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -1349,14 +1343,43 @@ fallback:
 	return VM_FAULT_FALLBACK;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	/* If the pmd is writable, we can write to the page. */
+	if (pmd_write(pmd))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) &&
+	    !(vma->vm_flags & VM_SOFTDIRTY) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1365,12 +1388,16 @@ struct page *follow_trans_huge_pmd(struc
 				   unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page = NULL;
+	struct page *page;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
-		goto out;
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(*pmd, page, vma, flags))
+		return NULL;
 
 	/* Avoid dumping huge zero page */
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1378,10 +1405,7 @@ struct page *follow_trans_huge_pmd(struc
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
-		goto out;
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
@@ -1398,7 +1422,6 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
-out:
 	return page;
 }
 